Imitation learning (IL) algorithms typically distill demonstrations into parametric policies to mimic expert behavior. However, with limited data and partial observability, such as in egocentric mobile manipulation, existing methods often struggle to generate accurate actions. To address these challenges, we propose ReMoBot, a few-shot, trajectory-conditioned imitation learning framework that directly Retrieves information from demonstrations to solve Mobile manipulation tasks with egocentric visual observations. Leveraging vision foundation models, ReMoBot identifies relevant expert demonstrations by combining state-level similarity, history-aware trajectory alignment, and action-sequence consistency to disambiguate perceptually similar observations. The agent then selects appropriate control commands based on these retrieved demonstrations in a fully training-free manner. We evaluate ReMoBot on three mobile manipulation tasks using a Boston Dynamics Spot robot in both simulation and real-world settings. After benchmarking five approaches in simulation, we compare our method with two baselines trained directly on real-world data without sim-to-real transfer. With only 20 demonstrations per task, ReMoBot outperforms the baselines, achieving high success rates on Table Uncover (70%) and Gap Cover (80%), while also showing promising performance on the more challenging Curtain Open task in the real-world setting. Furthermore, ReMoBot generalizes across varying robot positions, object sizes, and material properties, highlighting its robustness in real-world deformable mobile manipulation.
In this work, we propose ReMoBot, a learning-free retrieval-based imitation method designed to efficiently solve mobile manipulation tasks with few expert demonstrations. To achieve this, we outline three main steps: 1) retrieval dataset generation, which creates a dataset by extracting visual features from the demonstrations using a vision foundation model (VFM)-based perception module; 2) the retrieval process, in which the agent identifies expert observations similar to the current one and selects candidate trajectories based on the robot's executed trajectory; and 3) behavior retrieval, in which the agent refines the retrieved behavior candidates to find the appropriate action for execution.
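The retrieval steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature extractor interface, the distance metric (Euclidean over VFM embeddings), and all function names are assumptions, and the action-sequence consistency check is omitted for brevity.

```python
import numpy as np

def extract_features(obs, vfm):
    """Encode an egocentric observation with a vision foundation model.

    `vfm` is any callable mapping an image to an embedding vector
    (e.g., a DINO-style encoder); its exact form is an assumption here.
    """
    return vfm(obs)

def retrieve_action(history_feats, demos):
    """Select a control command by matching the robot's executed
    trajectory against expert demonstrations (history-aware alignment).

    history_feats : list of feature vectors for the robot's recent observations
    demos         : list of (features, actions) pairs per expert trajectory,
                    where features[t] aligns with actions[t]
    """
    h = len(history_feats)
    best_score, best_action = np.inf, None
    for feats, actions in demos:
        # Slide a window of length h over each demonstration and score
        # the alignment by the summed feature distance to the history.
        for t in range(h - 1, len(feats)):
            window = feats[t - h + 1 : t + 1]
            score = sum(np.linalg.norm(a - b)
                        for a, b in zip(window, history_feats))
            if score < best_score:
                best_score, best_action = score, actions[t]
    return best_action
```

A full version would additionally filter the candidate windows by action-sequence consistency before committing to a command, as described in step 3.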
We designed three mobile manipulation tasks in real-world and simulation settings to evaluate the performance of ReMoBot. The details of each task are presented below.
Figure: Observation from camera.
Learning mobile manipulation skills for complex tasks, such as mobile manipulation under partial observability, from a few demonstrations is a challenging problem. This work introduced ReMoBot, a few-shot imitation learning framework that leverages a retrieval strategy with visual similarity constraints to solve tasks without additional training. ReMoBot integrates a vision foundation model as a feature extractor with trajectory-aware action identification, enabling training-free imitation of expert demonstrations even under partial observability. To evaluate ReMoBot, we designed three real-world mobile manipulation tasks involving deformable fabrics with the Boston Dynamics Spot robot. Across all tasks, ReMoBot consistently outperformed both learning-based and retrieval-based baselines, effectively acquiring manipulation skills from a limited dataset. Furthermore, ReMoBot demonstrated generalization to varying environmental conditions, including robot initial position, object size, and material. Moving forward, extending ReMoBot with explicit mechanisms for collision handling and incorporating online fine-tuning strategies could further enhance its adaptability and safety during deployment, while preserving data efficiency and generalizability.