A Contextual Bandit Approach for Learning to Plan in Environments with Probabilistic Goal Configurations

Sohan Rudra*, Saksham Goel*, Anirban Santara*, Claudio Gentile*, Laurent Perron, Fei Xia, Vikas Sindhwani, Carolina Parada, Gaurav Aggarwal

Google

Abstract

Object-goal navigation (Object-nav) entails searching for, recognizing, and navigating to a target object. Object-nav has been extensively studied by the Embodied-AI community, but most solutions are restricted to static objects (e.g., television, fridge). We propose a modular framework for object-nav that is able to efficiently search indoor environments for not just static objects but also movable objects (e.g., fruits, glasses, phones) that frequently change position due to human interaction. Our contextual-bandit agent efficiently explores the environment by showing optimism in the face of uncertainty, and learns a model of the likelihood of spotting different objects from each navigable location. The likelihoods are used as rewards in a weighted minimum latency solver to compute a trajectory for the robot. We evaluate our algorithms in two simulated environments and a real-world setting, demonstrating high sample efficiency and reliability.

Figure 1: Picture of our robot (from Everyday Robots) and the target objects studied in our experiments

Video Presentation

ICRA_supplementary_video_uncompressed.mp4

Approach

1. Initialization

The robot is randomly initialized in the environment with the task of finding a given target object.

2. Sample Navigable Points

A set of reachable vantage points (green dots) is sampled across the entire environment using the current 2D occupancy map.
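As a concrete illustration, here is a minimal sketch of clearance-based sampling from an occupancy grid. The grid encoding (0 = free, 1 = occupied), the `min_clearance` radius, and uniform sampling are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch: sample collision-free vantage points from a 2D
# occupancy grid. Cell values (0 = free, 1 = occupied) and the
# `min_clearance` radius are illustrative assumptions.
import numpy as np

def sample_vantage_points(occupancy, n_points, min_clearance=3, rng=None):
    """Return up to `n_points` free cells with clearance on all sides."""
    rng = rng or np.random.default_rng()
    free = occupancy == 0
    h, w = occupancy.shape
    candidates = []
    for r in range(min_clearance, h - min_clearance):
        for c in range(min_clearance, w - min_clearance):
            window = free[r - min_clearance:r + min_clearance + 1,
                          c - min_clearance:c + min_clearance + 1]
            if window.all():                 # robot footprint fits here
                candidates.append((r, c))
    if not candidates:
        return []
    picks = rng.choice(len(candidates), size=min(n_points, len(candidates)),
                       replace=False)
    return [candidates[i] for i in picks]
```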

3. Estimate the Importance of Each Vantage Point through Exploration

A Contextual-Bandit agent estimates the likelihood of spotting the target object from each vantage point using the principle of "optimism in the face of uncertainty" for efficient exploration.

Figure 2: Example heat map of the estimated likelihoods in a simulated kitchen environment. The red numbers give the ground-truth likelihood of the goal object appearing at a random position on the surface of each piece of furniture (hatched).
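The paper models the likelihood with an MLP-based contextual bandit; as a simpler, self-contained illustration of "optimism in the face of uncertainty", here is a minimal LinUCB-style sketch, in which the linear model and the exploration weight `beta` are simplifying assumptions.

```python
# LinUCB-style sketch of "optimism in the face of uncertainty".
# The paper uses an MLP-based likelihood model; the linear model and
# the exploration weight `beta` here are simplifying assumptions.
import numpy as np

class LinUCB:
    def __init__(self, dim, beta=1.0, reg=1.0):
        self.A = reg * np.eye(dim)  # regularized design matrix
        self.b = np.zeros(dim)      # reward-weighted feature sum
        self.beta = beta

    def ucb(self, x):
        """Optimistic score: estimated likelihood + exploration bonus."""
        A_inv = np.linalg.inv(self.A)
        mean = x @ (A_inv @ self.b)                # ridge-regression estimate
        bonus = self.beta * np.sqrt(x @ A_inv @ x)
        return mean + bonus

    def update(self, x, reward):
        """reward = 1 if the target was spotted from this point, else 0."""
        self.A += np.outer(x, x)
        self.b += reward * x
```

Vantage points whose features the agent has rarely seen receive a large bonus, so the optimistic score drives the agent to visit them and refine its likelihood estimates.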

4. Weighted Minimum Latency Path Planning

A Weighted Minimum Latency Problem (WMLP) solver is used to generate an ordering of the vantage points taking into account their likelihood scores, the initial position of the robot and the geometry of the room.

Figure 3: Sample trajectory of the robot in our real-kitchen test environment. CP-SAT was used for solving the WMLP, and a multi-layer perceptron (MLP) was used in the contextual bandit for modelling the likelihood function.

The figure demonstrates that the algorithm does not plan greedily: it makes sure that the nearby table with goal likelihood 0.3 is inspected before the robot heads to the more distant table with goal likelihood 0.5. This trajectory is a direct result of weighted minimum latency path planning with the learned importance scores of the vantage points sampled around the furniture; a sketch of one possible encoding follows.
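Below is a hedged sketch of one possible CP-SAT encoding of the WMLP built on an `AddCircuit` constraint. CP-SAT is the solver named above, but this particular encoding, the integer scaling of the likelihood weights, and the function interface are assumptions for illustration.

```python
# Sketch of a weighted minimum latency solver with OR-Tools CP-SAT.
# dist[i][j] are integer travel times; weights[i] are likelihoods
# scaled to integers (e.g., 0.3 -> 30). Node 0 is the robot's start.
from ortools.sat.python import cp_model

def plan_visit_order(dist, weights, depot=0):
    n = len(dist)
    horizon = sum(max(row) for row in dist)  # upper bound on any arrival time
    model = cp_model.CpModel()
    arrival = [model.NewIntVar(0, horizon, f"t{i}") for i in range(n)]
    arcs = []
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            lit = model.NewBoolVar(f"arc_{i}_{j}")
            arcs.append((i, j, lit))
            if j != depot:
                # If the robot travels i -> j, fix j's arrival time.
                model.Add(arrival[j] == arrival[i] + dist[i][j]).OnlyEnforceIf(lit)
    model.Add(arrival[depot] == 0)
    model.AddCircuit(arcs)  # the unconstrained return arc closes the tour
    # Weighted latency: early arrivals at high-likelihood points pay off.
    model.Minimize(sum(weights[i] * arrival[i] for i in range(n) if i != depot))
    solver = cp_model.CpSolver()
    assert solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE)
    return sorted(range(1, n), key=lambda i: solver.Value(arrival[i]))
```

Under this objective, a nearby point with likelihood 0.3 can beat a distant point with likelihood 0.5, which is exactly the non-greedy behavior shown in Figure 3.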

5. Path Execution

The robot visits the vantage points in the planned order while inspecting its surroundings. As soon as it spots the object, it heads directly to it. Model Predictive Control (MPC) is used for motion planning in our experiments.
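The paper does not detail its MPC stack, so the following is only a toy random-shooting MPC sketch for a unicycle robot tracking a waypoint; the dynamics, control bounds, horizon, and quadratic cost are all assumptions.

```python
# Toy random-shooting MPC: sample control sequences, roll a unicycle
# model forward, and apply the first control of the cheapest rollout.
import numpy as np

def mpc_step(state, goal, horizon=10, samples=256, dt=0.1, rng=None):
    """state = (x, y, heading); returns the first (v, w) of the best rollout."""
    rng = rng or np.random.default_rng()
    controls = rng.uniform([-0.5, -1.0], [0.5, 1.0],
                           size=(samples, horizon, 2))
    best_cost, best_u = np.inf, np.zeros(2)
    for seq in controls:
        x, y, th = state
        cost = 0.0
        for v, w in seq:                     # forward-simulate the unicycle
            x += v * np.cos(th) * dt
            y += v * np.sin(th) * dt
            th += w * dt
            cost += (x - goal[0]) ** 2 + (y - goal[1]) ** 2
        if cost < best_cost:
            best_cost, best_u = cost, seq[0]
    return best_u                            # apply only the first control
```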

Demo

Real Office Kitchen Environment

Degenerate Case

The robot spots the object while scanning the environment from its initial position.

Action: The robot heads directly to the object.

Most Common Case

The robot does not spot the object while scanning the environment from its initial position.

Action: The robot invokes the planner, which provides a sequence of vantage points that are collision-free according to the most recent occupancy map of the environment. If the target object is spotted from any of these points, the robot heads to it.

Handling Unforeseen Obstruction

The robot cannot reach a vantage point due to obstruction by unforeseen obstacles.

Action: For safety, the robot abandons that point and heads to the next vantage point in the planned sequence.
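The three cases above can be condensed into one control-loop sketch. `scan_for`, `navigate_to`, and `is_reachable` are hypothetical placeholders for the robot's perception and navigation interfaces, not APIs from the paper.

```python
# High-level execution loop covering the three demo cases above.
# All robot methods are hypothetical placeholders.
def execute_plan(robot, vantage_points, target):
    detection = robot.scan_for(target)
    if detection:                          # degenerate case: spotted at start
        robot.navigate_to(detection.position)
        return True
    for point in vantage_points:           # most common case: follow the plan
        if not robot.is_reachable(point):  # unforeseen obstruction
            continue                       # skip to the next vantage point
        robot.navigate_to(point)
        detection = robot.scan_for(target)
        if detection:
            robot.navigate_to(detection.position)
            return True
    return False                           # target not found along the plan
```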

Modelling the Likelihood Function with a Contextual Bandit

In this section we show a sample training experiment in simulated Kitchen Environment 1 for the target object "bottle". The hatched rectangles denote tables, and the red numbers denote the ground-truth likelihood of the target object appearing at a random location on the surface of the corresponding table.

The top left panel (titled "Reward") shows how the spatial map of the estimated likelihood of spotting the target object from different locations in the room evolves over training iterations. The top right panel (titled "Uncertainty") shows the corresponding evolution of the agent's uncertainty estimate. Our contextual bandit explores using the principle of "optimism in the face of uncertainty": the uncertainty estimate is combined with the reward estimate to produce an "upper confidence" estimate (bottom left panel, titled "UCB") that guides exploration. The bottom right panel shows the agent's trajectories. The red dot denotes the agent's initial position, each blue star denotes a sampled vantage point, and the green triangle shows the location of the target object in each episode. The vantage points visited by the agent before it spots the target object are labeled with the order in which they were visited.
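Concretely, the optimistic score rendered in the "UCB" panel has the standard upper-confidence form; the exploration coefficient β below is a generic placeholder rather than the agent's exact schedule.

```latex
% Optimistic score for a vantage point with context features x:
% estimated spotting likelihood (Reward panel) plus a scaled
% uncertainty bonus (Uncertainty panel).
\mathrm{UCB}(x) \;=\; \hat{r}(x) \;+\; \beta\,\hat{\sigma}(x)
```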