Belief-Grounded Networks for Accelerated Robot Learning under Partial Observability

Hai Nguyen*, Brett Daley*, Xinchao Song, Christopher Amato, Robert Platt

Northeastern University, Boston, USA

Conference on Robot Learning (CoRL), 2020

Email: nguyen.hai1@northeastern.edu

Abstract

Many important robotics problems are partially observable in the sense that a single visual or force-feedback measurement is insufficient to reconstruct the state. Standard approaches involve learning a policy over beliefs or observation-action histories. However, both of these have drawbacks; it is expensive to track the belief online, and it is hard to learn policies directly over histories. We propose a method for policy learning under partial observability called the Belief-Grounded Network (BGN) in which an auxiliary belief-reconstruction loss incentivizes a neural network to concisely summarize its input history. Since the resulting policy is a function of the history rather than the belief, it can be executed easily at runtime. We compare BGN against several baselines on classic benchmark tasks as well as three novel robotic force-feedback tasks. BGN outperforms all other tested methods and its learned policies work well when transferred onto a physical robot.

Full text available at: arxiv.org/abs/2010.09170

Code: https://github.com/hai-h-nguyen/belief-grounded-network

CoRL presentation: www.youtube.com/watch?v=GsJMt--ZARQ

Method


In the above figure, we illustrate BGN by combining it with a standard actor-critic (A2C) agent that takes histories (past actions and observations) as input. Assuming that the actor and critic do not share parameters, we add one BGN head to each of the two separate feature extractors. Black components represent a standard dual-network A2C agent, while blue components indicate the additional network heads for reconstructing the belief. FC stands for a fully-connected layer (a linear combination). We assume that observations and actions are discrete, and therefore use softmax activation functions for the policy distribution (Distr.) and the reconstructed beliefs. In continuous environments, the softmax can be replaced by other families of distributions. The two added branches help the networks learn features of the history that are useful for the given task. At deployment time, they can be removed, leaving only the actor, which chooses actions using the history alone as input.
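
As a rough illustration of this architecture, the sketch below shows a history-based actor with the extra belief-reconstruction head in PyTorch. The GRU encoder, layer sizes, and all names here are illustrative assumptions rather than the released implementation; the critic is analogous, with a value head in place of the policy head.

```python
# Hedged sketch of the actor network with a belief-reconstruction (BGN) head,
# assuming discrete observations/actions and a recurrent history encoder.
# Layer sizes and names are illustrative, not taken from the paper's code.
import torch
import torch.nn as nn


class ActorWithBGN(nn.Module):
    def __init__(self, obs_dim, act_dim, belief_dim, hidden_size=128):
        super().__init__()
        # History encoder: consumes the sequence of (observation, previous action) features.
        self.gru = nn.GRU(obs_dim + act_dim, hidden_size, batch_first=True)
        # Standard policy head (black components in the figure).
        self.policy_head = nn.Linear(hidden_size, act_dim)
        # Additional belief-reconstruction head (blue components in the figure).
        self.belief_head = nn.Linear(hidden_size, belief_dim)

    def forward(self, history, h0=None):
        # history: (batch, time, obs_dim + act_dim) one-hot features
        features, hN = self.gru(history, h0)
        summary = features[:, -1]  # feature summary of the history so far
        pi = torch.softmax(self.policy_head(summary), dim=-1)          # action distribution
        belief_hat = torch.softmax(self.belief_head(summary), dim=-1)  # reconstructed belief
        return pi, belief_hat, hN
```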

By reconstructing the belief during simulated training, the agent learns to overcome uncertainty in its environment. Most notably, the BGN does not require beliefs during execution, making the trained policies amenable to physical systems. When transferred to a real-world robot, these policies completed all of our proposed manipulation tasks without any fine-tuning or other adjustments. Our work focused on discrete-state environments, but future work could extend our method to continuous-state tasks; for example, the belief update could be approximated with a particle filter or a Kalman filter.
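
As a hedged sketch of how the auxiliary objective can enter training, the reconstructed belief is pushed toward the ground-truth belief supplied by the simulator with a cross-entropy term added to the usual A2C actor and critic losses. The loss coefficients (0.5 and aux_weight) and the exact functional form below are illustrative assumptions; see the paper and code for the precise formulation.

```python
# Hedged sketch of the combined training objective, assuming the simulator
# provides a ground-truth belief b_true (batch, belief_dim) during training.
# The weighting coefficients are illustrative, not values from the paper.
def bgn_loss(log_pi_a, advantage, value, value_target, belief_hat, b_true,
             aux_weight=1.0, eps=1e-8):
    policy_loss = -(log_pi_a * advantage.detach()).mean()            # A2C actor term
    value_loss = (value - value_target.detach()).pow(2).mean()       # A2C critic term
    # Belief-grounding term: cross-entropy between the ground-truth belief
    # and the network's reconstructed belief.
    belief_loss = -(b_true * (belief_hat + eps).log()).sum(-1).mean()
    return policy_loss + 0.5 * value_loss + aux_weight * belief_loss
```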

Experiments

We compare our method, a standard A2C agent that takes in histories augmented with BGN (denoted Ah-Ch + BGN), against the plain Ah-Ch agent (the same agent without BGN) and the following baselines:

  • Ah-Cs: Asymmetric Actor-Critic where the critic uses ground-truth states provided by the simulator, while the actor uses observation-action histories with a recurrent network.

  • Ah-Cb: An asymmetric agent similar to Ah-Cs, but the critic uses beliefs instead of states. Theoretically, this should reduce bias in the case where two different observation-action histories arrive at the same state.

  • Ab-Cb: Both the actor and the critic accept the belief as input; because the policy depends on the belief, it cannot easily be executed on a robot (a sketch of the belief update these belief-based methods rely on follows this list).

  • SARSOP: One of the leading offline POMDP planning methods that uses beliefs. This offers an approximate idea of the best attainable performance for each environment.

  • Random: An untrained agent that selects actions from a uniform distribution over the action space.
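
For reference, the belief used by Ah-Cb, Ab-Cb, and SARSOP is the posterior distribution over states given the action-observation history; in a discrete POMDP it can be tracked exactly with a Bayes filter. The sketch below is a generic implementation under assumed transition and observation arrays (the names T and O are illustrative, not from the paper's code), and it shows why tracking the belief online can be expensive: each update touches every pair of states.

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """One exact Bayes-filter step for a discrete POMDP.

    b: (S,) current belief over states
    a: action index, o: observation index
    T: (S, A, S) transition probabilities  T[s, a, s']
    O: (S, A, O) observation probabilities O[s', a, o]
    """
    predicted = b @ T[:, a, :]             # predict: propagate belief through the dynamics
    unnormalized = predicted * O[:, a, o]  # correct: weight by the observation likelihood
    return unnormalized / unnormalized.sum()
```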

Classical POMDP domains

We begin by testing our method in four classic POMDP domains that are challenging due to their stochasticity and partial observability. They are Hallway, Hallway-2, RockSample[4x4], and RockSample[5x5].

We plot the mean learning curves (100-episode moving average) of all methods over 10 random seeds, with standard deviations shaded, in the figure on the left. Across all domains, we observe that the methods split into two groups according to their performance.

  • In the better-performing group, Ah-Ch + BGN and Ab-Cb perform similarly in each environment, matching SARSOP's performance in the Hallway domains and coming close to it in the RockSample domains.

  • In the other group, the performance order is inconsistent; while Ah-Cb's use of the belief helps it outperform Ah-Cs in Hallway and Hallway-2, it appears to not have a significant effect in the RockSample domains.

  • Additionally, despite not having access to any privileged information, Ah-Ch is able to match Ah-Cb and Ah-Cs on three of the four environments. This suggests that asymmetric architectures may have limited benefits under extreme levels of partial observability.

Force-feedback Robot Domains

Next, we test on three POMDP robot domains:

  • TopPlate (Left): The agent must locate and grasp the top plate of a stack whose height is randomly chosen from 1 to 10 and is initially unknown to the agent. It locates the top plate using only finger-position observations while the finger is in compliant mode; the finger is positioned so that it touches the plates as it moves up and down. When ready, the agent can execute a command to grasp the plate adjacent to the finger. The episode terminates when a grasp action is performed.

  • TwoBumps-1D (Middle): Two movable bumps rest on a table, with the robot's finger moving along a horizontal line above them. The initial positions of the finger and the two bumps are randomized uniformly while preserving the bumps' left-right order. The agent's goal is to push the rightmost bump to the right without disturbing the left bump. There are four action combinations: move left or right, each with a compliant or stiff finger. This task is challenging because the agent does not initially know which bump is rightmost; it must touch both bumps to find out. Because the finger's motion is constrained to one dimension, it cannot miss the bumps, but the robot must relax its finger's stiffness when passing a bump to avoid pushing the wrong one. The episode ends as soon as either bump moves. (A gym-style sketch of this interface follows the list.)

  • TwoBumps-2D (Right): Two bumps of different sizes are randomly positioned on a 4x4 grid. The robotic finger is constrained to move in a plane above the bumps. The finger can be moved in any of the four directions or perform a grasp. The agent must make contact with each bump at least once and then grasp the larger bump to complete this task successfully, inferring the bumps' relative sizes from the angular displacement of the finger. The episode ends after a grasp is executed.
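
To make the TwoBumps-1D interface concrete, the gym-style skeleton below mirrors the action and observation structure described above (move left/right with a compliant or stiff finger, observe only the finger position, terminate when a bump moves). The discretization, reward values, and all names are illustrative assumptions, not the environments released with the paper.

```python
import numpy as np

# Illustrative skeleton of a TwoBumps-1D-style environment; the discretization,
# reward values, and contact model are assumptions, not the paper's released code.
class TwoBumps1DSketch:
    N_CELLS = 10  # assumed discretization of the horizontal line
    # Actions: every combination of direction and finger stiffness.
    ACTIONS = [("left", "compliant"), ("left", "stiff"),
               ("right", "compliant"), ("right", "stiff")]

    def reset(self):
        # Randomize finger and bump positions while preserving the bumps' order.
        left, right = sorted(np.random.choice(self.N_CELLS, 2, replace=False))
        self.bumps = [int(left), int(right)]
        self.finger = int(np.random.randint(self.N_CELLS))
        return self._observe()

    def _observe(self):
        # The agent observes only its finger position; contacts are sensed
        # indirectly through the compliant finger's deflection.
        return self.finger

    def step(self, action_idx):
        direction, stiffness = self.ACTIONS[action_idx]
        self.finger += 1 if direction == "right" else -1
        self.finger = int(np.clip(self.finger, 0, self.N_CELLS - 1))
        # A stiff finger that reaches a bump pushes it and ends the episode.
        pushed_right = stiffness == "stiff" and self.finger == self.bumps[1]
        pushed_left = stiffness == "stiff" and self.finger == self.bumps[0]
        done = pushed_left or pushed_right
        reward = 1.0 if pushed_right else (-1.0 if pushed_left else 0.0)
        return self._observe(), reward, done, {}
```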

We plot the success rates (100-episode moving average) of the methods, averaged over 10 random seeds with standard deviations shaded, in the above figure. Ah-Ch + BGN is the only agent that achieves a perfect success rate in all tasks. Ab-Cb does significantly worse, in surprising contrast to its performance on the classic POMDP domains. Nevertheless, Ab-Cb still outperforms Ah-Cs, Ah-Cb, and Ah-Ch, which all exhibit similarly poor performance. These methods appear unable to learn meaningful control policies for these tasks, especially in TopPlate and TwoBumps-2D, and their learning progress is highly unstable compared to Ah-Ch + BGN and Ab-Cb.

Hardware Evaluation

We transferred the policies learned in the three robot domains to a real robot: a 2-DoF gripper mounted on a UR5e arm, with an impedance controller modulating the compliance of the finger. Visualizations of the policies and the accompanying video are shown below.

  • TwoBumps-1D: The compliant finger (yellow circle) unconditionally moves right at the beginning of the episode. There are three possible cases.

    • The finger encounters the first bump, becomes rigid after passing it, and then pushes the second bump to accomplish the goal.

    • Similar to the first case but the finger reaches the extremity of its motion range before finding the second bump. The agent realizes that the first bump must therefore be the target bump. It relaxes and backtracks to the left until it passes the bump again, stiffens, and then returns right to accomplish the goal.

    • The finger does not initially encounter any bump. It remains compliant and backtracks to the left until it passes a bump, stiffens, and then returns right to accomplish the goal.

  • TwoBumps-2D: The soft finger (yellow circle) explores the gridworld efficiently by following a non-intersecting path (green arrows) until it locates both the small and large bumps (white and black circles, respectively). When the agent encounters the second bump, there are two cases.

    • The bump is the large one and the agent grasps it immediately (t=5).

    • The bump is the small one; the agent traverses the shortest path back to the larger bump and grasps it (t=17).

  • TopPlate: There are two cases.

    • When the number of plates is fewer than 10, the finger (yellow circle) moves upward until no plate is felt, then moves down one step to grasp the top plate.

    • When the number of plates is 10, the agent discovers a shortcut: it grasps the tenth plate immediately after detecting it, indicating that it has learned to count the contact events it has experienced.

Belief Visualization

In the figure above, we visualize the belief of our agent (top row) and the ground-truth belief (second row) at various times during execution in TwoBumps-1D. The left bump (red dot), the right bump (green dot), and the agent (yellow cross) move diagonally along the permissible locations (white dots). The colour of each cell represents the probability the agent assigns to the two bumps being at that cell. A notable moment occurs at t=2, when the agent (yellow cross) is on top of the left bump (red). After passing the bump and observing a dramatic change in the finger angle, the agent becomes certain about the position of one bump, and its belief collapses to a single bounded horizontal line at t=3.
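
A hedged matplotlib sketch of how such a side-by-side comparison can be rendered, assuming the reconstructed and ground-truth beliefs are available as 2-D arrays over the joint bump-position grid (the array layout and function name are illustrative):

```python
import matplotlib.pyplot as plt

def plot_beliefs(predicted, ground_truth, t):
    """Render the reconstructed and ground-truth beliefs side by side as heatmaps.

    predicted, ground_truth: 2-D arrays in which cell (i, j) holds the probability
    assigned to the left bump being at position i and the right bump at position j.
    """
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    titles = ("reconstructed belief", "ground-truth belief")
    for ax, belief, title in zip(axes, (predicted, ground_truth), titles):
        im = ax.imshow(belief, vmin=0.0, vmax=1.0, cmap="viridis")
        ax.set_title(f"{title} (t={t})")
        ax.set_xlabel("right bump position")
        ax.set_ylabel("left bump position")
    fig.colorbar(im, ax=axes.ravel().tolist())
    plt.show()
```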