Actor-Critic with Teacher Ensembles

Motivation

The exploration mechanism used by a deep Reinforcement Learning (RL) agent plays a key role in determining its sample efficiency. Since collecting real-world experience is slow and expensive, improving over random exploration is crucial for solving long-horizon tasks with sparse rewards. While Imitation Learning (IL) can help alleviate this problem, collecting large amounts of useful expert experience is itself difficult and costly for long-horizon tasks. We argue that, in domains like robotics, a useful alternative is to encode knowledge into an ensemble of heuristic solutions (controllers, planners, previously trained policies, etc.) that each address part of the task.

We propose to leverage advice from this ensemble of partial solutions by treating its members as teachers that guide the agent's exploration with suggested actions to execute. We make minimal assumptions about the quality of such teacher sets; in particular, a teacher set can have the following attributes:

    • partial: Individual teachers might only offer useful advice in certain states
    • contradictory: Teachers might offer advice that contradicts other teacher suggestions
    • insufficient: There might be states where no teacher offers useful advice, and the agent needs to learn optimal behavior from scratch

Figure: Visual representation of teacher ensemble attributes, with arrows representing actions and line color representing different teacher policies. In this figure, each example trajectory has the attributes of all the boxes it is contained within. Italicized terms apply to single policies, and non-italicized terms refer to sets of policies.

Contributions

  • We formalize a collection of teacher attributes that comprehensively characterizes the quality of a teacher set for guiding agent training.
  • We propose Actor-Critic with Teacher Ensembles (AC-Teach), a policy learning framework that leverages advice from multiple teachers and is robust to low-quality teacher ensembles: ensembles that contain partial or contradictory teachers and that may be insufficient for solving the task.
  • We demonstrate that AC-Teach is able to leverage such teacher ensembles to solve multi-step tasks while significantly improving sample efficiency over baselines.
  • Furthermore, we show that AC-Teach not only learns successful policies from low-quality teacher sets but also surpasses baselines when using higher-quality teachers, hence providing a unified algorithmic solution for a broad range of teacher attributes.

Overview

1. Provide a set of heuristic-based partial solutions to a task of interest.

Teacher 1: Hook-Grasp

A teacher that tries to move to the hook handle and grasp it.

Teacher 2: Hook-Sweep

A teacher that tries to move the arm back in a sweeping motion, without assuming that the hook is grasped or that the cube is aligned with the hook.

2. Run AC-Teach using the set of provided solutions to train an agent to solve the task.

The agent must learn to overcome suboptimalities in the teachers, and it must also learn completely new behaviors for parts of the task where no teacher in the set offers useful advice.

In this example, the arm must learn to position the hook with respect to the cube from scratch.

At test time, only the agent policy is evaluated, so the agent must also learn from the appropriate teachers how to accomplish the parts of the task that they can solve. In this way, the agent simultaneously performs imitation and reinforcement learning to solve the task.

Algorithm Overview
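AC-Teach trains an off-policy actor-critic agent while a behavioral policy decides, at every timestep, whether to execute the agent's own action or one of the teachers' suggestions; all of the resulting experience is used to update the agent, and only the agent policy is kept at test time. Below is a minimal sketch of one way such a per-timestep choice could be made, assuming a Bayesian critic that can return posterior samples of Q-values (for example, via dropout). The helper names (agent, teachers, critic_q_samples) are illustrative placeholders, not the actual implementation.

```python
import numpy as np

def behavioral_action(obs, agent, teachers, critic_q_samples):
    """Choose which policy to execute at this timestep via posterior sampling.

    Illustrative sketch only: agent(obs) and each teacher(obs) propose a
    candidate action, and critic_q_samples(obs, action, n) returns n samples
    from a Bayesian critic's posterior over Q(obs, action).
    """
    sources = [agent] + list(teachers)            # index 0 is the learning agent
    candidates = [source(obs) for source in sources]

    # Thompson-sampling-style choice: draw one posterior sample of Q per
    # candidate and execute the candidate with the highest sampled value, so
    # uncertain teachers still get tried while clearly bad advice is ignored.
    sampled_q = [float(critic_q_samples(obs, a, n=1)[0]) for a in candidates]
    chosen = int(np.argmax(sampled_q))
    return candidates[chosen], chosen
```

Regardless of which source produced the executed action, the transition is stored and the agent is trained off-policy on it, which is why only the agent policy is needed at evaluation time.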

Environments and Teachers

1. Pick and Place Task





The goal is to pick the cube up and place it at the goal location.

Teacher 1: Pick

Teacher that tries to move to the cube and grasp it.

Teacher 2: Place

Teacher that tries to take a parabolic path to move the arm to the goal and release the cube. It is agnostic to whether the cube is grasped or not.
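To make the notion of a scripted partial teacher concrete, here is a hedged sketch of how the Pick and Place teachers could be written as simple state-conditioned controllers. The observation fields (gripper_pos, cube_pos, goal_pos), the gains, and the gripper action convention are illustrative assumptions, not the environment's actual interface.

```python
import numpy as np

# Action convention assumed for this sketch: (dx, dy, dz, gripper),
# where gripper > 0 closes the fingers and gripper < 0 opens them.

def pick_teacher(obs):
    """Partial teacher: move the gripper toward the cube and close on it.

    Only offers useful advice before the cube is grasped; afterwards it keeps
    suggesting the same motion.
    """
    to_cube = obs["cube_pos"] - obs["gripper_pos"]
    gripper = 1.0 if np.linalg.norm(to_cube) < 0.03 else -1.0
    return np.concatenate([2.0 * to_cube, [gripper]])

def place_teacher(obs):
    """Partial teacher: follow a crude arcing path to the goal and release.

    Agnostic to whether the cube is actually grasped, so executing it at the
    wrong time wastes the episode.
    """
    to_goal = obs["goal_pos"] - obs["gripper_pos"]
    lift = np.array([0.0, 0.0, max(0.0, 0.15 - obs["gripper_pos"][2])])  # arc upward first
    gripper = -1.0 if np.linalg.norm(to_goal) < 0.03 else 1.0            # release near the goal
    return np.concatenate([2.0 * to_goal + 3.0 * lift, [gripper]])
```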

2. Hook Sweep Task





The goal is to use the hook to sweep the cube into the goal region. Notice that the cube is always far from the robot, so the robot must use the hook to manipulate the cube.

Teacher 1: Hook-Grasp

Teacher that tries to move to the hook handle and grab it.

Teacher 2: Hook-Position

Teacher that assumes the hook has already been grasped and tries to move the arm to a position from which the hook could sweep the cube into the goal. It does not check whether the hook is actually grasped, and tries to position the arm regardless.

Teacher 3: Hook-Sweep

Teacher that executes a sweeping motion to try to sweep the cube into the goal. It only makes task progress if the arm is holding the hook and the hook has been positioned properly, but it executes the sweeping motion regardless.

3. Path Following Task





The goal is to follow a specific path through four waypoints located at the corners of a square. The waypoints must be visited in a particular order that is sampled randomly at the beginning of every episode.

Every waypoint has a corresponding partial teacher that knows how to reach that waypoint. The above video shows all teachers sequenced in the correct order.
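A sketch of what one such waypoint teacher could look like: each teacher is a simple proportional controller toward its own waypoint and knows nothing about the sampled visitation order. The observation field pos and the noise level are illustrative assumptions.

```python
import numpy as np

def make_waypoint_teacher(waypoint, gain=1.0, noise_std=0.05):
    """Build a partial teacher that only knows how to reach one waypoint."""
    waypoint = np.asarray(waypoint, dtype=np.float64)

    def teacher(obs):
        # Noisy velocity command pointing from the agent's position to the waypoint.
        direction = waypoint - obs["pos"]
        return gain * direction + np.random.normal(0.0, noise_std, size=direction.shape)

    return teacher

# One teacher per corner of the square; since the visitation order is resampled
# every episode, no single teacher (or fixed sequence of teachers) is sufficient.
corners = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
teachers = [make_waypoint_teacher(c) for c in corners]
```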

Learned Qualitative Agent Behaviors

Path Following

The agent mostly learns to follow optimal, straight-line paths to the waypoints, despite the teachers having noisy actions and exhibiting imperfect behavior at training time.

Pick and Place

Notice how the agent learned to grasp the cube and then slide it to the goal, even though the Place teacher actually lifts the cube up and tries to execute a parabolic arc to the goal (see the Place teacher above). The agent can learn to exhibit behavior that is different from that of the teachers in order to maximize task performance.

Hook Sweep

This agent learned to recognize situations where it needs to use the hook to sweep the cube forward to the goal, and where it needs to use the hook to pull the cube back to the goal. It also doesn't bother hanging on to the hook when it realizes that the task has been completed (it's our agent's version of a mic drop).

Quantitative Agent Performance

The above animations show the performance of agents trained with AC-Teach and a sufficient set of partial, noisy teachers. Below, we present quantitative plots of agent performance throughout training. Note that AC-Teach exhibits faster and better convergence in each environment.

Path Following

Pick and Place

Hook Sweep

Why is leveraging low-quality teacher ensembles difficult?

Some teachers are only useful in certain regions of the state space, others may not be useful at all, and some might contradict other teachers. Here are some examples of what happens when a behavioral policy selects the wrong teacher to listen to. These examples lend insight into the challenges that the AC-Teach behavioral policy successfully overcomes.

Behavioral policies can be sensitive to contradictory teachers

When a behavioral policy has no notion of commitment, the experience it collects can be of low quality due to teachers that offer contradictory advice. Below, we present evidence that baseline behavioral policies suffer from this shortcoming, while the AC-Teach behavioral policy is able to make effective use of the teachers.

Behavioral policies can be sensitive to partial teachers

When a behavioral policy overcommits to its choice of policy, the experience it collects can be of low quality due to teachers that are partial. Below, we present evidence that baseline behavioral policies suffer from this shortcoming, while the AC-Teach behavioral policy is able to make effective use of the teachers.

Pick and Place

Baselines are sensitive to Contradictory Teachers

In the example above, the behavioral policy picks among the Pick teacher, the Place teacher, and the agent uniformly at random, leading to indecision between picking up the cube and moving to the goal.

Baselines are sensitive to Partial Teachers

In the example above, the behavioral policy selects a policy and naively executes it for several timesteps. Although the Pick teacher allows the behavioral policy to grasp the cube, little else happens during the episode, since the behavioral policy does not realize that the Pick teacher is not useful when the cube is grasped. Similarly, the Place teacher is executed at the wrong time, when the cube has not been grasped yet.

AC-Teach Behavioral Policy is insensitive to both

The AC-Teach behavioral policy is able to commit to policy selections for appropriate time scales, allowing for improved exploration guided by the teachers. In the example above, the behavioral policy switches its policy choice once the Pick teacher has helped it grasp the cube, resulting in a successful episode of interaction. This provides useful experience for training the agent.
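One way to picture this commitment behavior is as a per-step check of whether the currently followed teacher still looks better than the agent under the critic's uncertainty. The sketch below illustrates such a switching rule; it reuses the illustrative critic_q_samples helper from the algorithm sketch above and is not necessarily the exact rule used by AC-Teach.

```python
import numpy as np

def should_stay_committed(obs, agent, committed_teacher, critic_q_samples,
                          n_samples=20, keep_prob=0.5):
    """Decide whether to keep following the currently selected teacher.

    Illustrative rule: stay committed while the Bayesian critic still believes
    the teacher's suggestion is at least as good as the agent's own action in
    the current state; otherwise hand control back and re-select. This avoids
    both thrashing between contradictory teachers and overcommitting to a
    teacher that has already finished the part of the task it is useful for.
    """
    q_teacher = np.asarray(critic_q_samples(obs, committed_teacher(obs), n=n_samples))
    q_agent = np.asarray(critic_q_samples(obs, agent(obs), n=n_samples))
    return float(np.mean(q_teacher >= q_agent)) > keep_prob
```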

Hook Sweep

Baselines are sensitive to Contradictory Teachers

In the example above, the behavioral policy picks uniformly at random between the teachers and the agent, leading to indecision between picking up the hook and positioning the arm in a good location for sweeping.

Baselines are sensitive to Partial Teachers

In the example above, the behavioral policy selects a policy and naively executes it for several timesteps. Although the Hook-Grasp teacher allows the behavioral policy to grasp the hook, little else happens during the episode, since the behavioral policy does not realize that the Hook-Grasp teacher is not useful when the hook is grasped. Similarly, the Hook-Sweep teacher is executed at the wrong time, when the hook has neither been grasped nor positioned.

AC-Teach Behavioral Policy is insensitive to both

In the example above, the behavioral policy is able to switch its policy choice when the Hook-Grasp teacher has helped it grasp the hook, allowing for improved exploration, and resulting in a successful episode of interaction. This provides useful experience for training the agent.