Learning while Sleeping:
Integrating Sleep-Inspired Consolidation with Human Feedback Learning
Imene Tarakli, Alessandro Di Nuovo
Sheffield Hallam University
Accepted as an Oral Presentation at ICDL 2024
Best Paper Honourable Mention
Sleep plays a vital role in developmental learning. It allows the brain to consolidate daily learning experiences by replaying the memories accumulated throughout the day. In this work, we take inspiration from sleep and propose the Inverse Forward Offline Reinforcement Model (INFORM), a scalable framework that first learns a set of behaviours from human evaluative feedback, then consolidates the learning by applying offline inverse reinforcement learning to the memorised trajectories. Experimental results demonstrate that INFORM is a feedback-efficient method that effectively learns an optimal policy aligned with the human's intended behaviour. A comparative analysis shows that the learnt policies are robust to changes in the environment's dynamics and that the recovered rewards allow the robot to learn autonomously.
We present the INverse Forward Offline Reinforcement Model (INFORM), a framework that scalably learns generalised policies and reward functions from human feedback. The model consists of two phases (a minimal sketch follows the list below):
A Forward model: In this initial phase, we use myopic interactive RL based on human evaluative feedback to train a preliminary, low-level policy.
An Offline Inverse model: Subsequently, we revisit all the trajectories generated by the previous phase and apply non-myopic offline IRL to derive a policy and a reward function that more accurately capture the task's high-level objectives.
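The sketch below illustrates the two-phase data flow on a toy 5x5 gridworld: a myopic, TAMER-style forward phase learns a feedback model H(s, a) and stores the resulting trajectories, and an offline phase then consolidates those memories into a state reward. The environment, the simulated teacher, and the simple feature-matching update in offline_inverse_phase are illustrative stand-ins chosen to keep the example self-contained; they are not the paper's implementation, which uses a non-myopic offline IRL algorithm.

```python
import numpy as np

# Minimal sketch of INFORM's two phases on a toy tabular task. Sizes, the toy
# environment, the simulated teacher and the IRL stand-in are illustrative only.
N_STATES, N_ACTIONS = 25, 4

class ToyGrid:
    """5x5 grid, start at state 0, goal at state 24, deterministic moves."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        r, c = divmod(self.s, 5)
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
        self.s = min(max(r + dr, 0), 4) * 5 + min(max(c + dc, 0), 4)
        return self.s, self.s == 24

def simulated_teacher(s, a, s_next):
    """Stand-in for live human feedback: +1 if the step moved closer to the goal."""
    dist = lambda x: abs(x // 5 - 4) + abs(x % 5 - 4)
    return 1.0 if dist(s_next) < dist(s) else -1.0

def forward_phase(env, feedback_fn, episodes=200, alpha=0.2, eps=0.2, max_steps=50):
    """Phase 1 (forward model): myopic TAMER-style learning of a feedback model H."""
    H = np.zeros((N_STATES, N_ACTIONS))
    trajectories = []                                  # memories consolidated offline later
    for _ in range(episodes):
        s, traj = env.reset(), []
        for _ in range(max_steps):
            a = np.random.randint(N_ACTIONS) if np.random.rand() < eps else int(np.argmax(H[s]))
            s_next, done = env.step(a)
            H[s, a] += alpha * (feedback_fn(s, a, s_next) - H[s, a])   # myopic: no bootstrapping
            traj.append((s, a, s_next))
            s = s_next
            if done:
                break
        trajectories.append(traj)
    return H, trajectories

def offline_inverse_phase(trajectories, iters=100, lr=0.1):
    """Phase 2 (offline inverse model): consolidate the stored trajectories into a
    state reward. A crude max-entropy-style feature-matching update over one-hot
    state features stands in for the paper's non-myopic offline IRL."""
    visits = np.zeros(N_STATES)
    for traj in trajectories:
        for _, _, s_next in traj:
            visits[s_next] += 1
    visits /= visits.sum()
    r = np.zeros(N_STATES)
    for _ in range(iters):
        expected = np.exp(r - r.max())
        expected /= expected.sum()
        r += lr * (visits - expected)                  # raise reward where demonstrations go
    return r

env = ToyGrid()
H, memories = forward_phase(env, simulated_teacher)
recovered_reward = offline_inverse_phase(memories)
print("recovered reward at goal vs. start:", recovered_reward[24], recovered_reward[0])
```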
We evaluate the robustness of INFORM on Pusher-v4. We train both INFORM and TAMER on the original dynamics of the environment, then evaluate both models on a perturbed version of the task in which an obstacle is introduced along the optimal trajectory. The high-level policy obtained with INFORM moves the object significantly closer to its target than the low-level policy obtained with TAMER.
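As a sketch of this evaluation protocol only, the snippet below rolls out a policy in Gymnasium's Pusher-v4 and reports the mean final object-to-goal distance, the comparison metric described above. The perturbed variant with the obstacle would in practice be a custom environment built from a modified MuJoCo model (not shown), and a random policy stands in for the trained INFORM and TAMER policies so the code runs as-is; the observation indices for the object and goal positions are an assumption based on the standard Pusher observation layout.

```python
import numpy as np
import gymnasium as gym

def mean_final_distance(env, policy, episodes=20):
    """Roll out a policy and report the mean final object-to-goal distance.
    Assumes Gymnasium's Pusher observation layout: object centre at obs[17:20],
    goal position at obs[20:23]."""
    finals = []
    for _ in range(episodes):
        obs, _ = env.reset()
        while True:
            obs, _, terminated, truncated, _ = env.step(policy(obs))
            if terminated or truncated:
                break
        finals.append(np.linalg.norm(obs[17:20] - obs[20:23]))
    return float(np.mean(finals))

# The perturbed task (obstacle along the optimal trajectory) would be registered
# from a modified Pusher model; the stock environment and a random policy stand
# in here. Swap in the trained INFORM and TAMER policies for the comparison.
env = gym.make("Pusher-v4")
random_policy = lambda obs: env.action_space.sample()
print("mean final object-to-goal distance:", mean_final_distance(env, random_policy))
```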
We investigate whether the reward function recovered by INFORM aligns with the teacher's intended outcome. We train TAMER and INFORM in a gridworld where the task objective is to reach the goal in as few steps as possible.
We then modify the environment by blocking the optimal trajectory with a wall. We train agents with the recovered reward from INFORM in this modified environment.
Using the reward function recovered by INFORM, agents can autonomously learn a new optimal policy in this perturbed environment, whereas TAMER agents uniformly fail to reach the goal. The TAMER policies, learnt from direct human feedback, do not account for the new wall and would require additional human guidance to adapt to this environmental change.
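A minimal sketch of this transfer setting, assuming a 5x5 gridworld: tabular Q-learning driven only by a recovered state reward, in a layout where a wall blocks the previously optimal path. The wall positions and the shape of recovered_reward (high value at the goal, small step cost elsewhere) are illustrative stand-ins for INFORM's actual output.

```python
import numpy as np

SIZE, GOAL = 5, 24
WALL = {7, 12, 17}                                  # illustrative wall blocking the old path
recovered_reward = np.full(SIZE * SIZE, -0.05)      # stand-in for INFORM's recovered reward:
recovered_reward[GOAL] = 1.0                        # small step cost, high value at the goal

def step(s, a):
    """Deterministic gridworld move; walls bounce the agent back to its state."""
    r, c = divmod(s, SIZE)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    s_next = min(max(r + dr, 0), SIZE - 1) * SIZE + min(max(c + dc, 0), SIZE - 1)
    return s if s_next in WALL else s_next

def q_learning(reward, episodes=500, alpha=0.3, gamma=0.95, eps=0.2, max_steps=100):
    """Autonomous learning in the perturbed layout: no human feedback, only `reward`."""
    Q = np.zeros((SIZE * SIZE, 4))
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = np.random.randint(4) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next = step(s, a)
            target = reward[s_next] + gamma * np.max(Q[s_next]) * (s_next != GOAL)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if s == GOAL:
                break
    return Q

# Greedy rollout: the agent routes around the wall without any new human guidance.
Q = q_learning(recovered_reward)
s, path = 0, [0]
while s != GOAL and len(path) < 30:
    s = step(s, int(np.argmax(Q[s])))
    path.append(s)
print("greedy path:", path)
```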