Learning while Sleeping:
Integrating Sleep-Inspired Consolidation with Human Feedback Learning
Imene Tarakli, Alessandro Di Nuovo
Sheffield Hallam University
Accepted as an Oral Presentation at ICDL 2024
Best Paper Honourable Mention
Sleep plays a vital role in developmental learning. It allows the brain to consolidate daily learning experiences by replaying the memories accumulated throughout the day. In this work, we take inspiration from sleep and propose the Inverse Forward Offline Reinforcement Model (INFORM), a scalable framework that first learns a set of behaviours from human evaluative feedback, then consolidates the learning by applying offline inverse reinforcement learning to the memorised trajectories. Experimental results demonstrate that INFORM is a feedback-efficient method that effectively learns an optimal policy aligned with the human's intended behaviour. A comparative analysis shows that the learnt policies are robust to changes in the environment's dynamics and that the recovered rewards allow the robot to learn autonomously.
We present the INverse Forward Offline Reinforcement Model (INFORM), a framework that scalably learns generalised policies and reward functions from human feedback. The model consists of two phases (a minimal sketch follows the list below):
A Forward model: In this initial phase, we use myopic interactive RL based on human evaluative feedback to train a preliminary, low-level policy.
An Offline Inverse model: Subsequently, we revisit all the trajectories generated by the previous phase and apply non-myopic offline IRL to derive a policy and a reward function that more accurately capture the task's high-level objectives.
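The sketch below illustrates the two-phase data flow on a toy 5x5 gridworld: a myopic, TAMER-style forward phase learns a feedback model H(s, a) and stores the resulting trajectories, and an offline phase then consolidates those memories into a state reward. The environment, the simulated teacher, and the simple feature-matching update in offline_inverse_phase are illustrative stand-ins chosen to keep the example self-contained; they are not the paper's implementation, which uses a non-myopic offline IRL algorithm.

```python
import numpy as np

# Minimal sketch of INFORM's two phases on a toy tabular task. Sizes, the toy
# environment, the simulated teacher and the IRL stand-in are illustrative only.
N_STATES, N_ACTIONS = 25, 4

class ToyGrid:
    """5x5 grid, start at state 0, goal at state 24, deterministic moves."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        r, c = divmod(self.s, 5)
        dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
        self.s = min(max(r + dr, 0), 4) * 5 + min(max(c + dc, 0), 4)
        return self.s, self.s == 24

def simulated_teacher(s, a, s_next):
    """Stand-in for live human feedback: +1 if the step moved closer to the goal."""
    dist = lambda x: abs(x // 5 - 4) + abs(x % 5 - 4)
    return 1.0 if dist(s_next) < dist(s) else -1.0

def forward_phase(env, feedback_fn, episodes=200, alpha=0.2, eps=0.2, max_steps=50):
    """Phase 1 (forward model): myopic TAMER-style learning of a feedback model H."""
    H = np.zeros((N_STATES, N_ACTIONS))
    trajectories = []                                  # memories consolidated offline later
    for _ in range(episodes):
        s, traj = env.reset(), []
        for _ in range(max_steps):
            a = np.random.randint(N_ACTIONS) if np.random.rand() < eps else int(np.argmax(H[s]))
            s_next, done = env.step(a)
            H[s, a] += alpha * (feedback_fn(s, a, s_next) - H[s, a])   # myopic: no bootstrapping
            traj.append((s, a, s_next))
            s = s_next
            if done:
                break
        trajectories.append(traj)
    return H, trajectories

def offline_inverse_phase(trajectories, iters=100, lr=0.1):
    """Phase 2 (offline inverse model): consolidate the stored trajectories into a
    state reward. A crude max-entropy-style feature-matching update over one-hot
    state features stands in for the paper's non-myopic offline IRL."""
    visits = np.zeros(N_STATES)
    for traj in trajectories:
        for _, _, s_next in traj:
            visits[s_next] += 1
    visits /= visits.sum()
    r = np.zeros(N_STATES)
    for _ in range(iters):
        expected = np.exp(r - r.max())
        expected /= expected.sum()
        r += lr * (visits - expected)                  # raise reward where demonstrations go
    return r

env = ToyGrid()
H, memories = forward_phase(env, simulated_teacher)
recovered_reward = offline_inverse_phase(memories)
print("recovered reward at goal vs. start:", recovered_reward[24], recovered_reward[0])
```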
We evaluate the robustness of INFORM on Pusher-v4. We train both INFORM and TAMER on the original dynamics of the environment, then evaluate both models on a perturbed version of the task in which an obstacle is introduced along the optimal trajectory. The high-level policy obtained with INFORM moves the object significantly closer to its target than the low-level policy obtained with TAMER.
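As a sketch of this evaluation protocol only, the snippet below rolls out a policy in Gymnasium's Pusher-v4 and reports the mean final object-to-goal distance, the comparison metric described above. The perturbed variant with the obstacle would in practice be a custom environment built from a modified MuJoCo model (not shown), and a random policy stands in for the trained INFORM and TAMER policies so the code runs as-is; the observation indices for the object and goal positions are an assumption based on the standard Pusher observation layout.

```python
import numpy as np
import gymnasium as gym

def mean_final_distance(env, policy, episodes=20):
    """Roll out a policy and report the mean final object-to-goal distance.
    Assumes Gymnasium's Pusher observation layout: object centre at obs[17:20],
    goal position at obs[20:23]."""
    finals = []
    for _ in range(episodes):
        obs, _ = env.reset()
        while True:
            obs, _, terminated, truncated, _ = env.step(policy(obs))
            if terminated or truncated:
                break
        finals.append(np.linalg.norm(obs[17:20] - obs[20:23]))
    return float(np.mean(finals))

# The perturbed task (obstacle along the optimal trajectory) would be registered
# from a modified Pusher model; the stock environment and a random policy stand
# in here. Swap in the trained INFORM and TAMER policies for the comparison.
env = gym.make("Pusher-v4")
random_policy = lambda obs: env.action_space.sample()
print("mean final object-to-goal distance:", mean_final_distance(env, random_policy))
```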
We investigate whether the reward function recovered by INFORM aligns with the teacher's intended outcome. We train TAMER and INFORM in a gridworld where the task objective is to reach the goal in as few steps as possible.
We then modify the environment by blocking the optimal trajectory with a wall. We train agents with the recovered reward from INFORM in this modified environment.
Using the reward function recovered by INFORM, agents can autonomously learn a new optimal policy in this perturbed environment, whereas TAMER agents uniformly fail to reach the goal. The TAMER policies, learnt from direct human feedback, do not account for the new wall and would require additional human guidance to adapt to this environmental change.
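A minimal sketch of this transfer setting, assuming a 5x5 gridworld: tabular Q-learning driven only by a recovered state reward, in a layout where a wall blocks the previously optimal path. The wall positions and the shape of recovered_reward (high value at the goal, small step cost elsewhere) are illustrative stand-ins for INFORM's actual output.

```python
import numpy as np

SIZE, GOAL = 5, 24
WALL = {7, 12, 17}                                  # illustrative wall blocking the old path
recovered_reward = np.full(SIZE * SIZE, -0.05)      # stand-in for INFORM's recovered reward:
recovered_reward[GOAL] = 1.0                        # small step cost, high value at the goal

def step(s, a):
    """Deterministic gridworld move; walls bounce the agent back to its state."""
    r, c = divmod(s, SIZE)
    dr, dc = [(-1, 0), (1, 0), (0, -1), (0, 1)][a]
    s_next = min(max(r + dr, 0), SIZE - 1) * SIZE + min(max(c + dc, 0), SIZE - 1)
    return s if s_next in WALL else s_next

def q_learning(reward, episodes=500, alpha=0.3, gamma=0.95, eps=0.2, max_steps=100):
    """Autonomous learning in the perturbed layout: no human feedback, only `reward`."""
    Q = np.zeros((SIZE * SIZE, 4))
    for _ in range(episodes):
        s = 0
        for _ in range(max_steps):
            a = np.random.randint(4) if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next = step(s, a)
            target = reward[s_next] + gamma * np.max(Q[s_next]) * (s_next != GOAL)
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            if s == GOAL:
                break
    return Q

# Greedy rollout: the agent routes around the wall without any new human guidance.
Q = q_learning(recovered_reward)
s, path = 0, [0]
while s != GOAL and len(path) < 30:
    s = step(s, int(np.argmax(Q[s])))
    path.append(s)
print("greedy path:", path)
```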