Reinforcement learning has shown its strength in challenging sequential decision-making problems. The reward function is crucial to learning performance, as it serves as a measure of how far a task has been completed. In real-world problems, rewards are mostly human-designed, which requires laborious tuning and is easily affected by human cognitive biases.
To achieve automatic auxiliary reward generation, we propose a novel representation learning approach, TDRP, which measures the "transition distance" between states. Built upon these representations, we introduce an auxiliary reward generation technique for both single tasks and skill-chaining scenarios.
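The following is a minimal sketch of this idea, assuming a simple MLP encoder trained so that distances in the embedding space regress the number of environment steps between two states sampled from the same trajectory; the architecture, loss, and names (`TDRPEncoder`, `transition_distance_loss`) are illustrative assumptions, not the exact formulation from the paper.

```python
import torch
import torch.nn as nn

class TDRPEncoder(nn.Module):
    """Maps raw states into an embedding space where Euclidean distance is
    meant to reflect "transition distance" (the number of environment steps
    needed to move between states). The architecture is an assumption."""

    def __init__(self, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)


def transition_distance_loss(encoder, s_i, s_j, k_steps):
    """Hypothetical objective: for state pairs (s_i, s_j) sampled from the
    same trajectory k_steps apart, make the embedding distance match k_steps."""
    dist = torch.norm(encoder(s_i) - encoder(s_j), dim=-1)
    return ((dist - k_steps.float()) ** 2).mean()
```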
The proposed approach is evaluated on a wide range of manipulation and navigation tasks. We conduct experiments in two simulated robot navigation tasks and five manipulation tasks. The navigation tasks, Maze2D and AntMaze, are sourced from the D4RL benchmark. The table-wiping task (Wipe) and nut-assembly task (NutAssemblySquare) are from the Robosuite benchmark. The Pick-nut (Pick), Place-nut-on-bolt (Place), and Screw-nut (Screw) tasks are from the Factory benchmark, which includes a Franka Panda arm, a table, a nut, and a bolt. All components have high-quality simulations and rich meshes, as the complex control required to solve these tasks demands high-fidelity simulation. The experimental results demonstrate the effectiveness of measuring the "transition distance" between states and the improvement induced by the auxiliary rewards, which not only promotes learning efficiency but also increases convergence stability.
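As a rough illustration of how these benchmark environments are typically instantiated (the specific task variants and options used in our experiments may differ, and the Factory tasks are set up separately):

```python
# Illustrative environment setup; the task variants and options below are
# assumptions, and the Factory tasks (Pick, Place, Screw) are omitted here.
import gym
import d4rl  # noqa: F401  registers the maze2d-* and antmaze-* environments
import robosuite as suite

maze_env = gym.make("maze2d-umaze-v1")   # D4RL Maze2D navigation task
ant_env = gym.make("antmaze-umaze-v0")   # D4RL AntMaze navigation task

wipe_env = suite.make(
    env_name="Wipe",                      # Robosuite table-wiping task
    robots="Panda",
    has_renderer=False,
    use_camera_obs=False,
)
nut_env = suite.make(
    env_name="NutAssemblySquare",         # Robosuite nut-assembly task
    robots="Panda",
    has_renderer=False,
    use_camera_obs=False,
)
```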
Maze2D
In this scenario, a ball is randomly placed within a maze and must navigate around obstacles to reach target locations. The task involves controlling the ball's movement direction to maneuver around obstacles and reach the designated targets.
AntMaze
In this setup, an 8-DoF "Ant" quadruped robot is situated within a maze. The goal is to control the robot's eight degrees of freedom, enabling it to navigate around obstacles and reach specified target locations.
Table-wiping
In this setup, a table featuring a whiteboard surface with various markings is positioned in front of a single robot arm. The robot arm is equipped with a whiteboard eraser mounted on its hand, and its objective is to learn how to wipe the whiteboard surface clean, removing all of the markings. The initial arrangement of markings on the whiteboard is randomized at the start of each episode.
Screw-nut
This task requires screwing the nut down to a certain height; the nut is initialized at the top of a bolt placed on a table.
Pick-nut
This task requires the robot arm to grasp a nut lying on a work surface with a parallel-jaw gripper. The initial location of the nut is randomized in each episode.
Place-nut-on-bolt
In this scenario, a single robot arm transports the nut to the top of a bolt. In each episode, the initial location of the robot is randomized and the bolt is fixed to the surface.
With the learned TDRP representation, we generate auxiliary rewards for the tasks as proposed in the paper and train the policies with the reshaped rewards. Videos of the learned policies are shown here.
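A minimal sketch of how such a shaped reward could be computed from a learned encoder, assuming the dense bonus is the decrease in embedding distance to a goal state between consecutive steps; the function name, signature, and scaling are hypothetical rather than the paper's exact scheme.

```python
import torch

def auxiliary_reward(encoder, s, s_next, s_goal, scale=1.0):
    """Hypothetical shaping term: progress toward the goal measured as the
    reduction in learned transition distance between consecutive states."""
    with torch.no_grad():
        z, z_next, z_goal = encoder(s), encoder(s_next), encoder(s_goal)
        d_before = torch.norm(z - z_goal, dim=-1)
        d_after = torch.norm(z_next - z_goal, dim=-1)
    return scale * (d_before - d_after)

# During training, the shaped reward could augment the sparse environment
# reward, e.g. r_total = r_env + auxiliary_reward(encoder, s, s_next, s_goal).
```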
We perform sim-to-real transfer of our method by deploying policies trained in simulation to control the Franka Panda arm in the real world for manipulation tasks. Specifically, we demonstrate the pick task, the screw task, and a skill-chaining task that involves picking, placing, and screwing actions. The videos are shown here.