Trakdis: A Transformer-based Knowledge Distillation Approach for Visual Reinforcement Learning with application to Cloth Manipulation

Wei Chen and Nicolas Rojas

 REDS Lab, Dyson School of Design Engineering, Imperial College London

Introduction

Approaching robotic cloth manipulation using reinforcement learning based on visual feedback is appealing, as robot perception and control can be learned simultaneously. However, major challenges arise from the intricate dynamics of cloth and the high dimensionality of the corresponding states, which limits the practicality of the idea. To tackle these issues, we propose TraKDis, a novel Transformer-based Knowledge Distillation approach that decomposes the visual reinforcement learning problem into two distinct stages. In the first stage, a privileged agent is trained with complete knowledge of the cloth state information. This privileged agent acts as a teacher, providing valuable guidance and training signals for the subsequent stage. The second stage involves a knowledge distillation procedure, where the knowledge acquired by the privileged agent is transferred to a vision-based agent by leveraging pre-trained state estimation and weight initialization.

Challenges of Cloth Manipulation

Tracking cloth state in a dynamic environment is an open challenge in robotics and computer vision. Although it is possible to train a high-performing policy on cloth-state information in simulation, such a state-based policy cannot be applied in the real world, where ground-truth state is unavailable. Therefore, a control policy driven by accessible sensory input (e.g. RGB images) is necessary for real-world deployment.

TraKDis: overview

We propose a method that distils knowledge from a privileged agent with access to state information in order to train a student (vision-based) agent.


Our objective is to address highly dynamic cloth manipulation tasks using only RGB image input. In contrast to previous works that feed asymmetric inputs (RGB images and states) to an actor-critic architecture, we use a one-to-one knowledge distillation (KD) scheme for learning vision-based tasks. As illustrated in the figure, our approach, TraKDis, decomposes the vision agent's learning into two stages.

First, we train a privileged agent that takes privileged cloth state information as input. Then, we employ KD to distil knowledge from the privileged agent to the student agent. To reduce the domain gap between visual observations and states during distillation, we employ a pre-trained CNN state estimation encoder as a prior. Since both models share the same architecture, we initialize the student policy weights from the privileged agent at the start of distillation, which significantly facilitates knowledge transfer.

Reinforcement Learning by Decision Transformer

We adopt a Decision Transformer (DT) architecture to train the RL policy. We first conduct offline training on expert data collected with a human-designed heuristic algorithm. Online fine-tuning is then applied to the trained model to improve performance.
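A Decision Transformer conditions each action prediction on the return-to-go (the sum of future rewards), alongside the state and action history. The following sketch shows how the return-to-go sequence is computed from an offline trajectory's rewards; the function name and discount parameter are our own illustrative choices, not part of TraKDis.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Return-to-go at each timestep: the (discounted) sum of rewards
    from t to the end of the trajectory. A Decision Transformer consumes
    tokens of the form (return-to-go, state, action) at every step."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

# For an expert trajectory with rewards [1, 2, 3],
# the conditioning sequence is [6, 5, 3].
rtg = returns_to_go([1.0, 2.0, 3.0])
```

At inference time, the first token is set to a target return, and the return-to-go is decremented by each observed reward as the episode unfolds.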

TraKDis: Knowledge Distillation for Learning Visual Control Agent

Due to the large domain gap between ground-truth state information and image observations, we propose two components to facilitate the knowledge distillation procedure: a CNN state estimation encoder and weight initialization.

Pre-trained CNN Encoder for State Estimation

We design a CNN encoder that estimates state information from image observations. Image augmentation is applied during training to improve the robustness of the state estimation.
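As a concrete illustration, the sketch below pairs a small convolutional encoder with a random-shift augmentation (pad, then random crop), a common robustness trick in visual RL. The layer sizes, state dimension, and augmentation choice are our assumptions; the paper's exact architecture may differ.

```python
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    """Illustrative CNN mapping an RGB observation to an estimated
    cloth-state vector (architecture details are assumptions)."""
    def __init__(self, state_dim=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, state_dim)

    def forward(self, img):            # img: (B, 3, H, W) in [0, 1]
        return self.head(self.conv(img))

def random_shift(img, pad=4):
    """Pad with edge pixels, then crop back to the original size at a
    random offset; the regression target (the state) is unchanged."""
    _, _, h, w = img.shape
    padded = nn.functional.pad(img, (pad,) * 4, mode='replicate')
    top = torch.randint(0, 2 * pad + 1, (1,)).item()
    left = torch.randint(0, 2 * pad + 1, (1,)).item()
    return padded[:, :, top:top + h, left:left + w]

obs = torch.rand(2, 3, 84, 84)
pred_state = StateEncoder()(random_shift(obs))   # shape (2, 64)
```

The encoder can then be trained with a simple regression loss (e.g. MSE) against simulated ground-truth states before distillation begins.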

Knowledge Distillation via Weight Initialization

We run the teacher and student policies simultaneously for knowledge distillation. The teacher policy has access to the full state dynamics of the cloth and therefore performs better. The student policy, which receives image inputs, is trained to imitate the actions of the teacher policy. Using the pre-trained CNN encoder, each image observation is encoded into an estimated state S'. Since the student agent and the privileged agent share the same architecture, we initialize the student policy by copying the weights of the teacher's policy, which aids the knowledge distillation process. The parameters of the CNN encoder and the teacher policy are frozen during training; only the student policy is updated via the imitation loss.
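The procedure above can be sketched as a single distillation update. To keep the example self-contained we stand in a small MLP for the shared (Decision Transformer) policy architecture, and the loss choice (MSE between actions) is our assumption; the structure (weight copy, frozen teacher, student-only updates) follows the description.

```python
import copy
import torch
import torch.nn as nn

# Stand-in for the shared policy architecture; in TraKDis both the
# teacher and the student use the same Decision Transformer.
teacher = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 8))

# Weight initialization: the student starts as an exact copy of the teacher.
student = copy.deepcopy(teacher)

# The teacher (like the pre-trained CNN encoder) is frozen during training.
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-4)

def distill_step(true_state, estimated_state):
    """One imitation update: the teacher acts on the ground-truth state,
    the student acts on the CNN-estimated state S', and the student is
    updated to match the teacher's action."""
    with torch.no_grad():
        target_action = teacher(true_state)
    loss = nn.functional.mse_loss(student(estimated_state), target_action)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss0 = distill_step(torch.randn(16, 64), torch.randn(16, 64))
```

Because the student's weights start from the teacher's, the imitation loss begins small wherever the estimated state is close to the true state, so training mostly corrects for state-estimation error rather than learning the policy from scratch.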

Experiments Setup

We assess the effectiveness of our approach on these tasks, comparing against other state-of-the-art algorithms. We also perform an ablation study to quantify the contribution of each component of our method, and we evaluate the model's robustness to perturbations in the state estimation. Finally, the practical applicability of our approach is evaluated through real-world demonstrations. More results and analysis can be found in the presentation section below.

TraKDis running samples

Real-world running samples

Detailed Presentation Video and Experiment Trials 

A detailed explanation of our method, results and analysis, and more experiment trials can be found in this video (audio included).

If you have any questions, please feel free to contact me via W.CHEN21@IMPERIAL.AC.UK