DiffClone: Enhanced Behaviour Cloning in Robotics with Diffusion-Driven Policy Learning

Sabariswaran M1, Abhranil Chandra*1,2, Sreyas Venkataraman*1, Adyan Rizvi*1, Yash Sirvi*1, Soumojit Bhattacharya*1, Aritra Hazra1

1. Indian Institute of Technology Kharagpur, West Bengal, India                 2. University of Waterloo, Ontario, Canada

{sabaris.offl, vsreyas20, adyan2004, yashsirvi, soumojit048}@kgpian.iitkgp.ac.in ; abhranil.chandra@uwaterloo.ca ; aritrah@cse.iitkgp.ac.in

NeurIPS 2023 - TOTO Challenge

Abstract

Schematic model of our proposed DiffClone framework: a generative model that, at each time step t, takes as input the latest T_o observations O_t and predicts the T_a subsequent actions A_t. In the CNN variant, Feature-wise Linear Modulation (FiLM) is used for conditioning at each convolution layer.

Robot learning tasks are extremely compute-intensive and hardware-specific. Tackling these challenges with a diverse dataset of offline demonstrations that can be used to train robot manipulation agents is therefore very appealing. The Train-Offline-Test-Online (TOTO) Benchmark provides a well-curated, open-source dataset for offline training, comprised mostly of expert data, together with benchmark scores of common offline-RL and behaviour-cloning agents. In this paper, we introduce DiffClone, an offline algorithm that enhances a behaviour-cloning agent with diffusion-based policy learning, and we measure the efficacy of our method on real physical robots at test time. This is also our official submission to the Train-Offline-Test-Online (TOTO) Benchmark Challenge organized at NeurIPS 2023. We experimented with both pre-trained visual representations and agent policies. In our experiments, we find that a MoCo-finetuned ResNet-50 performs best among the finetuned representations. Goal-state conditioning and mapping observations to transitions resulted in a minor increase in the success rate and mean reward. For the agent policy, we developed DiffClone, a behaviour-cloning agent improved using conditional diffusion.

Example Data Description

TOTO Task Suite: The benchmark tasks are pouring and scooping; each involves challenging variations in objects, positions, and more. The dataset consists of over 1.26 million images of robot actions across 1,895 trajectories of scooping data and 1,003 trajectories of pouring data. Each time step contains RGB and depth images, along with the joint states of the arm, the action, and a sparse reward.
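To make the per-time-step contents concrete, the following is a minimal sketch of one TOTO transition as a Python container; the field names, shapes, and dtypes are our illustrative assumptions rather than the benchmark's official schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class TotoTransition:
    """One time step of a TOTO trajectory (names, shapes, and dtypes are illustrative)."""
    rgb: np.ndarray          # (H, W, 3) uint8 camera image
    depth: np.ndarray        # (H, W) float32 depth map
    joint_state: np.ndarray  # (7,) float32 arm joint positions
    action: np.ndarray       # (7,) float32 commanded joint targets
    reward: float            # sparse task reward for this time step

# A trajectory is then simply an ordered list of transitions.
Trajectory = list[TotoTransition]
```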

Our Proposed Algorithm

Salient Points

In DiffClone, we start by selectively sub-sampling trajectories to create a subset of "expert" data. This involves choosing trajectories with the highest rewards, ensuring the dataset captures optimal behaviour. Following this, we employ a Momentum Contrast (MoCo) model, fine-tuned on our datasets, as our visual-encoder backbone. This model processes images to extract relevant states. Once these states are obtained, we normalize them across the dataset to enhance the stability of the policy we intend to learn. Finally, we implement a behaviour cloning agent using a CNN-based Diffusion Policy. We chose this strategy over other offline RL alternatives, motivated by our success in generating an expert dataset that accurately represents the distribution of the given trajectories. The "expert" dataset’s quality and representativeness allowed us to use behaviour cloning techniques that more effectively and accurately replicated the desired optimal behaviour and policy. 
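A minimal sketch of the first three stages (reward-based sub-sampling, MoCo-finetuned visual encoding, and dataset-wide state normalization) is shown below, assuming trajectories are lists of per-step dictionaries; the reward quantile, checkpoint path, and preprocessing are illustrative assumptions, not the exact configuration of our submission.

```python
import torch
import torchvision

def build_expert_states(trajectories, reward_quantile=0.9, moco_ckpt="moco_finetuned.pth"):
    """Sub-sample high-reward trajectories, encode frames with a MoCo-finetuned ResNet-50,
    and normalize the resulting states across the dataset.
    (Quantile, checkpoint path, and trajectory layout are illustrative assumptions.)"""
    # 1) Keep only the highest-return trajectories as the "expert" subset.
    returns = torch.tensor([sum(step["reward"] for step in traj) for traj in trajectories])
    cutoff = torch.quantile(returns, reward_quantile)
    expert = [traj for traj, ret in zip(trajectories, returns) if ret >= cutoff]

    # 2) MoCo-finetuned ResNet-50 backbone as the visual encoder (classifier head removed).
    encoder = torchvision.models.resnet50()
    encoder.fc = torch.nn.Identity()
    encoder.load_state_dict(torch.load(moco_ckpt), strict=False)
    encoder.eval()

    # 3) Encode every frame into a 2048-d state vector.
    #    step["rgb"] is assumed to be preprocessed to a (3, 224, 224) float tensor.
    with torch.no_grad():
        states = torch.cat([
            encoder(torch.stack([step["rgb"] for step in traj]))
            for traj in expert
        ])

    # 4) Normalize states across the whole expert dataset for a more stable policy.
    mean, std = states.mean(0), states.std(0) + 1e-6
    states = (states - mean) / std
    return expert, states, (mean, std)
```

The normalization statistics are retained so that observations can be normalized identically at test time on the real robot.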

Behaviour cloning is typically improved either by looking further into the future (increasing the prediction horizon) or by using stronger sequence-modelling architectures such as RNNs and LSTMs. Diffusion models, in contrast, have proven successful at capturing complex distributions while efficiently preserving their multi-modality, which makes them well suited to modelling the action distributions present in demonstration data.
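To illustrate the FiLM-conditioned, CNN-based diffusion policy described above (and in the schematic), the sketch below implements one DDPM-style training step that learns to denoise expert action sequences conditioned on encoded observations; the layer sizes, embedding dimensions, and the simplified denoiser stand in for a full 1D U-Net and are assumptions, not the exact architecture we trained.

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """1D conv block whose features are scaled and shifted by the conditioning vector (FiLM)."""
    def __init__(self, in_ch, out_ch, cond_dim):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1)
        self.film = nn.Linear(cond_dim, 2 * out_ch)   # predicts per-channel (gamma, beta)
        self.act = nn.Mish()

    def forward(self, x, cond):                        # x: (B, C, T_a), cond: (B, cond_dim)
        h = self.conv(x)
        gamma, beta = self.film(cond).chunk(2, dim=-1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)
        return self.act(h)

class ActionDenoiser(nn.Module):
    """Predicts the noise added to an action sequence, conditioned on observation features
    and the diffusion timestep (a simplified stand-in for a full 1D U-Net)."""
    def __init__(self, act_dim, obs_dim, hidden=256, n_steps=100):
        super().__init__()
        self.step_emb = nn.Embedding(n_steps, 64)      # n_steps must match len(alphas_cumprod)
        cond_dim = obs_dim + 64
        self.block1 = FiLMConvBlock(act_dim, hidden, cond_dim)
        self.block2 = FiLMConvBlock(hidden, hidden, cond_dim)
        self.out = nn.Conv1d(hidden, act_dim, kernel_size=1)

    def forward(self, noisy_actions, obs_feat, k):     # noisy_actions: (B, act_dim, T_a)
        cond = torch.cat([obs_feat, self.step_emb(k)], dim=-1)
        h = self.block1(noisy_actions, cond)
        h = self.block2(h, cond)
        return self.out(h)

def ddpm_training_step(model, optimizer, actions, obs_feat, alphas_cumprod):
    """One denoising-diffusion training step on a batch of expert action sequences."""
    B = actions.shape[0]
    k = torch.randint(0, len(alphas_cumprod), (B,), device=actions.device)
    noise = torch.randn_like(actions)
    a_bar = alphas_cumprod[k].view(B, 1, 1)
    noisy = a_bar.sqrt() * actions + (1 - a_bar).sqrt() * noise   # forward diffusion
    loss = nn.functional.mse_loss(model(noisy, obs_feat, k), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time, an action sequence is sampled by starting from Gaussian noise and iteratively applying the denoiser conditioned on the latest T_o observations; the first few of the T_a predicted actions are executed before re-planning, as described in the schematic above.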

Experimental Results

Manipulation Results of Trained Policy