@InProceedings{hejna23distance,
  title     = {Distance Weighted Supervised Learning for Offline Interaction Data},
  author    = {Hejna, Joey and Gao, Jensen and Sadigh, Dorsa},
  booktitle = {Proceedings of the 40th International Conference on Machine Learning},
  series    = {Proceedings of Machine Learning Research},
  publisher = {PMLR},
  year      = {2023},
  url       = {https://arxiv.org/abs/2304.13774}
}
Sequential decision making algorithms often struggle to leverage different sources of unstructured offline interaction data. Imitation learning (IL) methods based on supervised learning are robust, but require optimal demonstrations, which are hard to collect. Offline goal-conditioned reinforcement learning (RL) algorithms promise to learn from sub-optimal data, but face optimization challenges, especially with high-dimensional data like images. To bridge the gap between IL and RL, we introduce Distance Weighted Supervised Learning, or DWSL, a supervised method for learning goal-conditioned policies from offline data. DWSL models the entire distribution of time-steps between states in offline data with only supervised learning, and uses this distribution to approximate shortest path distances. To extract a policy, we weight actions by their reduction in shortest-path estimates. Theoretically, DWSL converges to an optimal policy constrained to the data distribution, an attractive property for offline learning, without any bootstrapping. Across all datasets we test, DWSL empirically maintains behavior cloning as a lower bound, while still exhibiting policy improvement. In high-dimensional image domains, DWSL surpasses the performance of both prior goal-conditioned IL and RL algorithms.
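To make the two core ideas in the abstract concrete, here is a minimal conceptual sketch: estimate a shortest-path distance from a learned distribution over time-step gaps between states and goals, and weight actions by their reduction in that estimate. This is an illustrative sketch only; the function names, the choice of quantile, and the exact advantage definition are assumptions, not the authors' implementation (see the paper and code for the actual method).

# Conceptual sketch of distance estimation and action weighting, as described
# in the abstract. All names and constants below are illustrative assumptions.
import numpy as np

def shortest_path_estimate(distance_probs: np.ndarray, quantile: float = 0.1) -> float:
    """Approximate a shortest-path distance from a categorical distribution
    over time-step gaps (bins 0, 1, 2, ...) by taking a low quantile of it.
    The specific statistic used here is an assumption for illustration."""
    cdf = np.cumsum(distance_probs)
    return float(np.searchsorted(cdf, quantile))

def action_weight(dist_s: float, dist_next_s: float, alpha: float = 1.0) -> float:
    """Exponentiated 'advantage' of an action: how much it reduces the
    estimated distance to the goal, accounting for the one step it costs."""
    advantage = dist_s - (1.0 + dist_next_s)
    return float(np.exp(alpha * advantage))

# Example: distributions over time-step gaps predicted for a state and the
# next state reached by some action, both conditioned on the same goal.
probs_s = np.array([0.05, 0.40, 0.30, 0.15, 0.10])
probs_next = np.array([0.50, 0.30, 0.15, 0.05, 0.00])
w = action_weight(shortest_path_estimate(probs_s), shortest_path_estimate(probs_next))
# A policy would then be extracted by weighted behavior cloning, i.e.
# maximizing w * log pi(a | s, g) over the offline dataset.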
Below, we show visualizations for the Gym Robotics environments (learning from pixels) and Franka Kitchen. For each environment, we show rollouts of our method, DWSL, alongside GCSL, a purely imitation-learning-based approach, and WGCSL, a Q-learning-based approach.
The red sphere shows the desired pushing location.
The red sphere shows the desired ending location of the cube.
The red sphere shows the goal location for the puck.
The goal for the agent is to match two of its fingers together. The final WGCSL policy appears unstable and constantly shakes. The GCSL hand makes no progress.
Four tasks must be completed in sequence. Here, GCSL and DWSL complete all four tasks, while the Q-learning approach, WGCSL, completes only two.