Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network
Jijia Liu, Feng Gao, Qingmin Liao, Chao Yu †, Yu Wang †
† Corresponding Authors
Accepted at ICML 2025
Reinforcement learning (RL) for continuous control often requires large amounts of online interaction data. Value-based RL methods, combined with offline demonstrations, can mitigate this burden by offering relatively high sample efficiency. However, these methods struggle to identify optimal actions when learning from suboptimal data.
To address these issues, we propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm.
ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks.
ARSQ auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks.
A motivating example of how Q decomposition influences policy training.
Consider a simple one-step decision-making task with a two-dimensional action (a1, a2), where the training dataset consists of three distinct modes: one optimal mode with r = 1 and two suboptimal modes with r = 0.1 and r = −1.
If the suboptimal modes are more prevalent in the dataset, conventional Q-learning approaches that estimate action dimensions independently, i.e., Q(s, ai), could undervalue the optimal mode. This bias can hinder the correct identification and reinforcement of the optimal action mode, leading to slow convergence and degraded policy performance.
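The numerical sketch below illustrates this failure mode. The specific mode-to-action assignments and sample counts are illustrative assumptions, not values from the paper: scoring each action dimension independently picks a suboptimal mode, while a joint evaluation recovers the optimal one.

```python
import numpy as np

# Illustrative dataset (assumed, not from the paper): binary actions per dimension,
# three modes (a1, a2) -> reward, with suboptimal modes more prevalent.
modes = [
    ((1, 1), 1.0, 10),   # optimal mode, rare
    ((1, 0), 0.1, 45),   # suboptimal mode, common
    ((0, 1), -1.0, 45),  # suboptimal mode, common
]
data = [(a, r) for a, r, n in modes for _ in range(n)]

def per_dim_value(dim, val):
    """Monte-Carlo estimate of Q(s, a_dim = val) when the dimension is scored independently."""
    return np.mean([r for a, r in data if a[dim] == val])

# Greedy choice per dimension, ignoring the other dimension.
a1 = max((0, 1), key=lambda v: per_dim_value(0, v))
a2 = max((0, 1), key=lambda v: per_dim_value(1, v))
print("independent per-dimension greedy action:", (a1, a2))  # (1, 0) -> r = 0.1

# A joint Q over full actions recovers the optimal mode.
def joint_value(action):
    rs = [r for a, r in data if a == action]
    return np.mean(rs) if rs else -np.inf

best = max([a for a, _, _ in modes], key=joint_value)
print("joint greedy action:", best)                           # (1, 1) -> r = 1.0
```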
We extend soft Q-learning theory and propose the Auto-Regressive Soft Q-learning (ARSQ) algorithm.
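For reference, the standard maximum-entropy quantities that ARSQ builds on are summarized below. This is background notation (temperature α, discrete action set) rather than the paper's exact formulation; the dimensional decomposition described next operates on the soft advantage through the auto-regressive factorization in the last line.

```latex
\begin{aligned}
V_{\mathrm{soft}}(s) &= \alpha \log \sum_{a} \exp\!\left(\frac{Q_{\mathrm{soft}}(s,a)}{\alpha}\right), \\
A_{\mathrm{soft}}(s,a) &= Q_{\mathrm{soft}}(s,a) - V_{\mathrm{soft}}(s), \qquad
\pi^{*}(a \mid s) = \exp\!\left(\frac{A_{\mathrm{soft}}(s,a)}{\alpha}\right), \\
\pi^{*}(a \mid s) &= \prod_{i} \pi^{*}\!\left(a_{i} \mid s, a_{<i}\right)
\quad \text{(auto-regressive factorization over action dimensions).}
\end{aligned}
```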
Coarse-to-fine Action Discretization
We hierarchically discretize each action dimension across multiple levels and progressively refine the action selection (Seo et al., 2024), narrowing down action choices in stages rather than evaluating all bins simultaneously.
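A minimal sketch of this coarse-to-fine selection for a single action dimension is shown below. The helper `coarse_to_fine_action`, the callable `q_fn`, and the bin/level counts are illustrative assumptions rather than the exact ARSQ (or CQN) implementation; the real networks condition on previous levels and on other action dimensions as well.

```python
import numpy as np

def coarse_to_fine_action(q_fn, obs, num_levels=3, num_bins=5, low=-1.0, high=1.0):
    """Pick a scalar action by refining the discretization over `num_levels` stages."""
    lo, hi = low, high
    for level in range(num_levels):
        edges = np.linspace(lo, hi, num_bins + 1)
        centers = (edges[:-1] + edges[1:]) / 2.0
        best = int(np.argmax(q_fn(obs, level, centers)))  # one Q-value per bin center
        lo, hi = edges[best], edges[best + 1]             # zoom into the chosen bin
    return (lo + hi) / 2.0                                # center of the finest interval

# Toy usage: a Q-function that prefers actions near 0.37.
action = coarse_to_fine_action(lambda obs, lvl, c: -np.abs(c - 0.37), obs=None)
print(round(action, 3))  # 0.368 -- close to 0.37 at resolution 2 / 5**3
```

With 3 levels of 5 bins each, the effective resolution is 5^3 = 125 bins per dimension while each stage only evaluates 5 candidates, which is the sample-efficiency benefit of refining in stages rather than scoring all bins at once.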
Auto-regressive Neural Network
We propose dimensional soft advantage, which decomposes the soft advantage across action dimensions for efficient policy representation.
Actions are generated auto-regressively along each dimension, enabling scalable soft Q-learning in high-dimensional action spaces without compromising expressiveness.
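The sketch below illustrates the auto-regressive structure under simplifying assumptions (the class name, layer sizes, and one-hot conditioning are hypothetical, and the coarse-to-fine levels are omitted): each head predicts per-bin dimensional advantages conditioned on a shared state embedding and the bins already chosen, and the soft policy samples bins dimension by dimension via a softmax over those advantages.

```python
import torch
import torch.nn as nn

class AutoRegressiveAdvantageNet(nn.Module):
    """Illustrative sketch: one head per action dimension predicts per-bin
    dimensional advantages, conditioned on a shared state embedding and the
    one-hot encodings of the bins chosen for earlier dimensions."""

    def __init__(self, obs_dim, act_dims, num_bins, hidden=256):
        super().__init__()
        self.act_dims, self.num_bins = act_dims, num_bins
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Head i sees the state embedding plus the one-hot bins of dimensions < i.
        self.heads = nn.ModuleList(
            nn.Linear(hidden + i * num_bins, num_bins) for i in range(act_dims)
        )

    def forward(self, obs, temperature=1.0):
        z = self.backbone(obs)
        prev, bins, advantages = [], [], []
        for head in self.heads:
            adv = head(torch.cat([z] + prev, dim=-1))          # per-bin advantage A_i
            probs = torch.softmax(adv / temperature, dim=-1)   # soft policy over bins
            b = torch.multinomial(probs, num_samples=1).squeeze(-1)
            prev.append(nn.functional.one_hot(b, self.num_bins).float())
            bins.append(b)
            advantages.append(adv)
        return torch.stack(bins, dim=-1), advantages

# Toy usage: sample a 4-dimensional discrete action auto-regressively.
net = AutoRegressiveAdvantageNet(obs_dim=8, act_dims=4, num_bins=5)
bins, _ = net(torch.randn(2, 8))
print(bins.shape)  # torch.Size([2, 4]) -- one chosen bin per action dimension
```

Because each head only scores the bins of its own dimension given earlier choices, the cost grows linearly with the number of action dimensions rather than exponentially with the joint action space.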
We evaluate performance when (i) the offline dataset is suboptimal, (ii) the online collected data is suboptimal, and (iii) key components are ablated.
Benchmark - D4RL
We compare our method with CQN, a state-of-the-art value-based RL algorithm, and Behavior Cloning (BC) to evaluate performance when training online with suboptimal datasets.
ARSQ achieves approximately 1.62× the overall performance of CQN. Notably, when using only the bottom 30% of the data, ARSQ attains 2.0× the performance of CQN, demonstrating its superior effectiveness with suboptimal data.
Additionally, we evaluate performance in a fully offline setting.
ARSQ achieves superior overall results compared to both offline RL and Imitation Learning methods.
Benchmark - RLBench
In addition to CQN and BC, we compare our method with actor-critic baselines DrQ-v2 and its improved variant DrQ-v2+, as well as a stronger behavior cloning baseline, ACT, to evaluate performance when online collected data is suboptimal.
Across these tasks, ARSQ consistently outperforms all other algorithms, highlighting its effectiveness in online learning with suboptimal data.
Ablation Study
Auto-regressive Conditioning: We examine the effects of swapping the order of coarse-to-fine and auto-regressive conditioning, or removing each component.
Shared Backbone: We also assess the impact of partially or fully removing the shared backbone in the network architecture.
All ablation variants result in degraded performance, underscoring the importance of these design choices.