While there has been substantial success in applying actor-critic methods to continuous control, simpler critic-only methods such as Q-learning often remain intractable in the associated high-dimensional action spaces. Moreover, most actor-critic methods come at the cost of added complexity: heuristics for stabilisation, increased compute requirements, and wider hyperparameter search spaces. We show that these issues can be largely alleviated via Q-learning by combining action discretization with value decomposition, framing single-agent control as cooperative multi-agent reinforcement learning (MARL). With bang-bang actions, the performance of this critic-only approach matches that of state-of-the-art continuous actor-critic methods when learning from features or pixels. We extend classical bandit examples from cooperative MARL to provide intuition for how decoupled critics leverage state information to coordinate joint optimization, and demonstrate surprisingly strong performance across a wide variety of continuous control tasks.
Discretize the continuous action space along each dimension by considering only bang-bang actions
Instead of enumerating the joint action space, add one value output per bin and action dimension to the Q-network
Recover the overall value function by selecting one output per action dimension and taking the mean
This assumes a linear value function decomposition and treats single-agent continuous control as a multi-agent discrete control problem. The key difference from the original DQN agent is the reduced number of Q-network output dimensions and the additional aggregation across action dimensions. The remaining structure of the original agent can be left unchanged.
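Below is a minimal sketch of such a decoupled critic, assuming a PyTorch implementation; the names (`DecoupledQNetwork`, `joint_q_value`, `td_target`) and the network sizes are illustrative choices, not taken from the paper's code. It shows the per-dimension, per-bin output head, the mean aggregation of selected utilities, and how the greedy DQN-style target decomposes across action dimensions.

```python
# Sketch of a decoupled Q-network: one head outputs |bins| utilities per
# action dimension, and the joint Q-value is the mean of the selected
# per-dimension utilities (linear value decomposition).
import torch
import torch.nn as nn


class DecoupledQNetwork(nn.Module):
    def __init__(self, obs_dim: int, action_dims: int, num_bins: int = 2, hidden: int = 256):
        super().__init__()
        self.action_dims = action_dims
        self.num_bins = num_bins
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # action_dims * num_bins outputs instead of num_bins ** action_dims joint actions.
        self.head = nn.Linear(hidden, action_dims * num_bins)

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # Per-dimension utilities of shape (batch, action_dims, num_bins).
        q = self.head(self.trunk(obs))
        return q.view(-1, self.action_dims, self.num_bins)


def joint_q_value(per_dim_q: torch.Tensor, bin_indices: torch.Tensor) -> torch.Tensor:
    """Mean decomposition: Q(s, a) = 1/N * sum_i Q_i(s, a_i).

    per_dim_q: (batch, action_dims, num_bins); bin_indices: (batch, action_dims), int64.
    """
    chosen = per_dim_q.gather(-1, bin_indices.unsqueeze(-1)).squeeze(-1)  # (batch, action_dims)
    return chosen.mean(dim=-1)


def td_target(reward: torch.Tensor, discount: torch.Tensor, next_per_dim_q: torch.Tensor) -> torch.Tensor:
    # The greedy joint value decouples: maximize each dimension independently,
    # then average, so the standard DQN target carries over unchanged.
    return reward + discount * next_per_dim_q.max(dim=-1).values.mean(dim=-1)
```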
Performance for state-based control on DeepMind Control Suite and MetaWorld tasks. We compare DecQN with a discrete bang-off-bang policy to the continuous D4PG and DMPO agents. Mean and standard deviation are computed over 10 seeds with a single set of hyperparameters. DecQN yields performance competitive with state-of-the-art continuous control agents, scaling to high-dimensional Dog tasks via a simple decoupled critic representation and an epsilon-greedy policy.
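The epsilon-greedy policy operates directly on the per-dimension utilities. A short sketch, assuming a bang-off-bang discretization with three bins mapped to {-1, 0, +1} and exploration applied independently per action dimension (the exact exploration scheme and names are our assumptions):

```python
# Hedged sketch of per-dimension epsilon-greedy action selection with a
# bang-off-bang discretization; not the paper's reference implementation.
import torch

BIN_VALUES = torch.tensor([-1.0, 0.0, 1.0])  # bang-off-bang command per dimension


def select_action(per_dim_q: torch.Tensor, epsilon: float) -> torch.Tensor:
    """per_dim_q: (action_dims, num_bins) utilities for a single observation."""
    greedy = per_dim_q.argmax(dim=-1)                         # greedy bin per dimension
    random_bins = torch.randint(per_dim_q.shape[-1], greedy.shape)
    explore = torch.rand(greedy.shape) < epsilon              # explore per dimension
    bins = torch.where(explore, random_bins, greedy)
    return BIN_VALUES[bins]                                   # continuous command in [-1, 1]
```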
Performance on pixel-based control tasks from the DeepMind Control Suite, comparing DecQN with a bang-bang policy to the continuous DrQ-v2 and Dreamer-v2 agents. We note that DecQN successfully accommodates the additional representation learning challenges. The best-performing runs on Humanoid indicate that DecQN can efficiently solve complex tasks from vision, though this may require environment-specific hyperparameter settings or more sophisticated exploration.
Learning curves comparing DecQN and D4PG on Quadruped (|A|=12), Humanoid (|A|=21), and Dog (|A|=38).
DecQN and DQN on cooperative matrix games. Left: a two-step game in which agent 1 selects in step 1 which payoff matrix is used in step 2 (top vs. bottom). The learned Q-values of DecQN indicate that accurate values around the optimal policy are sufficient (epsilon = 0.5), even when the full value distribution cannot be represented well (epsilon = 1.0). Right: a matrix game with actions serving as acceleration inputs to a pointmass (x vs. y). DecQN struggles to solve the single-step game (no dynamics); in the multi-step case, DecQN leverages velocity information to coordinate action selection (middle).