Learning non-Markovian Decision-Making from State-only Sequences

Aoyang Qin*,1,2, Feng Gao3, Qing Li2, Song-Chun Zhu1,2,4, Sirui Xie*,5

1Department of Automation, Tsinghua University

2Beijing Institute for General Artificial Intelligence (BIGAI)

3Department of Statistics, UCLA

4School of Artificial Intelligence, Peking University

5Department of Computer Science, UCLA

* Equal contribution

[paper] [code]

Abstract

Conventional imitation learning assumes access to the actions of demonstrators, but these motor signals are often non-observable in naturalistic settings. Additionally, sequential decision-making behaviors in these settings can deviate from the assumptions of a standard Markov Decision Process (MDP). To address these challenges, we explore deep generative modeling of state-only sequences with a non-Markov Decision Process (nMDP), where the policy is an energy-based prior in the latent space of the state transition generator. We develop maximum likelihood estimation to learn both the transition and the policy, which involves short-run MCMC sampling from the prior and importance sampling for the posterior. The learned model enables decision-making as inference: model-free policy execution is equivalent to prior sampling, and model-based planning is posterior sampling initialized from the policy. We demonstrate the efficacy of the proposed method in a prototypical path planning task with non-Markovian constraints and show that the learned model exhibits strong performance in challenging domains from the MuJoCo suite.
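As a rough illustration of what short-run MCMC sampling from an energy-based prior can look like, the sketch below runs a fixed, small number of Langevin steps from a given starting point. Langevin dynamics is an assumption here (the abstract only says "short-run MCMC"), and the function name `short_run_langevin`, the toy quadratic energy, and all step sizes are hypothetical stand-ins rather than the paper's implementation.

```python
import numpy as np

def short_run_langevin(grad_energy, z0, n_steps=20, step_size=0.1, rng=None):
    """Run a fixed number of Langevin steps targeting p(z) ∝ exp(-E(z)).

    grad_energy: callable returning dE/dz with the same shape as z (assumed given).
    z0: initial sample(s), e.g. drawn from a simple noise distribution.
    """
    rng = np.random.default_rng() if rng is None else rng
    z = np.array(z0, dtype=float)
    for _ in range(n_steps):
        noise = rng.standard_normal(z.shape)
        # Langevin update: gradient descent on the energy plus Gaussian noise.
        z = z - 0.5 * step_size**2 * grad_energy(z) + step_size * noise
    return z

# Toy stand-in for a learned energy: E(z) = 0.5 * ||z - mu||^2, so dE/dz = z - mu.
mu = np.array([2.0, -1.0])
toy_grad = lambda z: z - mu

rng = np.random.default_rng(0)
samples = short_run_langevin(toy_grad, np.zeros((2000, 2)),
                             n_steps=200, step_size=0.3, rng=rng)
```

In a learned model the gradient would come from automatic differentiation of a neural energy function conditioned on the state history; the toy energy is used only so the sketch runs on its own.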

Graphical model in MDP.

Graphical model in nMDP.

To demonstrate the necessity of a non-Markovian value and test the efficacy of the proposed model, we designed a motivating experiment. Path planning is a prototypical decision-making problem in which actions are taken in a 2D space, with the x-y coordinates as states. A policy for cubic curve planning is necessarily non-Markovian, since historical states are needed to estimate the higher-order derivatives. Our model, when tested with varying context lengths, successfully learned the cubic property from the provided demonstrations. To evaluate a generated trajectory, we fit a cubic polynomial by minimizing the mean squared error and use the residual error as a metric. When calculating the residual error, we exclude trajectories whose third-order coefficient is less than 0.5. The acceptance rate, defined as the number of accepted trajectories divided by the total number of test trajectories, is also a viable metric. It complements the residual error because it directly measures the understanding of cubic polynomials.
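The two metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions: each trajectory is a `(T, 2)` array of x-y states, y is fit as a cubic function of x, the threshold is applied to the absolute value of the third-order coefficient, and the residual is averaged over accepted trajectories. The function name `cubic_fit_metrics` is hypothetical and not taken from the released code.

```python
import numpy as np

def cubic_fit_metrics(trajectories, coef_threshold=0.5):
    """Return (mean residual error over accepted trajectories, acceptance rate)."""
    residuals = []
    for traj in trajectories:
        x, y = traj[:, 0], traj[:, 1]
        # Least-squares cubic fit; np.polyfit returns the highest-order term first.
        coeffs = np.polyfit(x, y, deg=3)
        # Assumption: reject when |third-order coefficient| is below the threshold.
        if abs(coeffs[0]) < coef_threshold:
            continue
        # Residual error: mean squared error between the fit and the trajectory.
        residuals.append(np.mean((np.polyval(coeffs, x) - y) ** 2))
    acceptance_rate = len(residuals) / len(trajectories)
    mean_residual = float(np.mean(residuals)) if residuals else float("nan")
    return mean_residual, acceptance_rate

# Usage: one exact cubic (accepted) and one straight line (rejected).
x = np.linspace(-1.0, 1.0, 20)
cubic = np.stack([x, x**3], axis=1)
line = np.stack([x, x], axis=1)
mean_residual, acceptance_rate = cubic_fit_metrics([cubic, line])
```

Reporting both numbers matters because the residual is computed only on accepted trajectories: a model that produces near-straight lines could otherwise score a low residual without having learned the cubic structure.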

Results for cubic curve generation.

We also ran a more intricate experiment on the MuJoCo platform, which is characterized by high-dimensional state and action spaces. Our model not only exhibited steeper learning curves than the state-only baselines but also matched or outperformed the action-label baselines.

Results in MuJoCo.
