PG-DPO: Pontryagin-Guided Direct Policy Optimization
Forward simulation. BPTT costates. Hamiltonian recovery.
The interactive sketch illustrates the basic PG-DPO mechanism in the Merton model.
Stage 1: Warm-up
A feasible policy network is trained by direct simulation.
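As a rough illustration of the warm-up stage, the sketch below trains a policy by direct simulation in a one-asset Merton model with CRRA utility. It is not the authors' implementation: a single scalar risky fraction `pi` stands in for the policy network, the market parameters are made-up illustrative values, and the antithetic sampling and plain gradient ascent are simplifications chosen to keep the example small and dependency-free.

```python
import math
import random

# Illustrative Merton parameters (assumed, not taken from the source).
MU, R, SIGMA, GAMMA = 0.08, 0.02, 0.2, 2.0
T, N_STEPS, X0 = 1.0, 10, 1.0
DT = T / N_STEPS


def sample_noise(n_pairs, seed=0):
    """Brownian increments in antithetic pairs (a variance-reduction choice)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_pairs):
        dw = [rng.gauss(0.0, math.sqrt(DT)) for _ in range(N_STEPS)]
        paths.append(dw)
        paths.append([-z for z in dw])
    return paths


def utility_and_grad(pi, noise):
    """Monte Carlo estimate of E[U(X_T)] and its pathwise derivative in pi."""
    total_u, total_g = 0.0, 0.0
    for dw in noise:
        log_x, dlog_dpi = math.log(X0), 0.0
        for z in dw:
            # Log-Euler wealth step under a constant risky fraction pi.
            drift = R + pi * (MU - R) - 0.5 * (pi * SIGMA) ** 2
            log_x += drift * DT + pi * SIGMA * z
            # Derivative of the same step with respect to pi.
            dlog_dpi += (MU - R - pi * SIGMA ** 2) * DT + SIGMA * z
        x_pow = math.exp((1.0 - GAMMA) * log_x)  # X_T^(1 - gamma)
        total_u += x_pow / (1.0 - GAMMA)         # CRRA utility of terminal wealth
        total_g += x_pow * dlog_dpi              # dU/dpi along this path
    n = len(noise)
    return total_u / n, total_g / n


def warm_up(pi0=0.0, lr=5.0, n_iters=100, n_pairs=500):
    """Gradient ascent on simulated expected utility with fixed noise."""
    noise = sample_noise(n_pairs)
    pi = pi0
    for _ in range(n_iters):
        _, grad = utility_and_grad(pi, noise)
        pi += lr * grad
    return pi, noise


pi_warm, noise = warm_up()
```

In this toy setting the trained `pi_warm` should land near the closed-form Merton fraction (MU - R) / (GAMMA * SIGMA**2), up to Monte Carlo error. A real warm-up would replace the scalar with a state-dependent network and let an autodiff framework supply the gradient.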
Stage 2: Costate estimation + Control recovery
BPTT through continuation rollouts produces pathwise costate estimates.
Monte Carlo averaging stabilizes these estimates into an adjoint signal.
The estimated costate is plugged into the Hamiltonian optimality condition, recovering the control directly.
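The three steps above can be sketched for the one-asset Merton model, where the Hamiltonian first-order condition gives pi* = -(mu - r) * lambda / (sigma^2 * x * d(lambda)/dx). This is a minimal sketch under stated assumptions, not the authors' code: the parameters are illustrative, the warm-up policy is a constant fraction `pi_warm`, a hand-written reverse pass stands in for BPTT through a network, and the state derivative of the costate is taken by a central finite difference with common random numbers rather than by second-order autodiff.

```python
import math
import random

# Illustrative Merton parameters (assumed, not taken from the source).
MU, R, SIGMA, GAMMA = 0.08, 0.02, 0.2, 2.0
T, N_STEPS = 1.0, 20
DT = T / N_STEPS


def pathwise_costate(x0, pi, dw):
    """Forward rollout, then a manual reverse pass giving dU(X_T)/dx0."""
    x, step_jac = x0, []
    for z in dw:
        drift = R + pi * (MU - R) - 0.5 * (pi * SIGMA) ** 2
        growth = math.exp(drift * DT + pi * SIGMA * z)
        step_jac.append(growth)  # dX_{k+1}/dX_k for a state-independent pi
        x *= growth
    lam = x ** (-GAMMA)          # terminal condition: lambda_T = U'(X_T)
    for g in reversed(step_jac):
        lam *= g                 # chain rule applied backward in time
    return lam


def costate(x0, pi, noise):
    """Monte Carlo average of the pathwise costates."""
    return sum(pathwise_costate(x0, pi, dw) for dw in noise) / len(noise)


def recover_control(x0, pi_warm, noise, h=1e-3):
    """Plug the estimated costate into the Hamiltonian first-order condition."""
    lam = costate(x0, pi_warm, noise)
    # Same noise on both sides of the difference (common random numbers).
    lam_x = (costate(x0 + h, pi_warm, noise)
             - costate(x0 - h, pi_warm, noise)) / (2.0 * h)
    return -(MU - R) * lam / (SIGMA ** 2 * x0 * lam_x)


rng = random.Random(1)
noise = [[rng.gauss(0.0, math.sqrt(DT)) for _ in range(N_STEPS)]
         for _ in range(500)]
pi_recovered = recover_control(1.0, 0.3, noise)  # deliberately off-optimal warm-up
```

With CRRA utility and a state-independent warm-up policy, the Monte Carlo factor cancels in the ratio, so the recovered control matches the Merton fraction almost exactly even though `pi_warm = 0.3` is far from optimal. That cancellation is special to this toy case. With a state-dependent network the reverse pass would also carry the policy's own state derivative, and the recovery would inherit some estimation error.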
Technical note. Why does BPTT produce costates?
[Read the technical note: BPTT as a Pathwise Costate Solver]
Classical HJB methods provide rigorous verification, but they often suffer from the curse of dimensionality.
Deep reinforcement learning scales better, but it can lose the structural optimality conditions that make continuous-time control interpretable and reliable.
PG-DPO aims to combine three things:
the scalability of neural policies,
the structural discipline of Pontryagin’s maximum principle,
and the numerical precision of local Hamiltonian control recovery.
The central shift is simple:
Do not learn the whole value landscape first.
Learn the policy path, estimate the costate, and recover the control locally.
Many control problems are hard not because the final policy is complex, but because the intermediate structure is delicate.
Examples include:
high-dimensional portfolio choice,
hard constraints,
parameter uncertainty,
non-Markovian or delay-driven dynamics,
non-exponential discounting,
and transaction costs with no-trade regions.
In these settings, global value-function learning can be unstable or unnecessarily expensive.
PG-DPO instead uses simulated rollouts and adjoint sensitivities to enforce local optimality conditions directly.
Beyond the Merton example, the PG-DPO idea can be extended to constrained control, non-Markovian dynamics, non-exponential discounting, transaction costs, and other continuous-time decision problems.
In constrained problems, the final Hamiltonian recovery step can become a local KKT, barrier, or QP-style decoder.
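For intuition, the simplest such decoder is a box constraint on a single risky fraction. When the costate's state derivative is negative, the Merton Hamiltonian is a concave quadratic in the control, so the KKT solution is the unconstrained maximizer clipped to the bounds. The function below is a hypothetical sketch of that special case with assumed default parameters; it is not taken from the papers, and multi-asset or general constraint sets would need a proper QP or barrier solve.

```python
def decode_box(lam, lam_x, x, mu=0.08, r=0.02, sigma=0.2, lo=0.0, hi=1.0):
    """KKT maximizer of the one-asset Hamiltonian subject to lo <= pi <= hi.

    lam is the estimated costate, lam_x its derivative in wealth x.
    All default parameter values are illustrative assumptions.
    """
    if lam_x >= 0.0:
        raise ValueError("expected lam_x < 0 so the Hamiltonian is concave in pi")
    # Unconstrained first-order condition of the Hamiltonian.
    pi_free = -(mu - r) * lam / (sigma ** 2 * x * lam_x)
    # For a concave 1-D quadratic, clipping to the box satisfies the KKT conditions.
    return min(max(pi_free, lo), hi)


pi_interior = decode_box(1.0, -2.0, 1.0)        # bounds inactive
pi_capped = decode_box(1.0, -2.0, 1.0, hi=0.5)  # upper bound binds
```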
In transaction-cost problems, the same costate-to-control principle can be adapted to recover buy/hold/sell regimes and no-trade regions.
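One way to picture this is the classical proportional-cost Merton problem, where the trading regime is determined by comparing the marginal values of the risky and riskless positions. The sketch below assumes that textbook structure with a single proportional cost rate. The function name and arguments are hypothetical, and the papers' actual decoder may differ.

```python
def trading_regime(lam_cash, lam_stock, cost):
    """Classify the regime from two costates under a proportional cost rate.

    lam_cash:  marginal value of one more unit held in the riskless account
    lam_stock: marginal value of one more unit held in the risky asset
    cost:      proportional transaction cost, e.g. 0.01 for 1% (assumed symmetric)
    """
    if lam_stock > (1.0 + cost) * lam_cash:
        return "buy"   # the stock's marginal value exceeds its all-in purchase price
    if lam_stock < (1.0 - cost) * lam_cash:
        return "sell"  # net sale proceeds are worth more than holding the stock
    return "hold"      # inside the no-trade region


regimes = [trading_regime(1.0, ls, 0.01) for ls in (1.2, 1.005, 0.9)]
```

The band between the two thresholds is the no-trade region, and it widens as the cost rate grows.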
Breaking the Dimensional Barrier: A Pontryagin-Guided Direct Policy Optimization for Continuous-Time Multi-Asset Portfolio [Link]
Breaking the Dimensional Barrier for Constrained Dynamic Portfolio Choice, under revision in Mathematical Finance, 2026 [Link]
Breaking the Dimensional Barrier: Dynamic Portfolio Choice with Parameter Uncertainty via Pontryagin Projection [Link]
Beyond the Bellman Recursion: A Pontryagin-Guided Framework for Non-Exponential Discounting, accepted in International Conference on Machine Learning (ICML), 2026 [Link]
Recovering the Kinks in Delay Control: A Structure-Aware Optimal Control Solver with Pontryagin Projection
Recovering No-Trade Regions: Pontryagin-Guided Policy Projection for Transaction-Cost Control