Long-horizon manipulation is challenging: as horizons grow, errors compound and small deviations cascade, so mitigating these effects often calls for reasoning jointly about task sequencing and motion feasibility. We introduce LoHR-Bench, a dual-level, method-agnostic benchmark of 24 tasks across four suites (tool-use, active exploration, clutter, and extended long-horizon) that evaluates both high-level task reasoning and low-level motion execution. Tasks are specified in PDDL and grounded in ManiSkill3, linking symbolic descriptions to continuous execution. The benchmark provides (i) a data pipeline with synchronized rollouts that align language, symbolic plans, and low-level trajectories; (ii) a concise Gym-style interface supporting high-level-only, low-level-only, and dual-level policies; and (iii) evaluation metrics that report progress toward goal predicates, final success, and end-to-end time. Together, these components offer a focused, reproducible testbed for measuring and advancing integrated task and motion reasoning across classical TAMP planners and modern VLA systems.
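To make the data pipeline concrete, here is a minimal sketch of what one synchronized rollout record could look like. The schema and field names below are illustrative assumptions, not the released format; only the three aligned views (language, PDDL plan, low-level trajectory) come from the benchmark description.

```python
from dataclasses import dataclass
from typing import List

import numpy as np

# Hypothetical record layout for one synchronized rollout: the pipeline
# aligns a language instruction, its symbolic PDDL plan, and the
# continuous trajectory executed in ManiSkill3. All field names here are
# assumptions for illustration.
@dataclass
class AlignedRollout:
    instruction: str          # natural-language task description
    pddl_plan: List[str]      # ordered symbolic actions
    trajectory: np.ndarray    # (T, dof) low-level waypoints
    randomization_seed: int   # domain-randomization seed for reproducibility

rollout = AlignedRollout(
    instruction="Put the red cube into the drawer.",
    pddl_plan=["(open drawer)", "(pick red_cube table)", "(place red_cube drawer)"],
    trajectory=np.zeros((120, 7)),  # placeholder; real rollouts store executed motion
    randomization_seed=0,
)
```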
(Left) Four long-horizon suites designed with dual-level constraints.
(Top right) Multi-strategy datasets: high-level PDDL or language aligned with low-level trajectories under domain randomization.
(Bottom left) Gym-style API supporting multi-level agents via TaskPlanEnv and MotionPlanEnv (see the usage sketch after this caption).
(Right) Evaluation metrics: progress score, completion time, success rate.
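The dual-level interface and metrics combine as in the following usage sketch. TaskPlanEnv and MotionPlanEnv are named in the benchmark; the import path, task identifier, and info keys below are assumptions, and the reset/step signatures follow the standard Gymnasium convention rather than a confirmed LoHR-Bench API.

```python
# Hedged usage sketch of the high-level environment; a MotionPlanEnv
# episode would follow the same loop with continuous actions. The import
# path, task id, and info keys are assumptions for illustration.
from lohr_bench import TaskPlanEnv  # hypothetical import path

env = TaskPlanEnv(task="tool-use/hammer-nail")  # hypothetical task id
obs, info = env.reset(seed=0)
terminated = truncated = False
while not (terminated or truncated):
    # A real agent would map obs to a symbolic operator; sampling stands in here.
    action = env.action_space.sample()
    obs, reward, terminated, truncated, info = env.step(action)

# Metric key names assumed: progress toward goal predicates, final success,
# and end-to-end completion time.
print(info["progress_score"], info["success"], info["completion_time"])
```

A low-level-only or dual-level policy would run the same loop against MotionPlanEnv, or against both environments in tandem.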
Introduction Video
Task List
Extended Long-Horizon
Exploration
Tool-Use
Clutter
Results