Spatio-Temporal Logic (SpaTiaL) offers a principled formalism for expressing geometric spatial requirements—an essential component of robotic manipulation, where object locations, neighborhood relations, pose constraints, and interactions directly determine task success. Yet prior works have largely relied on standard temporal logic (TL), which models only robot trajectories and overlooks object-level interactions. Existing datasets built from randomly generated TL formulas paired with natural-language descriptions therefore cover temporal operators but fail to represent the layered spatial relations that manipulation tasks depend on. To address this gap, we introduce a dataset generation framework that synthesizes SpaTiaL specifications and converts them into natural-language descriptions through a deterministic, semantics-preserving back-translation procedure. This pipeline produces the NL2SpaTiaL dataset, aligning natural language with multi-level spatial relations and temporal objectives to reflect the compositional structure of manipulation tasks. Building on this foundation, we propose a translation–verification framework equipped with a language-based semantic checker that ensures the generated SpaTiaL formulas faithfully encode the semantics specified by the input description. Experiments across a suite of manipulation tasks show that SpaTiaL-based representations yield more interpretable, verifiable, and compositional grounding for instruction following.
SpaTiaL is a spatio-temporal logic for expressing geometry-aware relations and temporal constraints over objects, regions, and robot trajectories. It enables executable specification checking via robustness evaluation, supporting verification, trajectory ranking, and integration with planning or learning systems.
The right side of the overview figure narrates how a simple human instruction unfolds into a sequence of grounded SpaTiaL constraints, bridging language, logic, and physical execution within a unified framework.
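To make the robustness evaluation mentioned above concrete, the sketch below illustrates the style of quantitative semantics SpaTiaL builds on: each spatial predicate returns a real-valued margin (positive when satisfied), and conjunction and temporal operators aggregate these margins. The predicate and helper names here are illustrative assumptions, not the SpaTiaL package's actual API.

```python
import numpy as np

# Illustrative quantitative semantics: a positive robustness value means the
# relation holds, a negative value measures how badly it is violated.
# These function names are hypothetical, not the SpaTiaL package API.

def close_to(pos_a, pos_b, threshold=0.10):
    """Robustness of 'a is close to b': threshold minus Euclidean distance."""
    return threshold - np.linalg.norm(np.asarray(pos_a) - np.asarray(pos_b))

def left_of(pos_a, pos_b):
    """Robustness of 'a is left of b' along the x-axis."""
    return pos_b[0] - pos_a[0]

def conj(*values):
    """Conjunction takes the minimum robustness of its operands."""
    return min(values)

def always(signal):
    """'Always' over a trajectory: worst-case robustness across timesteps."""
    return min(signal)

# Example: check a two-step trajectory of the red object against a static blue object.
blue = (0.50, 0.20)
red_traj = [(0.30, 0.22), (0.35, 0.25)]
rho = always([conj(close_to(r, blue, 0.30), left_of(r, blue)) for r in red_traj])
print("satisfied" if rho > 0 else "violated", rho)
```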
NL2SpaTiaL formulates instruction understanding as a structured specification generation process rather than a single-step translation. Given a natural language instruction, the system first decomposes it into a Hierarchical Logical Tree (HLT), where each node represents a meaningful sub-intent aligned with a specific linguistic span. This hierarchy explicitly captures how high-level goals are refined into spatial relations, temporal constraints, and conditional requirements.
Each leaf node in the hierarchy is grounded into a SpaTiaL primitive parameterized by objects, regions, and temporal windows. Intermediate nodes define how these primitives are composed using logical and temporal operators, allowing complex behaviors to be assembled in a transparent and controllable manner.
The final SpaTiaL specification is obtained by composing all grounded subformulas according to the hierarchical structure.
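A minimal sketch of this composition step, assuming a simple node representation (the `HLTNode` dataclass and `compose` helper below are our illustration, not the paper's released code):

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class HLTNode:
    """One node of a Hierarchical Logical Tree (illustrative, not the paper's code).

    Leaves carry a SpaTiaL primitive such as Inside(obj_r, reg_s); intermediate
    nodes carry a logical/temporal operator (e.g., ALWAYS, WITHIN [a,b], SEQUENCE)
    plus the linguistic span they were aligned to.
    """
    label: str                          # predicate or operator name
    span: str = ""                      # natural-language fragment for this node
    args: Tuple[str, ...] = ()          # objects / regions the primitive refers to
    children: List["HLTNode"] = field(default_factory=list)

def compose(node: HLTNode) -> str:
    """Recursively compose grounded subformulas following the hierarchy."""
    if not node.children:                                   # leaf primitive
        return f"{node.label}({', '.join(node.args)})"
    parts = [compose(c) for c in node.children]
    if node.label == "SEQUENCE":                            # ordered sub-goals
        return " ; ".join(parts)
    return f"{node.label} ({' & '.join(parts)})"            # ALWAYS, WITHIN [a,b], ...

# Example: "always keep the red object inside the sorting region and above the blue object"
tree = HLTNode("ALWAYS", span="always keep ...", children=[
    HLTNode("Inside", span="inside the sorting region", args=("obj_r", "reg_s")),
    HLTNode("Above",  span="above the blue object",     args=("obj_r", "obj_b")),
])
print(compose(tree))    # ALWAYS (Inside(obj_r, reg_s) & Above(obj_r, obj_b))
```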
To support hierarchical instruction understanding and step-wise supervision, we construct the NL2SpaTiaL dataset, a collection of natural language instructions paired with structured SpaTiaL specifications. Rather than treating specifications as flat formulas, each data instance includes a hierarchical annotation that aligns linguistic fragments with nodes in the corresponding logical tree.
The dataset is generated by first fixing a maximal hierarchical structure and then sampling concrete spatial relations, temporal intervals, and object configurations within this structure. Logical leaves are instantiated with parameterized SpaTiaL primitives and subsequently back-translated into natural language, ensuring consistency between language and logic at every level. A large language model is then used to generate diverse paraphrases while preserving the underlying semantics.
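A rough sketch of this sampling step; the predicate pool, object identifiers, and helper names below are illustrative assumptions rather than the released generator:

```python
import random

# Hypothetical pools of spatial predicates, objects, and regions.
PREDICATES = ["Inside", "Above", "LeftOf", "FarFrom", "CloseTo"]
OBJECTS    = ["obj_r", "obj_b", "obj_g", "obj_y"]
REGIONS    = ["reg_s", "reg_t"]

def sample_leaf():
    """Instantiate one parameterized SpaTiaL primitive."""
    pred = random.choice(PREDICATES)
    target = random.choice(OBJECTS)
    other = random.choice(REGIONS) if pred == "Inside" else random.choice(
        [o for o in OBJECTS if o != target])
    return {"type": "leaf", "pred": pred, "args": (target, other)}

def sample_instance(max_branches=2, max_leaves=2):
    """Fix a maximal hierarchical structure, then fill it with sampled content."""
    branches = []
    for _ in range(random.randint(1, max_branches)):
        op = random.choice(["ALWAYS", f"WITHIN [0,{random.choice([5, 10, 15])}]"])
        leaves = [sample_leaf() for _ in range(random.randint(1, max_leaves))]
        branches.append({"type": "op", "op": op, "children": leaves})
    return {"type": "op", "op": "SEQUENCE", "children": branches}

print(sample_instance())
```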
Back-translation refers to the process of converting a structured SpaTiaL specification back into natural language in a controlled and deterministic manner. Here is an example:
A root SEQUENCE node connects two intermediate operators. The first intermediate node is WITHIN [0,10], grouping the leaves Inside(obj_r, reg_s) and Above(obj_r, obj_b). The second intermediate node is ALWAYS, grouping the leaves FarFrom(obj_r, obj_g) and LeftOf(obj_r, obj_y).
The predicate Inside(obj_r, reg_s) is translated as “the red object is inside the sorting region,” and Above(obj_r, obj_b) is translated as “the red object is above the blue object.” These two phrases are linked by the WITHIN [0,10] operator, yielding the first clause: “within the first 10 seconds, place the red object inside the sorting region while keeping it above the blue object.”
Similarly, the second clause is: “always keep it far from the green object and to the left of the yellow object.” Finally, these two clauses are composed by the SEQUENCE root operator, producing the complete instruction: “Within the first 10 seconds, place the red object inside the sorting region while keeping it above the blue object; afterward, always keep it far from the green object and to the left of the yellow object.”
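The walkthrough above can be mirrored in code. The sketch below reproduces the example deterministically from hypothetical phrase templates and an object-name lookup; it illustrates the back-translation idea rather than the actual implementation.

```python
# Hypothetical lookup tables for the worked example.
NAMES = {"obj_r": "the red object", "obj_b": "the blue object",
         "obj_g": "the green object", "obj_y": "the yellow object",
         "reg_s": "the sorting region"}

TEMPLATES = {  # one deterministic phrase per predicate
    "Inside":  "place {a} inside {b}",
    "Above":   "keeping it above {b}",
    "FarFrom": "keep it far from {b}",
    "LeftOf":  "to the left of {b}",
}

def leaf_phrase(pred, a, b):
    return TEMPLATES[pred].format(a=NAMES[a], b=NAMES[b])

def back_translate(tree):
    """Deterministically render an operator tree as a natural-language instruction."""
    op, children = tree["op"], tree["children"]
    phrases = [c if isinstance(c, str) else back_translate(c) for c in children]
    if op.startswith("WITHIN"):
        horizon = op.split(",")[1].strip("] ")
        return f"within the first {horizon} seconds, " + " while ".join(phrases)
    if op == "ALWAYS":
        return "always " + " and ".join(phrases)
    if op == "SEQUENCE":
        return "; afterward, ".join(phrases).capitalize() + "."
    raise ValueError(f"unknown operator: {op}")

example = {"op": "SEQUENCE", "children": [
    {"op": "WITHIN [0,10]", "children": [leaf_phrase("Inside", "obj_r", "reg_s"),
                                         leaf_phrase("Above", "obj_r", "obj_b")]},
    {"op": "ALWAYS", "children": [leaf_phrase("FarFrom", "obj_r", "obj_g"),
                                  leaf_phrase("LeftOf", "obj_r", "obj_y")]},
]}
print(back_translate(example))
# "Within the first 10 seconds, place the red object inside the sorting region
#  while keeping it above the blue object; afterward, always keep it far from
#  the green object and to the left of the yellow object."
```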
We evaluate NL2SpaTiaL across multiple manipulation environments, including the original SpaTiaL benchmark environments, ReKep (a suite of multi-task manipulation scenarios), and the openpi0 environment.
In particular, a vision-language-action (VLA) model can generate multiple candidate trajectories, which are then evaluated against the SpaTiaL specification to identify the most consistent and reliable execution, enabling effective selection among diverse action proposals across environments.
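A minimal sketch of this selection step, assuming a per-trajectory robustness evaluator in the spirit of the earlier sketch (the candidate trajectories here are placeholders standing in for VLA samples):

```python
import numpy as np

def spec_robustness(trajectory, blue=(0.50, 0.20)):
    """Illustrative stand-in for evaluating a SpaTiaL spec on one trajectory:
    worst-case margin of 'red stays left of blue' over all timesteps."""
    return min(blue[0] - x for x, _ in trajectory)

def select_best(candidates):
    """Rank VLA-proposed trajectories by robustness and keep the best one."""
    scores = [spec_robustness(traj) for traj in candidates]
    best = int(np.argmax(scores))
    return candidates[best], scores[best]

# Placeholder candidates standing in for trajectories sampled from a VLA policy.
candidates = [
    [(0.30, 0.22), (0.42, 0.25)],   # drifts toward the blue object
    [(0.28, 0.21), (0.31, 0.24)],   # stays comfortably to the left
]
best_traj, score = select_best(candidates)
print(f"selected trajectory with robustness margin {score:.2f}")
```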
If you find this work useful, please cite:
@article{luo2025nl2spatial,
  title   = {NL2SpaTiaL: Generating Geometric Spatio-Temporal Logic Specifications from Natural Language for Manipulation Tasks},
  author  = {Luo, Licheng and Xia, Yu and Liang, Kaier and Cai, Mingyu},
  journal = {arXiv preprint arXiv:2512.13670},
  year    = {2025},
  url     = {https://arxiv.org/abs/2512.13670},
}