Anonymous Authors
Paper
Code
Abstract: The Robotic Task Sequencing Problem (RTSP) involves determining both the order in which a robotic manipulator visits a set of targets and the specific configuration for reaching each target, accounting for multiple feasible poses arising from kinematic redundancy. Existing methods often rely on straight-line path-cost approximations, which can be highly inaccurate in cluttered environments. While pre-computing motion costs is feasible in static industrial setups with a limited number of fixed targets, it becomes prohibitively expensive in novel environments—such as inspection, cleaning, or disinfection scenarios—where the workspace may be previously unseen, yet rapid solution prediction is crucial for real-time operation. To address this, we propose RTSP-Net, a deep reinforcement learning framework that takes as input the environment represented as a point cloud, along with target points and their associated manipulator configurations, and outputs both the visiting sequence and the corresponding configuration for each target. RTSP-Net is trained across four distinct environments with varying obstacle layouts, leveraging a neural cost model to provide realistic path-cost-based rewards during training. We benchmark RTSP-Net in these simulated environments. It achieves approximately 4–32% improvement in path length and 2–29% improvement in trajectory execution time, while generating solutions roughly 40% faster than baseline methods.
RTSP-Net was evaluated in four simulated environments—Empty World, Cubby, Random Obstacle, and Tabletop—showing 4–32% shorter paths and 2–29% faster trajectory execution across varying numbers of targets. The accompanying video compares RTSP-Net with Cluster-TSP: unvisited targets appear in blue, visited targets in green. RTSP-Net achieves 26% faster execution than Cluster-TSP.
RTSP-Net
Baseline Algorithm: Cluster-TSP
A hardware demonstration was conducted on a Franka Emika robot, where the end effector was required to stop at points 6 cm above protruding cylinders on a panel. In this task, the RTSP-Net tour achieved an 8% faster execution time compared to the baseline.
RTSP-Net
Cluster-TSP
RTSP-Net Architecture
RTSP-Net encodes an RTSP instance by first representing the scene as a point cloud, which is processed through a sparse PointNet++ encoder. The resulting embedding is passed to the Target Set Encoder and Configuration Set Encoder. The Target Set Encoder uses feed-forward and cross-attention layers to jointly encode the scene and target locations, including end-effector orientations, while the Configuration Set Encoder similarly encodes candidate configurations together with the scene. Embeddings from both encoders are fused in a Joint Context Encoder, implemented as a transformer, producing a unified embedding of the RTSP instance. The Sequencing Module and Configuration Selection Module then generate the target sequence and associated configurations autoregressively: the sequencing module predicts the next target based on the global embedding and previously visited targets, and the configuration selection module predicts the corresponding configuration conditioned on the selected target and prior steps. Finally, path costs are computed using neural cost models, which serve as rewards to train RTSP-Net via REINFORCE.
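The autoregressive decoding step above can be illustrated with a minimal sketch. This is an assumption-laden stand-in, not the authors' implementation: it takes per-target embeddings (as would come out of the Joint Context Encoder) and per-target candidate-configuration embeddings, then greedily selects the next target by dot-product attention against a running context vector and picks that target's configuration the same way. The function name `decode_tour`, the greedy (rather than sampled) selection, and the simple additive context update are all hypothetical simplifications.

```python
import numpy as np

def decode_tour(joint_emb, config_emb):
    """Greedy autoregressive decoding sketch (hypothetical, simplified).

    joint_emb:  (T, d) array, one embedding per target from the joint encoder.
    config_emb: (T, K, d) array, K candidate-configuration embeddings per target.
    Returns (tour, configs): visiting order and the chosen configuration index
    for each visited target.
    """
    T, d = joint_emb.shape
    visited = np.zeros(T, dtype=bool)
    context = joint_emb.mean(axis=0)            # global context over all targets
    tour, configs = [], []
    for _ in range(T):
        scores = joint_emb @ context            # (T,) attention logits per target
        scores[visited] = -np.inf               # mask already-visited targets
        nxt = int(scores.argmax())              # next target (greedy)
        visited[nxt] = True
        cfg_scores = config_emb[nxt] @ context  # (K,) logits over configurations
        cfg = int(cfg_scores.argmax())          # configuration for this target
        context = context + joint_emb[nxt]      # fold the visited target into context
        tour.append(nxt)
        configs.append(cfg)
    return tour, configs
```

During training, the logits would instead define a probability distribution to sample from, and the neural-cost-model reward of the sampled tour would weight the log-probabilities in the REINFORCE gradient; the greedy argmax here corresponds to inference-time decoding.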