Abstract
The Entrance Dependent Vehicle Routing Problem (EDVRP) is a variant of the Vehicle Routing Problem (VRP) in which the scale of cities influences routing outcomes, necessitating consideration of their entrances. This paper addresses the EDVRP in agriculture, focusing on multi-parameter vehicle planning for irregularly shaped fields. Traditional methods such as heuristic approaches often overlook field geometry and entrance constraints; to address these limitations, we propose a Joint Probability Distribution Sampling Neural Network (JPDS-NN) that solves the EDVRP effectively. The network uses an encoder-decoder architecture with graph transformers and attention mechanisms to model routing as a Markov Decision Process (MDP), and is trained via reinforcement learning for efficient, rapid, end-to-end planning. Experimental results indicate that JPDS-NN reduces travel distance by 48.4%–65.4%, lowers fuel consumption by 14.0%–17.6%, and computes two orders of magnitude faster than baseline methods, while performing 15%–25% better in dynamic arrangement scenarios. Ablation studies validate the necessity of cross-attention and pre-training. The framework enables scalable, intelligent routing for large-scale farming under dynamic constraints.
Contributions
Designing an encoder architecture based on graph transformers and attention mechanisms, which facilitates rapid end-to-end task planning while effectively utilizing farm-specific information.
Developing an actor network that samples actions from the joint probability distribution of working lines and their entrances, enabling finer-grained task subdivision and more precise agricultural vehicle operations (see the sampling sketch after this list).
Implementing a simulator to visualize task allocation outcomes and designing dynamic arrangement tasks to validate the practicality and advancement of our method in real-world scenarios.
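To make the joint-sampling idea in the second contribution concrete, here is a minimal, hedged PyTorch sketch (function and argument names are our assumptions, not the paper's code): the actor scores every (working line, entrance) pair, and a single softmax over the flattened grid yields the joint distribution to sample from.

```python
import torch

def sample_line_and_entrance(logits, visited_mask):
    """Sample a (working line, entrance) pair from their joint distribution.

    logits:       (num_lines, num_entrances) unnormalized scores from the actor.
    visited_mask: (num_lines,) boolean, True for lines already served.
    """
    num_lines, num_entrances = logits.shape
    # Mask out every entrance of already-served lines before normalizing.
    masked = logits.masked_fill(visited_mask.unsqueeze(-1), float("-inf"))
    # One softmax over the flattened grid gives the joint p(line, entrance),
    # rather than factorizing it as p(line) * p(entrance | line).
    joint = torch.softmax(masked.reshape(-1), dim=-1)
    idx = torch.multinomial(joint, num_samples=1).item()
    return idx // num_entrances, idx % num_entrances  # (line, entrance)
```

Sampling the pair jointly lets the network trade off line choice against entrance choice in one decision, which is the finer granularity the contribution refers to.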
Network Structure
The input encoder handles the task graph and vehicle features. The decoder, comprising a sequence encoder, an actor network, and a critic network, produces the route sequence step by step. In the MDP, the environment comprises the inputs, the input encoder, and the sequence encoder, with the actor network acting as the agent. The input encoder extracts high-dimensional features from the inputs, and the sequence encoder processes the features of generated actions. At each step, the actor network selects an action based on the current state, determining the next node and its entrance.
Detailed structure of our networks. The network consists of an input encoder, a generated-sequence encoder, an actor network, and a critic network. The input encoder extracts high-dimensional features from the inputs, while the generated-sequence encoder processes features from the generated actions. At each step, the actor network selects an action based on the state, which specifies the next node and its entrance.
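The following minimal sketch illustrates one decoding step under this structure. Module choices and shapes are our assumptions (e.g., a GRU stands in for the generated-sequence encoder), not a reproduction of the paper's implementation:

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Autoregressive decoder: sequence encoder plus actor and critic heads."""

    def __init__(self, d_model, num_lines, num_entrances):
        super().__init__()
        self.seq_rnn = nn.GRU(d_model, d_model, batch_first=True)  # sequence encoder
        self.actor = nn.Linear(d_model, num_lines * num_entrances)  # joint logits
        self.critic = nn.Linear(d_model, 1)                         # state value

    def step(self, node_feats, action_seq):
        # node_feats: (1, num_nodes, d_model) from the input encoder (graph transformer).
        # action_seq: (1, t, d_model) embeddings of actions taken so far
        # (a learned start token can seed the sequence at t = 0).
        _, h = self.seq_rnn(action_seq)          # summarize the generated sequence
        state = node_feats.mean(dim=1) + h[-1]   # fuse graph and sequence features
        return self.actor(state), self.critic(state)
```

The actor's output can be reshaped to (num_lines, num_entrances) and fed to the joint-sampling routine sketched earlier, closing the loop of one MDP step.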
Training Results
The training curves of JPDS-NN under four random seeds demonstrate the convergence and effectiveness of JPDS-NN. In the early stages, the optimization of distance, time, and fuel consumption aligns, but as training progresses, these objectives may diverge or conflict. In the later stages, as the algorithm yields better allocations, the optimization directions converge again.
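This alignment-then-conflict behaviour is what one would expect from a scalarized multi-objective reward. As an illustrative assumption only (the weights and functional form below are placeholders, not the paper's reward), such a signal could look like:

```python
def reward(d_dist, d_time, d_fuel, w=(1.0, 0.5, 0.5)):
    """Negative weighted sum of per-step distance, time, and fuel increments.

    Early in training all three terms shrink together; later, reducing one
    term may require increasing another, so their gradients can conflict.
    """
    return -(w[0] * d_dist + w[1] * d_time + w[2] * d_fuel)
```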
The results of JPDS-NN and the baseline methods (RA and OGA) on the test set. All JPDS-NN variants significantly outperform the genetic algorithm in travel distance and fuel consumption optimization. In addition, the JPDS-NN methods excel in computational efficiency.
Dynamic Arrangement
The results of JPDS-NN and the baseline methods on dynamic arrangement tasks, including field increase and vehicle decrease. JPDS-NN outperforms OGA across all metrics for the three optimization objectives, indicating the applicability of our method in real-world scenarios.
Simulation results of the field increase task. In the first phase, all four vehicles depart from the depot and operate on Plot 0 (top-left) and Plot 3 (bottom-right). At 50% of the operation time, Veh 1 and Veh 3 have returned to the depot, Veh 2 is on a transfer path, and Veh 4 is within a working line. In the second phase, the remaining plots are added, and the vehicles commence operations from their respective starting points, eventually returning to the depot.
Simulation results of the vehicle decrease task. In the first phase, all vehicles depart from the depot and operate across the entire field. At 25% of the operation time, Veh 2 and Veh 3 are within working lines, while the remaining vehicles have returned to the depot. In the second phase, only Veh 2 and Veh 4 are retained for task execution. After completing the remaining operations, Veh 3 returns directly to the garage, while Veh 4 remains at the depot.
Ablation Study
The network without pre-training converges as quickly as, or even faster than, the pre-trained network in the early training stages. However, as training progresses, the pre-trained network fluctuates less and achieves shorter distances on the validation set.
When each field has only one depot, the Vehicle-Field Cross Attention block in the encoder is effectively removed. The results show that the cross-attention mechanism not only enables the network to encode the start and end points of vehicles in the task graph, but also significantly improves the network's performance.
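For reference, the following is a minimal sketch of the kind of Vehicle-Field Cross Attention this ablation removes (module name, dimensions, and layer choices are our assumptions): vehicle embeddings act as queries over field embeddings, so each vehicle's representation absorbs its start/end-point context.

```python
import torch.nn as nn

class VehicleFieldCrossAttention(nn.Module):
    """Vehicles (queries) attend to field nodes (keys/values)."""

    def __init__(self, d_model, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, veh_feats, field_feats):
        # veh_feats:   (batch, num_vehicles, d_model)
        # field_feats: (batch, num_fields, d_model)
        ctx, _ = self.attn(query=veh_feats, key=field_feats, value=field_feats)
        return self.norm(veh_feats + ctx)  # residual + norm, transformer-style
```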