From Words to Routes: Applying Large Language Models for Vehicle Routing Problems


Zhehui Huang, Guangyao Shi, Gaurav Sukhatme

Abstract

Large language models (LLMs) have shown impressive progress in recent years on robotics tasks (e.g., manipulation and navigation) specified through natural language task descriptions. Their success on these tasks leads us to ask: how well can LLMs solve vehicle routing problems (VRPs), a class of problems widely used to model robotic decision-making, when the tasks are given as natural language descriptions? In this work, we study this question in three steps. First, we construct a dataset of 21 single- and multi-vehicle routing problem variants. Second, we evaluate the performance of LLMs across four basic prompt paradigms for text-to-code generation, each involving a different type of text input: 1) generating code directly from the natural language task description, 2) translating the task description into a mathematical formulation and then generating the corresponding code, 3) generating code from the task description using a specified external solver or library (e.g., Gurobi), and 4) translating the task description into a mathematical formulation and generating code that uses a specified solver or library (e.g., Gurobi). Paradigm 1) performs best for GPT-4, achieving 56% feasibility, 40% optimality, and 53% efficiency. Third, based on the observations from the second step, we propose a framework that enables LLMs to refine solutions through self-reflection, including self-debugging and self-verification. With GPT-4, our proposed framework achieves a 16% increase in feasibility, a 7% increase in optimality, and a 15% increase in efficiency. Moreover, we study the sensitivity of GPT-4 to input descriptions, i.e., how GPT-4 reacts when some details are removed from the input description. After removing these details, the performance of GPT-4 decreases by 4% in feasibility, 4% in optimality, and 5% in efficiency.
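To make the solver-based paradigms 3) and 4) concrete, the sketch below shows the kind of code an LLM would be asked to produce for a small single-vehicle instance: a traveling salesman problem written as an integer program with Gurobi (gurobipy) using Miller-Tucker-Zemlin subtour-elimination constraints. The city coordinates are made-up illustration data, not taken from our dataset, and this is only a minimal example of the target output format, not the paper's generated code.

```python
# Minimal TSP via gurobipy with MTZ subtour-elimination constraints.
# Illustrative only: coordinates and model structure are assumptions.
import math
import gurobipy as gp
from gurobipy import GRB

coords = [(0, 0), (1, 3), (4, 3), (5, 0), (2, 1)]   # depot (node 0) + 4 cities
n = len(coords)
dist = {(i, j): math.dist(coords[i], coords[j])
        for i in range(n) for j in range(n) if i != j}
arcs, cost = gp.multidict(dist)

m = gp.Model("tsp_mtz")
x = m.addVars(arcs, vtype=GRB.BINARY, name="x")      # x[i, j] = 1 if arc i -> j is used
u = m.addVars(range(1, n), lb=1, ub=n - 1, name="u")  # MTZ ordering variables

m.setObjective(x.prod(cost), GRB.MINIMIZE)
m.addConstrs(x.sum(i, "*") == 1 for i in range(n))   # leave each node exactly once
m.addConstrs(x.sum("*", j) == 1 for j in range(n))   # enter each node exactly once
m.addConstrs(u[i] - u[j] + (n - 1) * x[i, j] <= n - 2  # forbid subtours
             for i in range(1, n) for j in range(1, n) if i != j)

m.optimize()
tour = [(i, j) for i, j in arcs if x[i, j].X > 0.5]
print("optimal arcs:", tour, "tour length:", m.ObjVal)
```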

Framework

Ten vehicle routing problems for a single vehicle

Eleven vehicle routing problems for multiple vehicles

The performance of GPT-4 in the four basic prompt paradigms, without our proposed framework.

The performance of GPT-4 in the four basic prompt paradigms, with our proposed framework.

The relationship between performance and the number of reflections. In the legend, 0 means no reflection, 3 means the maximum number of reflections is 3, and 6 means the maximum number of reflections is 6. Each task is evaluated ten times separately.
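The sketch below outlines the self-reflection loop (self-debugging plus self-verification) at the level described above. All helpers passed in (generate_code, run_code, verify_solution) are hypothetical placeholders rather than the paper's actual implementation, and max_reflections mirrors the 0 / 3 / 6 settings in the figure.

```python
# A minimal sketch of an LLM self-reflection loop for VRP code generation.
# The callables are hypothetical stand-ins, not the paper's real interfaces.
from typing import Callable, Tuple

def solve_with_reflection(
    task_description: str,
    generate_code: Callable[[str], str],                 # LLM call: prompt -> code
    run_code: Callable[[str], Tuple[bool, str]],         # execute code -> (success, output/error)
    verify_solution: Callable[[str], Tuple[bool, str]],  # check the route -> (valid, report)
    max_reflections: int = 3,
) -> str:
    prompt = task_description
    code = generate_code(prompt)
    for _ in range(max_reflections):
        ok, output = run_code(code)
        if not ok:  # self-debugging: feed the runtime error back to the model
            prompt += f"\nThe generated code failed with:\n{output}\nPlease fix it."
            code = generate_code(prompt)
            continue
        valid, report = verify_solution(output)
        if valid:   # self-verification: stop once the solution satisfies the constraints
            break
        prompt += f"\nThe solution violates the task constraints:\n{report}\nPlease revise the code."
        code = generate_code(prompt)
    return code
```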

Comparing the performance of GPT-4 on task descriptions with and without additional details that further explain the tasks.

Mechanism overview: ask users for help

We use this mechanism to identify whether a task description is thorough; when it is not, GPT-4 asks the user for the missing details.
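The sketch below illustrates one way such a mechanism could be wired up. The llm and ask_user callables and the prompt wording are assumptions for illustration; the actual interface is not specified on this page.

```python
# A minimal sketch of an "ask users for help" step: check whether the task
# description is thorough and, if not, forward clarifying questions to the user.
# All callables and prompt text are hypothetical placeholders.
from typing import Callable

def complete_task_description(
    task_description: str,
    llm: Callable[[str], str],       # LLM call: prompt -> reply
    ask_user: Callable[[str], str],  # show questions to the user, return their answers
) -> str:
    check_prompt = (
        "Is the following routing task description thorough enough to write code for it? "
        "Reply 'COMPLETE' if yes; otherwise list the missing details as questions.\n\n"
        + task_description
    )
    reply = llm(check_prompt)
    if reply.strip().upper().startswith("COMPLETE"):
        return task_description              # description is already thorough
    answers = ask_user(reply)                # user supplies the missing details
    return task_description + "\n\nClarifications from the user:\n" + answers
```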

Citation

Coming soon!