Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers

Aleksandar Krnjaic, Raul D. Steleac, Jonathan D. Thomas, Georgios Papoudakis, Lukas Schäfer,
Andrew To, Kuan-Ho Lao, Murat Cubuktepe, Matthew Haley, Peter Börsting, Stefano V. Albrecht

Affiliations: Dematic, The University of Edinburgh

ArXiv link: Scalable Multi-Agent Reinforcement Learning for Warehouse Logistics with Robotic and Human Co-Workers

Overview

This report presents results for a general-purpose and scalable MARL solution to the order-picking problem in a warehouse environment. We tackle the order-picking problem with worker agents of two distinct classes, which we call Pickers and Automated Guided Vehicles (AGVs). Pickers can be human or robotic workers, whose responsibilities include taking items off shelves and placing them onto AGVs. AGVs are load-carrying vehicles that are responsible for consolidating picked items and carrying them to the next downstream operation (typically a packing station). The high-level objective is to improve picking efficiency by increasing order throughput, measured in order-lines per hour. An order is made up of several order-lines, each of which specifies a required item and quantity.

We present a set of new MARL algorithms called Hierarchical Independent Actor-Critic (HIAC), Hierarchical Shared Network Actor-Critic (HSNAC), and Hierarchical Shared Experience Actor-Critic (HSEAC), which improve the convergence rate over non-hierarchical MARL approaches by decomposing the large action space via a multi-layer hierarchy. We contrast the HIAC, HSNAC and HSEAC algorithms with the non-hierarchical MARL approaches Independent Actor-Critic (IAC), Shared Network Actor-Critic (SNAC) and Shared Experience Actor-Critic (SEAC), and with the two industry heuristics Pick Don't Move (PDM) and Follow Me (FM).
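To give a feel for why decomposing a large action space can help, the sketch below compares the number of options a flat policy must choose between with the number faced at each level of a two-level hierarchy. It is only an illustration under stated assumptions: the split into "zones" (for example, one zone per aisle), the function names, and the two-level structure are hypothetical and are not taken from the algorithms described here.

```python
def flat_choice_count(num_locations):
    """A flat policy selects directly among all item locations."""
    return num_locations


def hierarchical_choice_counts(num_locations, num_zones):
    """Assumed two-level split: pick a zone first, then a location inside it.

    Returns (options at the upper level, options at the lower level).
    """
    per_zone = -(-num_locations // num_zones)  # ceiling division
    return num_zones, per_zone


def decode(zone, offset, per_zone):
    """Map a (zone, offset) pair back to a flat location index."""
    return zone * per_zone + offset


# Large warehouse figures from the specification: 1276 locations, 22 aisles.
# Treating each aisle as a zone is an assumption made for this example only.
upper, lower = hierarchical_choice_counts(1276, 22)
```

With these numbers, each decision is taken over 22 or 58 options rather than 1276, while every flat location remains reachable through `decode`.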

The hierarchical algorithms outperform the non-hierarchical baselines as well as the well-established industry heuristics while being generally applicable to varying warehouse specifications and operating contexts.

Warehouse Specification

We consider four different warehouse configurations representing real-world scenarios.

Small

Item Locations: 200
Aisles: 2
Pickers: 4
AGVs: 8
Average order-lines per order: 5
Number of orders: 80

This configuration is known as a fast-pick area. It is typically used for frequently picked items, which are kept separate from the larger general storage to allow faster access.

Medium

Item Locations: 400
Aisles: 10
Pickers: 6
AGVs: 12
Average order-lines per order: 5
Number of orders: 80

This is a typical warehouse configuration with passage between aisles possible from the top and bottom of the aisle, as well as through a corridor inside the rack structure.

Large

Item Locations: 1276
Aisles: 22
Pickers: 8
AGVs: 16
Average order-lines per order: 5
Number of orders: 80

This is a larger version of the medium warehouse. This warehouse is directly derived from a real Dematic customer site.

Disjoint

Item Locations: 1392
Aisles: 12 + 12
Pickers: 4
AGVs: 16
Average order-lines per order: 2
Number of orders: 80

This configuration is a warehouse separated into two areas connected by a single corridor. This occurs when separation of items is required, for example in refrigeration systems (e.g. cold storage separated from regular storage), or when hanging goods such as clothing are separated from general merchandise.

Environment Renders

We use videos generated through the game engine to qualitatively assess agent behaviours whilst they are carrying out the warehouse operation. Videos for the MARL approaches and the two heuristic approaches can be viewed below.

Heuristic - PDM

MARL - IAC

MARL - SNAC

MARL - SEAC

Heuristic - FM

MARL - HIAC

MARL - HSNAC

MARL - HSEAC

Results

We consider multiple metrics to evaluate training performance in this section.

All algorithms are trained with 5 random seeds. Training curves show the inter-quartile mean, and the shaded areas show 95% stratified bootstrap confidence intervals computed using 100 bootstrap replications. Average smoothing is applied over 40 episodes.
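The aggregation described above can be sketched roughly as follows. This is a minimal, standard-library illustration and not the plotting code used for the figures: the function names are hypothetical, the inter-quartile mean is computed as a simple trimmed mean that drops floor(n/4) values from each tail, and the bootstrap resamples seeds within a single stratum (one algorithm on one warehouse configuration) rather than across several strata.

```python
import random


def interquartile_mean(values):
    """Mean of the middle values after dropping floor(n/4) from each tail."""
    ordered = sorted(values)
    n = len(ordered)
    cut = n // 4
    middle = ordered[cut:n - cut]
    return sum(middle) / len(middle)


def bootstrap_ci(values, statistic, reps=100, level=0.95, seed=0):
    """Percentile bootstrap interval for `statistic` over per-seed values."""
    rng = random.Random(seed)
    n = len(values)
    estimates = sorted(
        statistic([values[rng.randrange(n)] for _ in range(n)])
        for _ in range(reps)
    )
    lo_idx = int((1 - level) / 2 * (reps - 1))
    hi_idx = int((1 + level) / 2 * (reps - 1))
    return estimates[lo_idx], estimates[hi_idx]


def moving_average(series, window=40):
    """Trailing average over up to `window` most recent episodes."""
    out = []
    running = 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]
        out.append(running / min(i + 1, window))
    return out


# Hypothetical per-seed scores at one training step (5 seeds).
scores = [1.0, 2.0, 3.0, 4.0, 100.0]
iqm = interquartile_mean(scores)  # the outlier 100.0 is trimmed away
low, high = bootstrap_ci(scores, interquartile_mean)
```

In practice the three steps would be applied at every point along the training curve, with the smoothing applied per seed before aggregation or to the aggregated curve; the source does not say which.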

Order-lines per Hour

A metric commonly used in the warehousing domain is order-lines per hour. An order-line is defined as a tuple of (item ID, item quantity). A better-performing algorithm achieves a higher number of order-lines per hour, because it picks each item in its required quantity more quickly.
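As a small worked example of this metric, the sketch below counts order-lines in a set of completed orders and converts the count to an hourly rate. The function names, the item IDs, and the assumption that elapsed time is available in simulated seconds are all hypothetical; the source does not specify how engine steps map to time.

```python
def count_order_lines(orders):
    """Count order-lines; each order is a list of (item_id, quantity) tuples."""
    return sum(len(order) for order in orders)


def order_lines_per_hour(completed_order_lines, elapsed_seconds):
    """Convert a number of completed order-lines into an hourly rate."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return completed_order_lines * 3600.0 / elapsed_seconds


# Hypothetical episode: 2 orders, 3 order-lines in total, done in 90 seconds.
orders = [[("item-17", 2), ("item-42", 1)], [("item-08", 5)]]
rate = order_lines_per_hour(count_order_lines(orders), elapsed_seconds=90)
```

Note that the quantity within an order-line does not change the count: one order-line is completed once its item has been picked in the required quantity.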

Episode Length

This metric is a measure of how many steps elapse in the game engine. A decrease in Episode Length implies that the task of completing all orders in the system has been carried out more efficiently.

Average AGV pick rate

This metric is a measure of the pick rate (in order-lines per hour) of AGVs, averaged across all AGVs. This metric provides a more fine-grained view of the AGV influence on overall pick rate without pickers considered.

A higher pick rate implies the AGVs are attaining picks more frequently, independent of pickers.

Average picker pick rate

This metric is a measure of the pick rate (in order-lines per hour) of pickers, averaged across all pickers. This metric provides a more fine-grained view of the picker influence on overall pick rate without AGVs considered.

A higher pick rate implies the pickers are attaining picks more frequently, independent of AGVs.

Average AGV returns

This metric is a measure of the returns AGVs are able to attain throughout training, averaged across all AGVs.

Average picker returns

This metric is a measure of the returns pickers are able to attain throughout training, averaged across all pickers.

Average AGV travel distance

This metric is a measure of the distance an AGV travels within each episode, measured in metres. A decrease implies AGVs are learning to travel less distance.
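One way such a distance could be computed is sketched below, assuming each agent's position is logged as (x, y) coordinates in metres at every step. The logging format and function names are assumptions made for illustration, and straight-line distance between consecutive positions is used for simplicity.

```python
import math


def travel_distance(positions):
    """Sum of straight-line distances between consecutive logged positions."""
    return sum(math.dist(a, b) for a, b in zip(positions, positions[1:]))


def average_travel_distance(trajectories):
    """Mean per-episode travel distance across all agents of one class."""
    return sum(travel_distance(t) for t in trajectories) / len(trajectories)


# Hypothetical trajectories for two AGVs within one episode.
agv_a = [(0, 0), (3, 4), (3, 4), (6, 8)]   # 5 m, then waits, then 5 m
agv_b = [(0, 0), (0, 2)]                   # 2 m
mean_distance = average_travel_distance([agv_a, agv_b])
```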

Note that since the AGV orders are parameterised by an average number of order-lines and order-lines are uniformly distributed throughout the warehouse, the distance travelled between episodes should be similar for a fixed policy.

It is worth noting that distance travelled reduces as training progresses, even though distance is not directly optimized for (i.e. there is no reward related to distance travelled).

Average picker travel distance

This metric is a measure of the distance a picker travels within each episode, measured in metres. A decrease implies pickers are learning to travel less distance.

It is worth noting that distance travelled reduces as training progresses, even though distance is not directly optimized for (i.e. there is no reward related to distance travelled).

Average AGV idle time

This metric is a measure of the time an AGV is waiting (i.e. not moving) within an episode, measured in seconds. A decrease implies AGVs are moving around more.
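A minimal sketch of how idle time might be derived from logged positions is given below. It assumes a fixed simulation timestep and treats any step on which the position does not change as idle; both the timestep value and the function name are hypothetical.

```python
def idle_time(positions, step_seconds):
    """Seconds spent not moving, given one logged position per step."""
    idle_steps = sum(1 for a, b in zip(positions, positions[1:]) if a == b)
    return idle_steps * step_seconds


# Hypothetical AGV trajectory: waits one step, moves, then waits two steps.
trajectory = [(0, 0), (0, 0), (1, 0), (1, 0), (1, 0)]
seconds_idle = idle_time(trajectory, step_seconds=0.5)
```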

Direct interpretation of this metric is difficult, as both higher and lower idle times can reflect either efficient or inefficient behaviour. For instance, a higher idle time may in some cases be a good thing, as it means that agents are not travelling needlessly (i.e. they are more efficient in movement). However, it may also mean that the agent is being underutilized.

Conversely, a lower idle time may be good, as it means the agent is less underutilized, but it may also mean the agent is travelling needlessly.

Average picker idle time

This metric is a measure of the time a picker is waiting (i.e. not moving) within an episode, measured in seconds. A decrease implies pickers are moving around more.

The interpretation of this metric is difficult in the same way that has been outlined in the AGV section above.

Worker Analysis

A major distinction between PDM and FM is how they utilise pickers. From the viewpoint of the picker, PDM assigns pickers explicitly to zones, whereas within FM pickers are allowed to travel more freely. Within this section, we investigate how HSNAC and SNAC utilise pickers. To do this, we go through the following process for each run:

Individually collect each picker's order-line completion locations.
Aggregate and normalise data for each picker.
Measure picker-to-picker cosine similarity.
Plot heatmap.
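The steps above can be sketched as follows. This is an illustrative, standard-library version under stated assumptions: completion locations are represented as integer location indices, "normalise" is taken to mean dividing each picker's counts by that picker's total, and all function names are hypothetical. The final plotting step is omitted; the resulting matrix could be passed to any heatmap routine.

```python
import math
from collections import Counter


def pick_distribution(pick_locations, num_locations):
    """Normalised histogram of where one picker completed order-lines."""
    counts = Counter(pick_locations)
    total = len(pick_locations)
    return [counts[loc] / total if total else 0.0
            for loc in range(num_locations)]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similarity_matrix(picks_by_picker, num_locations):
    """Picker-to-picker cosine similarities, ready to plot as a heatmap."""
    vectors = [pick_distribution(p, num_locations) for p in picks_by_picker]
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]


# Hypothetical run with 3 pickers and 4 item locations.
picks = [[0, 0, 1], [0, 1, 1], [2, 3]]
matrix = similarity_matrix(picks, num_locations=4)
```

In this toy example the first two pickers work the same locations and are highly similar, while the third shares no locations with them and has similarity zero, which is the kind of block structure that zoning would produce.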

Shown below are the heatmaps for PDM and FM in the Large warehouse configuration. These demonstrate the zoning of PDM and the freedom of FM, which has approximately uniform cosine similarities between all pickers.

The equivalent figures are shown below for four seeds of IAC, SNAC, SEAC, HIAC, HSNAC and HSEAC in the Large warehouse configuration. In the figures for the hierarchical algorithms, we observe picker correlations which suggest that a degree of zoning is emerging. Each picker seems to overlap significantly with one or two other pickers, implying that they share the same area of the warehouse. This structure does not emerge for SNAC, and emerges only to a minor degree for IAC and SEAC.