Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai
Correspondence: ken.m.lee@mail.mcgill.ca
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), Paphos, Cyprus.
In the layout optimization literature, customer trajectories (i.e., movement data) are commonly used to optimize store layouts, but they are costly to obtain.
Researchers therefore often resort to heuristics like the Travelling Salesman Path (TSP) and Probabilistic Nearest Neighbour (PNN) to approximate customer trajectories.
However, research has found that shortest paths diverge from real customer paths by ~28%, limiting their usefulness.
We introduce an agent-based, maximum-entropy reinforcement learning (RL) framework that models customer behaviour with bounded rationality.
We show that RL-generated trajectories match actual customer trajectories significantly better than TSP/PNN, improving estimates of impulse purchases and shelf traffic. Moreover, only RL produces product repositioning decisions that align with those supported by real data, yielding similar predicted profit gains.
Our results suggest that RL provides a practical alternative, bridging the gap between simplistic heuristics and expensive real-world data collection.
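To make the bounded-rationality idea concrete, the sketch below shows the Boltzmann (softmax) policy that commonly arises in maximum-entropy RL: actions with higher value are more likely, but suboptimal actions keep non-zero probability. It is an illustration only, not the paper's implementation. The function names, the Q-values, and the temperature settings are all hypothetical.

```python
import math

def soft_policy(q_values, temperature):
    """Boltzmann (softmax) distribution over actions given their Q-values.

    A low temperature approaches greedy (fully rational) behaviour;
    a high temperature approaches uniform random behaviour.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Subtract the maximum before exponentiating for numerical stability.
    q_max = max(q_values)
    weights = [math.exp((q - q_max) / temperature) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical action values for three candidate moves.
q = [1.0, 0.5, 0.0]
near_greedy = soft_policy(q, temperature=0.05)
noisy = soft_policy(q, temperature=5.0)
```

Under this parameterisation, the temperature acts as a single knob between shortest-path-like behaviour and wandering, which is one way to read "bounded rationality" here.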
While the paper presents trajectory insights through static heatmaps, shopping is a fundamentally human experience. To bridge that gap, we built a fully interactive 3D digital twin of the convenience store, accurately mirroring every shelf and product placement.
Navigate and interact with products in the store
Replay actual human trajectories
Switch between first-person and top-down view
Change heatmap colors
View heatmaps of actual customers and different algorithms
Hide shelves to recover the 2D store used in the paper
Note: for the best experience, please run the simulation on itch.io.
Data provided:
Raw point coordinates of customer paths
Coordinates of shelves (green) and walls (blue)
Initial discretization scheme: Grey represents walls, and purple represents store shelves
Placement of products in the simulator was determined based on physical store visits and trajectory data.
The animated red dot shows the actual human trajectory. The coarse grid kept RL training time low, but this initial discretization was not fine enough: human trajectories overlapped with shelves in the discretized store (shown in purple).
Final discretization scheme:
Each grid cell corresponds to a 50×50 cm area in the store
Coloured circles represent different product categories
Dollar signs represent checkout points
The red triangle represents the RL agent’s initial position and orientation
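As a rough illustration of the 50×50 cm grid described above, the following sketch maps raw trajectory points to grid cells and collapses consecutive points that fall in the same cell. It assumes the raw coordinates are in metres with the origin at one corner of the store. The released data may use a different unit or origin, and the function names are hypothetical.

```python
CELL_SIZE_M = 0.5  # 50 cm cells, as in the final discretization scheme

def to_cell(x, y, cell_size=CELL_SIZE_M):
    """Map a raw (x, y) point in metres to an integer (column, row) cell."""
    return (int(x // cell_size), int(y // cell_size))

def discretize(points, cell_size=CELL_SIZE_M):
    """Convert a raw trajectory into a sequence of visited grid cells.

    Consecutive points that land in the same cell are merged, so the
    result records cell-to-cell moves rather than every raw sample.
    """
    cells = []
    for x, y in points:
        cell = to_cell(x, y, cell_size)
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells

# Hypothetical raw samples along a short walk.
raw = [(0.1, 0.1), (0.4, 0.2), (0.6, 0.2), (1.2, 0.7)]
path = discretize(raw)
```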
The heatmap overlay shows the final positions of customers in the store, suggesting that the data were collected from two distinct checkout points.
Heatmap overlay of all unprocessed (discretized) trajectories, showing a good balance between discretization resolution and store complexity.
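Both heatmaps reduce to counting cells. The sketch below is one simple way to compute them from discretized trajectories, assuming each trajectory is a list of (column, row) cells. The function names and the toy trajectories are hypothetical, and the counting convention (every visit counts once) is an assumption rather than the paper's stated choice.

```python
from collections import Counter

def visit_counts(trajectories):
    """Count how many times each cell is visited across all trajectories."""
    counts = Counter()
    for trajectory in trajectories:
        counts.update(trajectory)
    return counts

def endpoint_counts(trajectories):
    """Count the final cell of each non-empty trajectory."""
    return Counter(trajectory[-1] for trajectory in trajectories if trajectory)

# Three toy trajectories that start at the same entrance cell.
trajectories = [
    [(0, 0), (1, 0), (2, 0)],
    [(0, 0), (0, 1), (2, 0)],
    [(0, 0), (1, 0), (1, 1)],
]
visits = visit_counts(trajectories)
endpoints = endpoint_counts(trajectories)
```

In a table like `endpoints`, clusters of high counts would correspond to the checkout points visible in the final-position heatmap.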
The animated red dot in the video shows the actual human trajectory, while the animated green cell shows its discretized version.
The raw trajectory does not overlap with the shelves in the simulated store, allowing the discretized trajectory to closely match the original.
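The effect of grid resolution on shelf overlap can be sketched as follows. A cell is treated as a shelf cell if any shelf rectangle touches it, and a trajectory point "overlaps" when it lands in such a cell. The shelf rectangle, the aisle points, and the 1 m coarse cell size are hypothetical values chosen for illustration and are not taken from the store data.

```python
import math

def shelf_cells(shelves, cell_size):
    """Return the set of cells touched by axis-aligned shelf rectangles.

    Each shelf is (x0, y0, x1, y1) in metres.
    """
    cells = set()
    for x0, y0, x1, y1 in shelves:
        for cx in range(math.floor(x0 / cell_size), math.ceil(x1 / cell_size)):
            for cy in range(math.floor(y0 / cell_size), math.ceil(y1 / cell_size)):
                cells.add((cx, cy))
    return cells

def overlap_fraction(points, shelves, cell_size):
    """Fraction of raw points whose grid cell is marked as a shelf cell."""
    if not points:
        return 0.0
    blocked = shelf_cells(shelves, cell_size)
    hits = sum(
        (math.floor(x / cell_size), math.floor(y / cell_size)) in blocked
        for x, y in points
    )
    return hits / len(points)

# A shelf 0.5 m wide, and a customer walking down the aisle right beside it.
shelves = [(1.0, 0.0, 1.5, 2.0)]
aisle_walk = [(1.75, 0.25), (1.75, 0.75), (1.75, 1.25)]

coarse = overlap_fraction(aisle_walk, shelves, cell_size=1.0)
fine = overlap_fraction(aisle_walk, shelves, cell_size=0.5)
```

With the coarse grid, the shelf and the adjacent aisle share the same cells, so every point is flagged as overlapping. With 0.5 m cells they separate and the overlap disappears.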