The project addresses discrepancies in logistics operations: drivers diverge from planned routes and load specifications, and the objective is to minimize that divergence. The task is to build a system that recommends standard routes based on drivers’ actual behavior patterns and preferences. It produces three outputs: recommended standard routes for the company, a list of routes for each driver that minimizes diversion, and an ideal standard route for each driver. Because real company data is unavailable, the project relies on a large synthetic dataset to test the effectiveness and efficiency of the proposed solution.
Data Generation Layer
Function: Creates large, diverse synthetic datasets to simulate standard and actual routes (including city sequences, merchandise types, and driver assignments).
Output: JSON files for standard and actual routes.
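As a rough illustration of this layer, the sketch below generates a few standard routes and derives noisy actual routes from them. The JSON field names (`id`, `route`, `from`, `to`, `merchandise`, `driver`, `sroute`), the tiny city and merchandise lists, and the `noise` parameter are all assumptions made for the example, not the project's actual schema. For brevity it perturbs only quantities; city-level deviations could be added in the same way.

```python
import json
import random

# Hypothetical placeholder vocabularies; the project uses far larger ones.
CITIES = ["Rome", "Milan", "Naples", "Turin", "Bologna", "Florence"]
MERCHANDISE = ["milk", "pens", "butter", "honey", "tomatoes"]


def make_standard_route(route_id, rng, min_trips=2, max_trips=4):
    """Build one standard route as a chain of trips (each trip starts where the last ended)."""
    n_trips = rng.randint(min_trips, max_trips)
    cities = rng.sample(CITIES, n_trips + 1)  # distinct cities, in visiting order
    trips = []
    for origin, dest in zip(cities, cities[1:]):
        items = rng.sample(MERCHANDISE, rng.randint(1, 3))
        trips.append({
            "from": origin,
            "to": dest,
            "merchandise": {item: rng.randint(1, 50) for item in items},
        })
    return {"id": route_id, "route": trips}


def make_actual_route(actual_id, driver, standard, rng, noise=0.3):
    """Copy a standard route and perturb each quantity with probability `noise`."""
    trips = []
    for trip in standard["route"]:
        merch = {}
        for item, qty in trip["merchandise"].items():
            if rng.random() < noise:
                qty = max(1, qty + rng.randint(-5, 5))  # keep quantities positive
            merch[item] = qty
        trips.append({"from": trip["from"], "to": trip["to"], "merchandise": merch})
    return {"id": actual_id, "driver": driver, "sroute": standard["id"], "route": trips}


rng = random.Random(42)  # fixed seed so the dataset is reproducible
standard_routes = [make_standard_route(f"s{i}", rng) for i in range(3)]
actual_routes = [
    make_actual_route(f"a{i}", f"D{i % 2}", rng.choice(standard_routes), rng)
    for i in range(6)
]
standard_json = json.dumps(standard_routes, indent=2)
actual_json = json.dumps(actual_routes, indent=2)
```

Seeding the generator keeps runs comparable when the dataset size is varied.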
Data Processing & Encoding Layer
Vocabulary Module: Converts categorical attributes (cities, merchandise, drivers) into numeric codes and vice versa.
Encoder Module:
Represents each route as a feature vector (combining city order and merchandise).
Applies normalization and dimensionality reduction (PCA) for efficient processing.
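The vocabulary and encoder modules could look roughly like the sketch below. The `Vocabulary` class, the `encode_route` function, and the choice to store each city's 1-based visiting position are assumptions made for illustration; the source states only that the vector combines city order and merchandise. The sample routes are made up.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


class Vocabulary:
    """Two-way mapping between categorical labels and integer codes."""

    def __init__(self, labels):
        self.index_to_label = sorted(set(labels))
        self.label_to_index = {lab: i for i, lab in enumerate(self.index_to_label)}

    def encode(self, label):
        return self.label_to_index[label]

    def decode(self, index):
        return self.index_to_label[index]

    def __len__(self):
        return len(self.index_to_label)


def encode_route(route, cities, merch):
    """Concatenate a city-position block and a merchandise-quantity block."""
    city_part = np.zeros(len(cities))
    merch_part = np.zeros(len(merch))
    stops = [route[0]["from"]] + [trip["to"] for trip in route]
    for position, city in enumerate(stops, start=1):
        # 1-based visiting position preserves order; 0 means "not visited".
        # A city visited twice keeps its last position (a simplification).
        city_part[cities.encode(city)] = position
    for trip in route:
        for item, qty in trip["merchandise"].items():
            merch_part[merch.encode(item)] += qty  # total quantity per item
    return np.concatenate([city_part, merch_part])


# Made-up sample routes, each a list of trips.
routes = [
    [{"from": "Rome", "to": "Milan", "merchandise": {"milk": 10}},
     {"from": "Milan", "to": "Turin", "merchandise": {"pens": 5}}],
    [{"from": "Rome", "to": "Naples", "merchandise": {"milk": 3, "honey": 2}}],
    [{"from": "Turin", "to": "Milan", "merchandise": {"pens": 7}},
     {"from": "Milan", "to": "Rome", "merchandise": {"milk": 1}}],
]
cities = Vocabulary(c for r in routes for t in r for c in (t["from"], t["to"]))
merch = Vocabulary(m for r in routes for t in r for m in t["merchandise"])

X = np.array([encode_route(r, cities, merch) for r in routes])
X_scaled = StandardScaler().fit_transform(X)          # normalization
X_reduced = PCA(n_components=2).fit_transform(X_scaled)  # dimensionality reduction
```

With the full city and merchandise vocabularies the vectors become much wider, which is where the PCA step pays off.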
Analysis & Feature Engineering Layer
Utility Matrix Construction:
Builds the Level 1 utility matrix to capture how closely each driver follows assigned routes.
Applies collaborative filtering (using driver similarity) to predict missing ratings, generating the Level 2 utility matrix.
Driver Profiles: Computes driver similarity matrices for use in recommendations.
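The two utility levels can be sketched as follows. This is a minimal user-based collaborative filter under stated assumptions: Level 1 ratings are taken as given scores in [0, 1] (rows are drivers, columns are standard routes, `NaN` marks routes a driver never drove), driver similarity is cosine similarity over co-rated routes, and the global-mean fallback is an arbitrary choice. The function names and sample values are illustrative only.

```python
import numpy as np


def driver_similarity(level1):
    """Cosine similarity between driver rows, computed on co-rated routes only."""
    n = level1.shape[0]
    sim = np.eye(n)  # drivers with no shared routes keep similarity 0
    for a in range(n):
        for b in range(a + 1, n):
            shared = ~np.isnan(level1[a]) & ~np.isnan(level1[b])
            if shared.any():
                u, v = level1[a, shared], level1[b, shared]
                denom = np.linalg.norm(u) * np.linalg.norm(v)
                sim[a, b] = sim[b, a] = (u @ v) / denom if denom > 0 else 0.0
    return sim


def fill_missing(level1, sim):
    """Predict each missing rating as a similarity-weighted mean of other drivers' ratings."""
    level2 = level1.copy()
    for d, r in zip(*np.where(np.isnan(level1))):
        rated = ~np.isnan(level1[:, r])  # drivers who have a rating for route r
        weights = sim[d, rated]
        if weights.sum() > 0:
            level2[d, r] = weights @ level1[rated, r] / weights.sum()
        else:
            level2[d, r] = np.nanmean(level1)  # fallback: global mean rating
    return level2


# Made-up Level 1 matrix: 3 drivers x 4 standard routes.
level1 = np.array([
    [0.9, 0.2, 0.8, np.nan],
    [0.8, 0.3, np.nan, 0.4],
    [0.1, 0.9, 0.2, 0.7],
])
sim = driver_similarity(level1)
level2 = fill_missing(level1, sim)
```

In this example driver 0 behaves much more like driver 1 than driver 2, so the predicted rating for driver 0 on the last route lands closer to driver 1's 0.4 than to driver 2's 0.7.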
Clustering & Recommendation Layer
Clustering:
Uses KMeans/KMedoids on route vectors (with extra trip-level features) to group similar routes and identify optimal cluster centers.
Generates new standard routes based on cluster medoids/centroids.
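A minimal sketch of the clustering step, assuming route vectors have already been encoded. It uses scikit-learn's `KMeans` and then picks, for each cluster, the actual route nearest to the centroid as a medoid-like representative. That selection is a stand-in for illustration, not a true KMedoids fit, and the random vectors are placeholders for encoded routes.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical encoded route vectors: three well-separated groups of 20 routes each.
centers = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [5.0, 5.0, 5.0, 5.0],
    [10.0, 0.0, 10.0, 0.0],
])
route_vectors = np.vstack([c + rng.normal(scale=0.3, size=(20, 4)) for c in centers])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(route_vectors)


def nearest_members(vectors, labels, centroids):
    """For each cluster, return the index of the member closest to its centroid."""
    picks = []
    for k, centroid in enumerate(centroids):
        members = np.flatnonzero(labels == k)
        dists = np.linalg.norm(vectors[members] - centroid, axis=1)
        picks.append(int(members[np.argmin(dists)]))
    return picks


# Indices of the actual routes that would seed the new standard routes.
medoid_like = nearest_members(route_vectors, kmeans.labels_, kmeans.cluster_centers_)
```

Choosing an existing route as the representative avoids decoding a centroid back into a valid city sequence, since a centroid is an average and need not correspond to any drivable route.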
Recommendation Engine:
Ranks standard routes per driver based on similarity.
Applies a greedy algorithm to build personalized “ideal” routes for each driver using Level 2 utility and preference scores.
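The source does not spell out the ranking or the greedy construction, so the sketch below is only one possible reading. `top_k_routes` ranks standard routes by a driver's Level 2 utility row, and `greedy_ideal_route` builds a route by repeatedly following the driver's most frequent city-to-city transition. Using transition counts as the preference score, and both function names, are assumptions made for the example.

```python
from collections import Counter


def top_k_routes(level2_row, route_ids, k=5):
    """Rank standard routes for one driver by predicted utility, best first."""
    ranked = sorted(zip(route_ids, level2_row), key=lambda pair: pair[1], reverse=True)
    return [route_id for route_id, _ in ranked[:k]]


def greedy_ideal_route(driver_routes, max_stops=4):
    """Greedily chain the driver's most frequent transitions into one route.

    driver_routes: list of city sequences the driver actually drove.
    """
    starts = Counter(seq[0] for seq in driver_routes)
    transitions = Counter(pair for seq in driver_routes for pair in zip(seq, seq[1:]))
    current = starts.most_common(1)[0][0]  # most frequent starting city
    route = [current]
    while len(route) < max_stops:
        # Candidate next cities reachable from `current` and not yet visited.
        options = {dest: n for (src, dest), n in transitions.items()
                   if src == current and dest not in route}
        if not options:
            break
        current = max(options, key=options.get)  # locally best choice
        route.append(current)
    return route


# Made-up driving history for one driver.
history = [
    ["Rome", "Milan", "Turin"],
    ["Rome", "Milan", "Bologna"],
    ["Rome", "Milan", "Turin", "Genoa"],
    ["Naples", "Rome", "Milan"],
]
ideal = greedy_ideal_route(history)
best = top_k_routes([0.45, 0.9, 0.2], ["s0", "s1", "s2"], k=2)
```

Being greedy, the construction commits to the locally most frequent next city and never backtracks, which keeps it cheap but means the result is not guaranteed to be globally optimal.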
Evaluation & Output Layer
Metrics Calculation: Computes similarity (cosine, Jaccard), silhouette score, SSE, and other performance indicators.
Result Output: Exports top routes for each driver, new standard routes, and detailed evaluation metrics as JSON and visualization charts.
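The metrics named above can be computed along these lines. The helper names and the toy data are assumptions made for illustration; `silhouette_score` is the scikit-learn function, and SSE is taken here as the sum of squared distances from each vector to its assigned cluster center.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def cosine_similarity(u, v):
    """Cosine of the angle between two route vectors (0.0 if either is all zeros)."""
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union of two collections."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0


def sse(vectors, labels, centroids):
    """Sum of squared distances from each vector to its assigned centroid."""
    return float(((vectors - centroids[labels]) ** 2).sum())


# Toy clustering result: two tight clusters of two points each.
vectors = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])

total_sse = sse(vectors, labels, centroids)
sil = silhouette_score(vectors, labels)
city_overlap = jaccard_similarity(["Rome", "Milan", "Turin"], ["Rome", "Milan", "Naples"])
```

Jaccard similarity ignores order and quantities, so it complements cosine similarity on the encoded vectors rather than replacing it.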
Synthetic Data Creation: Developed a robust data generator to simulate logistics operations across 65 Italian cities and 48 merchandise types, enabling extensive testing without real company data.
Flexible Route Encoding: Designed a modular pipeline that encodes routes with both city sequence and merchandise quantities, preserving critical trip order and load information for accurate analysis.
Clustering for Standard Routes: Applied KMeans and KMedoids algorithms to group similar actual routes, enabling the generation of data-driven, optimal standard routes.
Collaborative & Content-Based Filtering: Built driver profiles and applied collaborative filtering to predict unseen route preferences, supplementing with a greedy algorithm for personalized route optimization.
Personalized Route Recommendations: Suggests optimal standard routes for each driver based on behavioral patterns and past performance.
Dynamic Data Simulation: Adjustable synthetic dataset generation reflecting diverse logistics scenarios.
Driver Profiling: Analyzes driving patterns, building similarity-based driver clusters for more effective recommendation.
Multi-level Utility Matrix: Robust analysis of driver/trip relationships across observed and estimated preferences.
Reduced Route Deviation: In simulations, the recommended routes consistently reduced the gap between planned and actual routes.
Stabilized Route Patterns: New standard routes generated from clusters showed lower deviation and higher overall similarity to actual driven routes.
Scalable Evaluation: The system successfully handled datasets of up to 60,000 routes, providing consistent results across varied data volumes.
Enhanced Driver Satisfaction (Hypothetical): By aligning routes with driver habits, the model demonstrated potential for increased operational efficiency and satisfaction.
Python (core scripting and orchestration)
Scikit-learn & Scikit-learn-extra (KMeans, KMedoids, PCA)
NumPy & Pandas (data handling and transformation)
MinHash & Jaccard Similarity (route similarity evaluation)
JSON (for data storage/input/output)
Matplotlib/Seaborn (visualization charts)
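For the MinHash and Jaccard entry in the list above, the following standard-library sketch shows how a MinHash signature approximates Jaccard similarity between two routes. Representing a route as a set of `"origin>destination"` trip tokens, the salted `blake2b` hash family, and the number of hash functions are assumptions made for the example. Token sets are assumed to be non-empty.

```python
import hashlib


def _hash(token, seed):
    """64-bit hash of a token; the seed acts as a salt to get a family of hash functions."""
    digest = hashlib.blake2b(
        token.encode("utf-8"), digest_size=8, salt=seed.to_bytes(8, "little")
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(tokens, num_hashes=256):
    """For each hash function, keep the minimum hash value over the token set."""
    tokens = set(tokens)
    return [min(_hash(t, seed) for t in tokens) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; an estimate of the Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


# Made-up routes expressed as trip tokens.
route_a = {"Rome>Milan", "Milan>Turin", "Turin>Genoa"}
route_b = {"Rome>Milan", "Milan>Turin", "Turin>Bologna"}

exact = len(route_a & route_b) / len(route_a | route_b)
approx = estimated_jaccard(minhash_signature(route_a), minhash_signature(route_b))
```

The estimate tightens as the number of hash functions grows. The benefit over exact Jaccard is that fixed-length signatures can be compared quickly across a large number of routes.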
Synthetic Data Limitations: While detailed, the synthetic datasets may not capture all the realities of live logistics data (e.g., unpredictable traffic, real driver preferences).
Route Evaluation Metrics: Assessing the “real-world” quality of newly generated routes remains challenging, especially for unseen/novel route configurations.
Scalability to Production: Integrating this system with real-time data streams in a live logistics environment would require further engineering and robustness checks.
Future Integration: Open opportunities exist to employ reinforcement learning, integrate real GPS data, or enable feedback-driven model updating for continuous improvement.
Professor at Utrecht University
Chair on Very Large Data Management
Leader of Master's in Data Science