Data Mining Project

Route Optimization and Recommendation System

🧭 Objective

This project addresses discrepancies in logistics operations: drivers often diverge from planned routes and load specifications, and the goal is to minimize that divergence. The system recommends standard routes based on drivers’ actual behavior patterns and preferences. It produces three outputs: a set of recommended standard routes for the company, a ranked list of standard routes for each driver that minimizes divergence, and an ideal standard route for each driver. Because real company data was unavailable, the project relies on a large synthetic dataset to test the effectiveness and efficiency of the proposed solution.

⚙️ System Architecture

The system is designed as follows:

Data Generation Layer

Function: Creates large, diverse synthetic datasets to simulate standard and actual routes (including city sequences, merchandise types, and driver assignments).
Output: JSON files for standard and actual routes.
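
The following is a minimal sketch of what such a generator might look like, not the project's actual code. The JSON schema (`id`, `driver`, `sroute`, `route`, and per-trip `from`/`to`/`merchandise` keys), the short city and merchandise lists, and the perturbation probability `p_change` are all assumptions made for illustration.

```python
import json
import random

# Hypothetical small vocabularies; the project itself uses far larger ones.
CITIES = ["Rome", "Milan", "Naples", "Turin", "Bologna", "Florence"]
MERCHANDISE = ["milk", "pasta", "coffee", "tomatoes"]


def make_standard_route(route_id, rng, n_trips=3):
    """Build one standard route as a chain of trips with a load per trip."""
    cities = rng.sample(CITIES, n_trips + 1)  # distinct cities, in visiting order
    trips = []
    for origin, dest in zip(cities, cities[1:]):
        items = rng.sample(MERCHANDISE, 2)
        trips.append({
            "from": origin,
            "to": dest,
            "merchandise": {item: rng.randint(1, 20) for item in items},
        })
    return {"id": route_id, "route": trips}


def make_actual_route(standard, actual_id, driver, rng, p_change=0.3):
    """Copy a standard route, randomly perturbing destinations and quantities."""
    trips = []
    origin = standard["route"][0]["from"]
    for trip in standard["route"]:
        dest = trip["to"]
        if rng.random() < p_change:
            # Simulated deviation: the driver goes somewhere else instead.
            dest = rng.choice([c for c in CITIES if c != origin])
        load = {k: max(1, v + rng.randint(-3, 3))
                for k, v in trip["merchandise"].items()}
        trips.append({"from": origin, "to": dest, "merchandise": load})
        origin = dest  # keep the chain of trips connected
    return {"id": actual_id, "driver": driver,
            "sroute": standard["id"], "route": trips}


rng = random.Random(0)  # fixed seed so the dataset is reproducible
standard = [make_standard_route(f"s{i}", rng) for i in range(3)]
actual = [make_actual_route(s, f"a{i}", f"D{i % 2}", rng)
          for i, s in enumerate(standard)]
standard_json = json.dumps(standard, indent=2)
```

Seeding the generator keeps runs reproducible, which helps when comparing results across dataset sizes.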

Data Processing & Encoding Layer

Vocabulary Module: Converts categorical attributes (cities, merchandise, drivers) into numeric codes and vice versa.
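
As an illustration, a bidirectional label-to-code mapping could be sketched as below. The class name `Vocabulary` and its methods are hypothetical and are not taken from the project's code.

```python
class Vocabulary:
    """Bidirectional mapping between categorical labels and integer codes."""

    def __init__(self, labels=()):
        self._to_code = {}   # label -> code
        self._to_label = []  # code -> label (list index is the code)
        for label in labels:
            self.add(label)

    def add(self, label):
        """Register a label if unseen and return its code."""
        if label not in self._to_code:
            self._to_code[label] = len(self._to_label)
            self._to_label.append(label)
        return self._to_code[label]

    def encode(self, label):
        return self._to_code[label]

    def decode(self, code):
        return self._to_label[code]

    def __len__(self):
        return len(self._to_label)


cities = Vocabulary(["Rome", "Milan", "Naples"])
codes = [cities.encode(c) for c in ["Milan", "Rome"]]  # [1, 0]
```

One such vocabulary per attribute (cities, merchandise, drivers) keeps the code spaces independent.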
Encoder Module:
- Represents each route as a feature vector (combining city order and merchandise).
- Applies normalization and dimensionality reduction (PCA) for efficient processing.
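
One possible encoding is sketched below under stated assumptions. The city-order weighting (1 / (1 + position of first visit)), the min-max normalization, and the trip schema are illustrative choices, not the project's confirmed scheme. The PCA step is omitted here; in practice it would be applied to the normalized vectors.

```python
def encode_route(route, cities, goods):
    """Return [city-order features | merchandise totals] for one route.

    City feature: 1 / (1 + position of first visit), or 0.0 if never visited,
    so earlier stops weigh more (an assumed scheme).
    """
    visited = [route[0]["from"]] + [trip["to"] for trip in route]
    city_part = [0.0] * len(cities)
    for pos, city in enumerate(visited):
        idx = cities.index(city)
        if city_part[idx] == 0.0:  # only the first visit counts
            city_part[idx] = 1.0 / (1 + pos)
    goods_part = [0.0] * len(goods)
    for trip in route:
        for item, qty in trip["merchandise"].items():
            goods_part[goods.index(item)] += qty
    return city_part + goods_part


def min_max_normalize(rows):
    """Scale every column to [0, 1]; constant columns become 0.0."""
    cols = list(zip(*rows))
    lows = [min(c) for c in cols]
    spans = [max(c) - min(c) for c in cols]
    return [
        [(v - lo) / span if span else 0.0
         for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


CITIES = ["Rome", "Milan", "Naples"]
GOODS = ["milk", "pasta"]
route_a = [
    {"from": "Rome", "to": "Milan", "merchandise": {"milk": 4}},
    {"from": "Milan", "to": "Naples", "merchandise": {"pasta": 10, "milk": 2}},
]
route_b = [{"from": "Naples", "to": "Rome", "merchandise": {"pasta": 5}}]

raw = [encode_route(r, CITIES, GOODS) for r in (route_a, route_b)]
vectors = min_max_normalize(raw)
```

Every route maps to a vector of the same length (number of cities plus number of merchandise types), which is what the later clustering and similarity steps require.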

Analysis & Feature Engineering Layer

Utility Matrix Construction:
- Builds a Level 1 utility matrix that captures how closely each driver follows assigned routes.
- Applies collaborative filtering (using driver similarity) to predict missing ratings, producing a Level 2 utility matrix.
Driver Profiles: Computes driver similarity matrices for use in recommendations.
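
The two-level idea can be sketched as follows. This is a minimal user-based collaborative filtering example under assumptions: ratings are stored as nested dictionaries, driver similarity is cosine similarity over co-rated routes, and a missing rating is predicted as a similarity-weighted average. The function names and sample ratings are hypothetical.

```python
import math


def driver_similarity(ratings_a, ratings_b):
    """Cosine similarity over the routes both drivers have rated."""
    shared = set(ratings_a) & set(ratings_b)
    if not shared:
        return 0.0
    dot = sum(ratings_a[r] * ratings_b[r] for r in shared)
    norm_a = math.sqrt(sum(ratings_a[r] ** 2 for r in shared))
    norm_b = math.sqrt(sum(ratings_b[r] ** 2 for r in shared))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def fill_missing(level1, routes):
    """Level 2 matrix: keep observed ratings, predict the rest from other drivers."""
    level2 = {}
    for driver, own in level1.items():
        level2[driver] = dict(own)
        for route in routes:
            if route in own:
                continue
            num = den = 0.0
            for other, theirs in level1.items():
                if other == driver or route not in theirs:
                    continue
                sim = driver_similarity(own, theirs)
                num += sim * theirs[route]
                den += abs(sim)
            if den:  # leave the cell empty if no similar driver rated it
                level2[driver][route] = num / den
    return level2


# Level 1: observed closeness (0..1) between each driver's actual and assigned routes.
level1 = {
    "D1": {"s1": 0.9, "s2": 0.4},
    "D2": {"s1": 0.8, "s2": 0.5, "s3": 0.7},
    "D3": {"s2": 0.6, "s3": 0.2},
}
level2 = fill_missing(level1, ["s1", "s2", "s3"])
```

Note that cosine similarity over a single shared route is always 1.0, so a real implementation would likely require a minimum overlap or mean-center the ratings first.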

Clustering & Recommendation Layer

Clustering:
- Uses KMeans/KMedoids on route vectors (with extra trip-level features) to group similar routes and identify optimal cluster centers.
- Generates new standard routes based on cluster medoids/centroids.
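
To illustrate the medoid-based step, here is a simplified alternating k-medoids written in plain Python. It is a stand-in for a library KMedoids implementation, not the project's code, and it assumes routes are reduced to sets of visited cities compared with Jaccard distance rather than the encoded route vectors described above.

```python
import random


def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| for two sets."""
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def assign(items, medoids, distance):
    """Map each medoid index to the indices of the items closest to it."""
    clusters = {m: [] for m in medoids}
    for i, item in enumerate(items):
        nearest = min(medoids, key=lambda m: distance(item, items[m]))
        clusters[nearest].append(i)
    return clusters


def k_medoids(items, k, distance, n_iter=20, seed=0):
    """Alternate between assigning items and re-picking each cluster's medoid."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(items)), k)
    for _ in range(n_iter):
        clusters = assign(items, medoids, distance)
        # New medoid: the member with the smallest total distance to its cluster.
        updated = [
            min(members or [m],
                key=lambda c: sum(distance(items[c], items[j]) for j in members))
            for m, members in clusters.items()
        ]
        if set(updated) == set(medoids):
            break
        medoids = updated
    return medoids, assign(items, medoids, distance)


actual_routes = [
    {"Rome", "Milan", "Turin"},
    {"Rome", "Milan", "Turin", "Genoa"},
    {"Rome", "Milan"},
    {"Naples", "Bari", "Palermo"},
    {"Naples", "Bari"},
    {"Naples", "Bari", "Palermo", "Catania"},
]
medoids, clusters = k_medoids(actual_routes, k=2, distance=jaccard_distance)
new_standard_routes = [actual_routes[m] for m in medoids]
```

Because a medoid is always an actual observed route, each new standard route is guaranteed to be one that some driver really drove, which is not true of a KMeans centroid.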
Recommendation Engine:
- Ranks standard routes per driver based on similarity.
- Applies a greedy algorithm to build personalized “ideal” routes for each driver using Level 2 utility and preference scores.
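
A rough sketch of both steps follows. The source does not specify how the greedy construction works, so this version assumes per-driver preference scores keyed by `(origin, destination)` pairs and extends the route one best-scored trip at a time. All names and numbers are hypothetical.

```python
def rank_routes(utilities, top_n=3):
    """Sort standard route ids by descending utility for one driver."""
    return sorted(utilities, key=utilities.get, reverse=True)[:top_n]


def greedy_ideal_route(start, preference, n_trips):
    """Chain trips greedily: from the current city, take the best-scored unvisited city."""
    route, current, visited = [start], start, {start}
    for _ in range(n_trips):
        options = {
            dest: score
            for (origin, dest), score in preference.items()
            if origin == current and dest not in visited
        }
        if not options:
            break  # dead end: no scored trip leaves the current city
        current = max(options, key=options.get)
        visited.add(current)
        route.append(current)
    return route


# One driver's row of the Level 2 utility matrix (assumed values).
level2_row = {"s1": 0.9, "s2": 0.4, "s3": 0.45}
# Assumed preference scores for individual trips.
preference = {
    ("Rome", "Milan"): 0.8,
    ("Rome", "Naples"): 0.6,
    ("Milan", "Turin"): 0.7,
    ("Milan", "Naples"): 0.9,
    ("Naples", "Bari"): 0.5,
}

top = rank_routes(level2_row, top_n=2)                          # ['s1', 's3']
ideal = greedy_ideal_route("Rome", preference, n_trips=3)
# ideal == ['Rome', 'Milan', 'Naples', 'Bari']
```

A greedy choice is fast but only locally optimal: picking the best next trip can lead into a city with poor onward options.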

Evaluation & Output Layer

Metrics Calculation: Computes similarity (cosine, Jaccard), silhouette score, SSE, and other performance indicators.
Result Output: Exports top routes for each driver, new standard routes, and detailed evaluation metrics as JSON and visualization charts.
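
For reference, three of the listed metrics can be computed as in the sketch below, written in plain Python for clarity. The function names are illustrative, and the silhouette score is left out because a library implementation would normally be used for it.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """|A & B| / |A | B| for two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, centers[label]))
        for point, label in zip(points, labels)
    )


cos = cosine_similarity([1, 2], [2, 4])                            # parallel vectors
jac = jaccard_similarity({"Rome", "Milan"}, {"Milan", "Turin"})    # one of three cities shared
err = sse([[0, 0], [2, 0], [10, 0]], [[1, 0], [10, 0]], [0, 0, 1])
```

Cosine similarity suits the encoded route vectors, while Jaccard similarity suits set-like views of a route such as the cities visited.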

Project Details

🧠 Implementation Highlights

Synthetic Data Creation: Developed a robust data generator to simulate logistics operations across 65 Italian cities and 48 merchandise types, enabling extensive testing without real company data.
Flexible Route Encoding: Designed a modular pipeline that encodes routes with both city sequence and merchandise quantities, preserving critical trip order and load information for accurate analysis.
Clustering for Standard Routes: Applied KMeans and KMedoids algorithms to group similar actual routes, enabling the generation of data-driven, optimal standard routes.
Collaborative & Content-Based Filtering: Built driver profiles and applied collaborative filtering to predict unseen route preferences, supplementing with a greedy algorithm for personalized route optimization.

💡 Key Features

Personalized Route Recommendations: Suggests optimal standard routes for each driver based on behavioral patterns and past performance.
Dynamic Data Simulation: Adjustable synthetic dataset generation reflecting diverse logistics scenarios.
Driver Profiling: Analyzes driving patterns, building similarity-based driver clusters for more effective recommendation.
Multi-level Utility Matrix: Robust analysis of driver/trip relationships across observed and estimated preferences.

📈 Results

Reduced Route Deviation: The optimized recommended routes consistently minimized the gap between planned and actual routes in simulations.
Stabilized Route Patterns: New standard routes generated from clusters showed lower deviation and higher overall similarity to actual driven routes.
Scalable Evaluation: The system successfully handled datasets of up to 60,000 routes, providing consistent results across varied data volumes.
Enhanced Driver Satisfaction (Hypothetical): By aligning routes with driver habits, the approach could plausibly improve operational efficiency and driver satisfaction; this remains a hypothesis rather than a measured result.

🔗 Key Technologies Used

Python (core scripting and orchestration)
Scikit-learn & Scikit-learn-extra (KMeans, KMedoids, PCA)
NumPy & Pandas (data handling and transformation)
MinHash & Jaccard Similarity (route similarity evaluation)
JSON (for data storage/input/output)
Matplotlib/Seaborn (visualization charts)

🗺️ Challenges and Open Issues

Synthetic Data Limitations: While detailed, the synthetic datasets may not capture all the realities of live logistics data (e.g., unpredictable traffic, real driver preferences).
Route Evaluation Metrics: Assessing the “real-world” quality of newly generated routes remains challenging, especially for unseen/novel route configurations.
Scalability to Production: Integrating this system with real-time data streams in a live logistics environment would require further engineering and robustness checks.
Future Integration: Open opportunities exist to employ reinforcement learning, integrate real GPS data, or enable feedback-driven model updating for continuous improvement.