This project implements a comprehensive data engineering solution for a city traffic department to efficiently collect, store, and analyze traffic data captured by swarm UAVs (drones) and static roadside cameras. The primary objective was to create a scalable data warehouse architecture capable of hosting vehicle trajectory data extracted from video footage, with the goal of improving urban traffic flow and supporting various analytical projects.
This solution was built around the pNEUMA dataset—a large-scale collection of naturalistic trajectories from approximately half a million vehicles recorded in downtown Athens, Greece. The data engineering pipeline was designed using modern tools and methodologies to ensure high performance, scalability, and maintainability.
The city traffic department faced several significant challenges:
Managing massive volumes of traffic data (approximately 87MB per file for each area, date, and time combination)
Creating a data infrastructure that could support multiple downstream analytical projects with different requirements
Enabling traffic engineers and data scientists to efficiently query and transform data on demand
Building a system flexible enough to accommodate future analytical needs that were not yet defined
Extract-Load-Transform (ELT) framework using modern data engineering tools: Unlike traditional Extract-Transform-Load (ETL) approaches, this ELT architecture provided greater flexibility by loading raw data into the warehouse first and establishing transformation workflows on an as-needed basis. This approach allows analytics engineers in the traffic department to adapt the data to various use cases without rebuilding the entire pipeline.
Data Storage Layer: PostgreSQL served as the primary data warehouse, chosen for its robust handling of relational data and advanced features such as function overloading, table inheritance, and efficient management of sparse datasets.
Data Pipeline Orchestration: Apache Airflow was implemented as the workflow management system to automate and schedule data processing tasks. Airflow DAGs (Directed Acyclic Graphs) were designed to handle data extraction, loading, and the triggering of transformation processes.
Data Transformation Layer: DBT (Data Build Tool) was deployed to manage SQL-based transformations, create data models, and keep all transformations under version control. DBT enabled systematic testing and documentation of the data transformation process.
Visualization & Analytics Layer: Redash was implemented as the dashboard solution, providing traffic analysts with intuitive visualizations and data exploration capabilities.
Infrastructure Management: Docker Compose was used to containerize the entire application stack, ensuring consistency across environments and simplifying deployment.
PostgreSQL: Core data warehouse for storing traffic trajectory data; chosen for efficient handling of sparse datasets.
Apache Airflow: Workflow management system that automated and scheduled data pipeline tasks through DAGs (Directed Acyclic Graphs).
DBT (Data Build Tool): Managed data transformations, testing, documentation, and version control within the ELT framework.
Redash: Dashboard creation tool for visualizing traffic data and generating insights.
Docker Compose: Containerization solution that integrated all tools as a cohesive application.
SQLAlchemy: ORM/SQL toolkit used to load parsed CSV data into PostgreSQL (see the loading sketch after this list).
Python & SQL: Programming foundation for pipeline components and data transformation queries.
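As a rough illustration of the loading step, the sketch below uses pandas together with a SQLAlchemy engine to push a parsed CSV into the warehouse. The connection string, file name, and table name are placeholders rather than the project's actual configuration, and the sketch assumes the file has already been flattened into a regular tabular layout.

```python
import pandas as pd
from sqlalchemy import create_engine

# Illustrative connection string; host and credentials would come from the
# Docker Compose environment in practice.
engine = create_engine(
    "postgresql+psycopg2://warehouse_user:warehouse_pw@localhost:5432/traffic_dw"
)

def load_csv_to_warehouse(csv_path: str, table_name: str, separator: str = ";") -> int:
    """Read a trajectory CSV and append its rows to a warehouse table."""
    df = pd.read_csv(csv_path, sep=separator)
    # Normalize column names so downstream DBT models see consistent identifiers.
    df.columns = [c.strip().lower() for c in df.columns]
    df.to_sql(table_name, engine, if_exists="append", index=False)
    return len(df)

if __name__ == "__main__":
    # Hypothetical file name following the dataset's (area, date, time) convention.
    rows = load_csv_to_warehouse("data/area1_20181024_0900_0930.csv", "raw_trajectories")
    print(f"Loaded {rows} rows into raw_trajectories")
```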
The project began with configuring a Docker Compose environment to host all the necessary tools. This involved:
Setting up container definitions for PostgreSQL, Airflow, and Redash
Configuring networking between containers
Establishing persistent volume mappings for data storage
Setting up user authentication and security parameters
The Airflow DAGs built on this environment provided the following capabilities (a condensed sketch follows this list):
Automated table creation and schema management
Data extraction from CSV files with configurable separators
Error handling with email notifications
Sequential task dependencies to ensure data integrity
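A condensed sketch of such a DAG under Airflow 2.x is shown below. The connection id, DDL file, task callable, and e-mail address are hypothetical stand-ins, not the project's actual code.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    "owner": "traffic-data-team",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # Error handling: Airflow emails the team when any task fails.
    "email": ["traffic-analytics@example.com"],
    "email_on_failure": True,
}

def extract_and_load(**context):
    # Placeholder for the SQLAlchemy-based CSV parsing and loading step.
    ...

with DAG(
    dag_id="traffic_data_pipeline",
    start_date=datetime(2023, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args=default_args,
) as dag:
    # Automated table creation and schema management.
    create_tables = PostgresOperator(
        task_id="create_tables",
        postgres_conn_id="traffic_warehouse",   # hypothetical connection id
        sql="sql/create_raw_trajectories.sql",  # hypothetical DDL file
    )

    # Extract the CSV files and load them into the raw schema.
    load_raw_data = PythonOperator(
        task_id="load_raw_data",
        python_callable=extract_and_load,
    )

    # Sequential dependencies preserve data integrity: tables exist before loading.
    create_tables >> load_raw_data
```

Keeping table creation and loading as separate tasks lets Airflow retry or backfill each stage independently instead of re-running the whole pipeline.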
Implemented DBT models to transform the raw traffic data into analytics-ready datasets. The DBT implementation included:
Data modeling for different analytical purposes
Data quality tests to ensure accuracy
Documentation generation for data lineage
Integration with Airflow for scheduled transformations (see the sketch after this list)
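One straightforward way to wire DBT into the orchestration layer is a BashOperator that shells out to the dbt CLI. The project directory and DAG wiring below are illustrative assumptions, not the project's actual setup.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

DBT_PROJECT_DIR = "/opt/dbt/traffic_transformations"  # hypothetical project path

with DAG(
    dag_id="traffic_dbt_transformations",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,  # triggered after the loading DAG completes
    catchup=False,
) as dag:
    # Build the analytics-ready models from the raw trajectory tables.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command=f"dbt run --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )

    # Run the data-quality tests defined alongside the models.
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command=f"dbt test --project-dir {DBT_PROJECT_DIR} --profiles-dir {DBT_PROJECT_DIR}",
    )

    # Test the models only after they have been built successfully.
    dbt_run >> dbt_test
```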
Traffic Flow Analysis: Visualizations showing vehicle density and flow rates across different city areas
Speed Distribution: Graphs depicting average speeds by vehicle type and time of day (see the query sketch below)
Congestion Identification: Heat maps highlighting congestion hotspots
Route Analysis: Visualizations of common vehicle routes and patterns
Each dashboard was designed with interactive filters allowing traffic analysts to explore the data dynamically.
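To make the Speed Distribution view concrete, the sketch below shows the kind of aggregation such a dashboard could be built on, expressed here through SQLAlchemy for illustration; in practice Redash would run the SQL directly. The table and column names (fact_trajectories, vehicle_type, recorded_at, avg_speed) are assumptions, not the project's actual models.

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine(
    "postgresql+psycopg2://warehouse_user:warehouse_pw@localhost:5432/traffic_dw"
)

# Average speed per vehicle type and hour of day -- the aggregation a
# "Speed Distribution" widget might be built on.
SPEED_BY_TYPE_AND_HOUR = text("""
    SELECT vehicle_type,
           EXTRACT(HOUR FROM recorded_at) AS hour_of_day,
           AVG(avg_speed)                 AS mean_speed_kmh
    FROM fact_trajectories
    GROUP BY vehicle_type, hour_of_day
    ORDER BY vehicle_type, hour_of_day
""")

with engine.connect() as conn:
    speed_distribution = pd.read_sql(SPEED_BY_TYPE_AND_HOUR, conn)
print(speed_distribution.head())
```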
pNEUMA is an open, large-scale dataset of naturalistic trajectories from half a million vehicles, collected in a one-of-a-kind experiment by a swarm of drones over the congested downtown area of Athens, Greece. Each file for a single (area, date, time) combination is approximately 87 MB of data.
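Because each row in a raw pNEUMA file carries a vehicle's entire trajectory, the data typically needs reshaping into one row per observation before it fits a relational table. The sketch below assumes the commonly documented layout (four metadata fields track_id, type, traveled_d, avg_speed followed by repeating lat, lon, speed, lon_acc, lat_acc, time groups) and semicolon separators; this layout should be verified against the actual files before relying on it.

```python
import csv
from typing import Iterator

# Assumed pNEUMA layout: 4 metadata fields, then repeating 6-value point groups.
META_FIELDS = ["track_id", "type", "traveled_d", "avg_speed"]
POINT_FIELDS = ["lat", "lon", "speed", "lon_acc", "lat_acc", "time"]

def iter_trajectory_points(csv_path: str) -> Iterator[dict]:
    """Yield one dict per (vehicle, timestamp) observation from a raw pNEUMA file."""
    with open(csv_path, newline="") as f:
        reader = csv.reader(f, delimiter=";")
        next(reader)  # skip the header row
        for row in reader:
            values = [v.strip() for v in row if v.strip()]
            meta = dict(zip(META_FIELDS, values[:4]))
            points = values[4:]
            # Walk the trajectory in groups of six values, ignoring any partial tail.
            for i in range(0, len(points) - len(points) % 6, 6):
                point = dict(zip(POINT_FIELDS, points[i:i + 6]))
                yield {**meta, **point}

if __name__ == "__main__":
    # Hypothetical file name; print the first few flattened observations.
    for n, record in enumerate(iter_trajectory_points("data/sample_area.csv")):
        if n >= 3:
            break
        print(record)
```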