For Module 2 of the Data Engineering Zoomcamp, the focus shifted from infrastructure to workflow orchestration, where pipelines begin to resemble real production systems.
This module centered on extending an existing ELT workflow to handle historical backfills, larger data volumes, and stronger orchestration practices.
Here’s what I built and learned.
The goal: extend the existing ELT workflow to include NYC Taxi data for 2021, while keeping the pipeline structured, repeatable, and automated.
Kestra serves as the orchestration engine coordinating downloads, loading, transformations, and scheduling.
How the Pipeline Works
Kestra downloads compressed NYC Taxi CSV files from GitHub releases and unzips them during execution.
Each file follows this naming pattern:
{taxi_type}_tripdata_{year}-{month}.csv
Example:
green_tripdata_2020-04.csv
Data is first loaded into staging tables in PostgreSQL:
yellow_tripdata_staging
green_tripdata_staging
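To make that concrete, here is a minimal sketch of what a staging table can look like. The column list is abbreviated and illustrative, not the full CSV schema used in the repo:

```sql
-- Minimal sketch of a staging table (column list abbreviated and
-- illustrative; the real tables mirror the full taxi CSV schema).
CREATE TABLE IF NOT EXISTS green_tripdata_staging (
    unique_row_id          text,
    filename               text,
    VendorID               text,
    lpep_pickup_datetime   timestamp,
    lpep_dropoff_datetime  timestamp,
    PULocationID           text,
    DOLocationID           text,
    trip_distance          numeric,
    fare_amount            numeric
);

-- A common pattern: clear staging before each monthly load, then append
-- the month's CSV rows (for example with COPY).
TRUNCATE TABLE green_tripdata_staging;
```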
Why staging tables?
Isolate raw loads
Allow cleaning and validation before merging
Prevent incomplete or corrupt data from reaching final tables
A unique_row_id is generated using trip attributes to prevent duplicates.
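One common way to build such a hash in PostgreSQL is an MD5 over a handful of trip attributes. The sketch below assumes that approach; the exact columns used in the actual flow may differ:

```sql
-- Sketch: build a deterministic hash from trip attributes so the same
-- input row always yields the same unique_row_id (columns are illustrative).
UPDATE green_tripdata_staging
SET unique_row_id = md5(
        coalesce(cast(VendorID as text), '')              ||
        coalesce(cast(lpep_pickup_datetime as text), '')  ||
        coalesce(cast(lpep_dropoff_datetime as text), '') ||
        coalesce(cast(PULocationID as text), '')          ||
        coalesce(cast(DOLocationID as text), '')
    )
WHERE unique_row_id IS NULL;
```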
From staging, data is merged into final tables:
yellow_tripdata
green_tripdata
The merge logic:
Inserts only new records
Prevents duplicates using the generated hash ID
Preserves historical data across runs
This makes the pipeline idempotent, meaning it can be rerun safely without duplicating data.
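Here is a minimal sketch of what that merge can look like, assuming PostgreSQL 15+ for the MERGE statement (on older versions, an INSERT ... SELECT with ON CONFLICT DO NOTHING against a unique index on unique_row_id achieves the same effect):

```sql
-- Sketch of the idempotent merge: only rows whose hash is not already in
-- the final table get inserted, so reruns never create duplicates.
-- Assumes PostgreSQL 15+ for MERGE.
MERGE INTO green_tripdata AS t
USING green_tripdata_staging AS s
    ON t.unique_row_id = s.unique_row_id
WHEN NOT MATCHED THEN
    INSERT (unique_row_id, filename, VendorID,
            lpep_pickup_datetime, lpep_dropoff_datetime,
            PULocationID, DOLocationID, trip_distance, fare_amount)
    VALUES (s.unique_row_id, s.filename, s.VendorID,
            s.lpep_pickup_datetime, s.lpep_dropoff_datetime,
            s.PULocationID, s.DOLocationID, s.trip_distance, s.fare_amount);
```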
Backfilling Historical Data
The main task in this module was backfilling 2021 data.
Instead of modifying the main pipeline, I created a separate backfill flow in Kestra that:
Loops over taxi types (yellow and green)
Loops over months (January to July 2021)
Triggers the main ELT flow as a subflow
This mirrors real-world data engineering practices for:
Late-arriving data
Historical reprocessing
Data corrections
Key Learnings
1. Orchestration is as important as SQL
Many pipeline failures stem from orchestration issues rather than query logic.
2. Staging layers are essential
They provide a buffer between raw data and analytics-ready tables.
3. Backfills are part of real data engineering
Historical reprocessing is common and must be handled systematically.
4. Idempotency is critical
Pipelines should be safe to rerun without breaking data integrity.
Full solution and workflows:
https://github.com/stephandoh/zoomcamp27876