For Module 2 of the Data Engineering Zoomcamp, the focus shifted from infrastructure to workflow orchestration, where pipelines begin to resemble real production systems.
This module centered on extending an existing ELT workflow to handle historical backfills, larger data volumes, and stronger orchestration practices.
Here’s what I built and learned.
The goal: extend the existing ELT workflow to include NYC Taxi data for 2021, while keeping the pipeline structured, repeatable, and automated.
Kestra serves as the orchestration engine coordinating downloads, loading, transformations, and scheduling.
How the Pipeline Works
Kestra downloads compressed NYC Taxi CSV files from GitHub releases and unzips them during execution.
Each file follows this naming pattern:
{taxi_type}_tripdata_{year}-{month}.csv
Example:
green_tripdata_2020-04.csv
Data is first loaded into staging tables in PostgreSQL:
yellow_tripdata_staging
green_tripdata_staging
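To make that concrete, here is a minimal sketch of what a staging table can look like. The column list is abbreviated and illustrative, not the full CSV schema used in the repo:

```sql
-- Minimal sketch of a staging table (column list abbreviated and
-- illustrative; the real tables mirror the full taxi CSV schema).
CREATE TABLE IF NOT EXISTS green_tripdata_staging (
    unique_row_id          text,
    filename               text,
    VendorID               text,
    lpep_pickup_datetime   timestamp,
    lpep_dropoff_datetime  timestamp,
    PULocationID           text,
    DOLocationID           text,
    trip_distance          numeric,
    fare_amount            numeric
);

-- A common pattern: clear staging before each monthly load, then append
-- the month's CSV rows (for example with COPY).
TRUNCATE TABLE green_tripdata_staging;
```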
Why staging tables?
Isolate raw loads
Allow cleaning and validation before merging
Prevent incomplete or corrupt data from reaching final tables
A unique_row_id is generated using trip attributes to prevent duplicates.
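One common way to build such a hash in PostgreSQL is an MD5 over a handful of trip attributes. The sketch below assumes that approach; the exact columns used in the actual flow may differ:

```sql
-- Sketch: build a deterministic hash from trip attributes so the same
-- input row always yields the same unique_row_id (columns are illustrative).
UPDATE green_tripdata_staging
SET unique_row_id = md5(
        coalesce(cast(VendorID as text), '')              ||
        coalesce(cast(lpep_pickup_datetime as text), '')  ||
        coalesce(cast(lpep_dropoff_datetime as text), '') ||
        coalesce(cast(PULocationID as text), '')          ||
        coalesce(cast(DOLocationID as text), '')
    )
WHERE unique_row_id IS NULL;
```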
From staging, data is merged into final tables:
yellow_tripdata
green_tripdata
The merge logic:
Inserts only new records
Prevents duplicates using the generated hash ID
Preserves historical data across runs
This makes the pipeline idempotent, meaning it can be rerun safely without duplicating data.
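Here is a minimal sketch of what that merge can look like, assuming PostgreSQL 15+ for the MERGE statement (on older versions, an INSERT ... SELECT with ON CONFLICT DO NOTHING against a unique index on unique_row_id achieves the same effect):

```sql
-- Sketch of the idempotent merge: only rows whose hash is not already in
-- the final table get inserted, so reruns never create duplicates.
-- Assumes PostgreSQL 15+ for MERGE.
MERGE INTO green_tripdata AS t
USING green_tripdata_staging AS s
    ON t.unique_row_id = s.unique_row_id
WHEN NOT MATCHED THEN
    INSERT (unique_row_id, filename, VendorID,
            lpep_pickup_datetime, lpep_dropoff_datetime,
            PULocationID, DOLocationID, trip_distance, fare_amount)
    VALUES (s.unique_row_id, s.filename, s.VendorID,
            s.lpep_pickup_datetime, s.lpep_dropoff_datetime,
            s.PULocationID, s.DOLocationID, s.trip_distance, s.fare_amount);
```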
Backfilling Historical Data
The main task in this module was backfilling 2021 data.
Instead of modifying the main pipeline, I created a separate backfill flow in Kestra that:
Loops over taxi types (yellow and green)
Loops over months (January to July 2021)
Triggers the main ELT flow as a subflow
This mirrors real-world data engineering practices for:
Late-arriving data
Historical reprocessing
Data corrections
Key Learnings
1. Orchestration is as important as SQL
Many pipeline failures stem from orchestration issues rather than query logic.
2. Staging layers are essential
They provide a buffer between raw data and analytics-ready tables.
3. Backfills are part of real data engineering
Historical reprocessing is common and must be handled systematically.
4. Idempotency is critical
Pipelines should be safe to rerun without breaking data integrity.
Full solution and workflows:
https://github.com/stephandoh/zoomcamp27876