Over the past few weeks, I’ve been diving into the Data Engineering Zoomcamp by DataTalksClub, and I wanted to share a hands-on project I recently completed. It was an incredible exercise in building a reproducible workflow for managing real-world datasets, from ingestion to SQL analysis, and I learned a lot along the way.
Here’s how I structured my setup.
I used uv to manage Python dependencies and virtual environments instead of pip + virtualenv. It keeps everything reproducible and easy to run.
uv init --python 3.13
uv add pandas pyarrow sqlalchemy psycopg2-binary click requests tqdm
Key dependencies & what they do:
pandas → Load and manipulate tabular data
pyarrow → Efficiently read Parquet files
sqlalchemy → Connect and write to Postgres
psycopg2-binary → PostgreSQL driver for Python
click → Build reusable command-line interfaces
requests → Download datasets programmatically
tqdm → Show a progress bar while downloading large files
With uv, running a Python script becomes as simple as:
uv run python ingest_green_taxi.py
Dockerized infrastructure
To handle the database layer, I created a dedicated Docker network:
docker network create pg-network
This allowed Postgres and pgAdmin containers to communicate with each other using internal hostnames.
docker run -d \
--name pgdatabase \
--network pg-network \
-e POSTGRES_USER=root \
-e POSTGRES_PASSWORD=root \
-e POSTGRES_DB=ny_taxi \
-v ny_taxi_pgdata:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:16
Purpose: Store the green taxi and zone datasets, with the data persisted in the ny_taxi_pgdata volume so it survives container restarts.
docker run -d \
--name pgadmin \
--network pg-network \
-e PGADMIN_DEFAULT_EMAIL=admin@taxi.com \
-e PGADMIN_DEFAULT_PASSWORD=root \
-v pgadmin_data:/var/lib/pgadmin \
-p 8086:80 \
dpage/pgadmin4
Purpose: Provide a web UI to inspect and query the database.
Access it at http://localhost:8086 and register a new server pointing at the hostname pgdatabase (port 5432) with the Postgres credentials above; because both containers share pg-network, pgAdmin can reach the database by its container name.
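From the host side, a quick connection check confirms everything is wired up before any ingestion runs. This is a minimal sketch; the connection string is simply assembled from the credentials, database name, and port mapping in the docker run command above.

from sqlalchemy import create_engine, text

# Credentials, database name, and port come straight from the docker run flags above
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

with engine.connect() as conn:
    # SELECT version() is a cheap way to confirm the server is answering
    print(conn.execute(text("SELECT version()")).scalar())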
Two datasets were used:
Green taxi trips (Parquet)
NYC taxi zones (CSV)
Instead of manually downloading, I wrote Python scripts that:
Download the dataset if it doesn’t exist (requests + Path)
Load the data into pandas
Write directly to Postgres (sqlalchemy)
Support CLI options via click for reusable, parameterized runs
Show progress for large files using tqdm
This workflow allows me to re-run the scripts at any time without overwriting or duplicating data unnecessarily.
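To make that concrete, here is a minimal sketch of what such an ingestion script can look like. The URL handling, table name, and connection string are illustrative assumptions rather than the exact code in the repo.

# ingest_green_taxi.py (sketch): download a Parquet file if missing, then load it into Postgres
from pathlib import Path

import click
import pandas as pd
import requests
from sqlalchemy import create_engine
from tqdm import tqdm


@click.command()
@click.option("--url", required=True, help="Parquet file to download")
@click.option("--table", default="green_taxi_trips", help="Target Postgres table")
@click.option("--pg-dsn", default="postgresql://root:root@localhost:5432/ny_taxi")
def ingest(url: str, table: str, pg_dsn: str) -> None:
    """Download the dataset (if it isn't already on disk) and write it to Postgres."""
    local_file = Path(url.split("/")[-1])

    # Skip the download when the file already exists; otherwise stream it with a progress bar
    if not local_file.exists():
        resp = requests.get(url, stream=True, timeout=60)
        resp.raise_for_status()
        total = int(resp.headers.get("content-length", 0))
        with local_file.open("wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
                bar.update(len(chunk))

    # Read the Parquet file via pyarrow and write it to Postgres;
    # replacing the table keeps re-runs from piling up duplicate rows
    df = pd.read_parquet(local_file, engine="pyarrow")
    engine = create_engine(pg_dsn)
    df.to_sql(table, engine, if_exists="replace", index=False, chunksize=100_000)
    click.echo(f"Loaded {len(df):,} rows into {table}")


if __name__ == "__main__":
    ingest()

The zones CSV follows the same pattern, just with pd.read_csv instead of pd.read_parquet, and everything runs through uv (for example: uv run python ingest_green_taxi.py --url <parquet-url>) so the environment stays consistent.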
Once the data was in Postgres, I could run queries like:
Count trips less than 1 mile in November 2025
Find the day with the longest total trip distance
Identify the pickup zone with the highest revenue on a specific date
Find the largest tip from a given pickup zone
The workflow ensured that the database was query-ready before any analysis, and everything was reproducible from scratch.
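As an illustration of the first question, here is roughly how it can be run from Python against the same database. The table name matches the ingestion sketch above and the column names follow the public NYC green taxi schema, so treat both as assumptions.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Count trips shorter than one mile that were picked up in November 2025
query = """
    SELECT COUNT(*) AS short_trips
    FROM green_taxi_trips
    WHERE trip_distance < 1
      AND lpep_pickup_datetime >= '2025-11-01'
      AND lpep_pickup_datetime <  '2025-12-01'
"""

print(pd.read_sql(query, engine))

The other questions follow the same pattern, swapping in different aggregations (GROUP BY pickup day or zone, ORDER BY the metric, LIMIT 1).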
I also extended this setup to the cloud, using Terraform to create:
Google Cloud Storage buckets
BigQuery datasets
Provisioning and tearing down comes down to the standard Terraform commands:
terraform init
terraform apply -auto-approve
terraform destroy
This allowed me to automate cloud infrastructure, making it easy to spin up or tear down resources while keeping my local environment consistent.
Docker + pgAdmin = reproducible database environment
uv = modern, deterministic Python environment management
Python scripts = automated ingestion and CLI-friendly
Terraform = infrastructure as code for cloud resources
The workflow is reusable, scalable, and production-ready
Here’s my GitHub repo with all the scripts and Docker setup:
🔗 https://github.com/stephandoh/zoomcamp159879
Following along with this amazing free course — who else is learning data engineering?
Sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/