Over the past few weeks, I’ve been diving into the Data Engineering Zoomcamp by DataTalksClub, and I wanted to share a hands-on project I recently completed. It was an incredible exercise in building a reproducible workflow for managing real-world datasets, from ingestion to SQL analysis, and I learned a lot along the way.
Here’s how I structured my setup.
I used uv to manage Python dependencies and virtual environments instead of pip + virtualenv. It keeps everything reproducible and easy to run.
uv init --python 3.13
uv add pandas pyarrow sqlalchemy psycopg2-binary click requests tqdm
Key dependencies & what they do:
pandas → Load and manipulate tabular data
pyarrow → Efficiently read Parquet files
sqlalchemy → Connect and write to Postgres
psycopg2-binary → PostgreSQL driver for Python
click → Build reusable command-line interfaces
requests → Download datasets programmatically
tqdm → Show a progress bar while downloading large files
With uv, running a Python script becomes as simple as:
uv run python ingest_green_taxi.py
Dockerized infrastructure
To handle the database layer, I created a dedicated Docker network:
docker network create pg-network
This allowed Postgres and pgAdmin containers to communicate with each other using internal hostnames.
docker run -d \
--name pgdatabase \
--network pg-network \
-e POSTGRES_USER=root \
-e POSTGRES_PASSWORD=root \
-e POSTGRES_DB=ny_taxi \
-v ny_taxi_pgdata:/var/lib/postgresql/data \
-p 5432:5432 \
postgres:16
Purpose: Store the green taxi and zone datasets, with the data persisted in the ny_taxi_pgdata volume so it survives container restarts.
docker run -d \
--name pgadmin \
--network pg-network \
-e PGADMIN_DEFAULT_EMAIL=admin@taxi.com \
-e PGADMIN_DEFAULT_PASSWORD=root \
-v pgadmin_data:/var/lib/pgadmin \
-p 8086:80 \
dpage/pgadmin4
Purpose: Provide a web UI to inspect and query the database.
Access it at http://localhost:8086 and register a new server pointing at the hostname pgdatabase (port 5432) with the Postgres credentials above; because both containers share pg-network, pgAdmin can reach the database by its container name.
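From the host side, a quick connection check confirms everything is wired up before any ingestion runs. This is a minimal sketch; the connection string is simply assembled from the credentials, database name, and port mapping in the docker run command above.

from sqlalchemy import create_engine, text

# Credentials, database name, and port come straight from the docker run flags above
engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

with engine.connect() as conn:
    # SELECT version() is a cheap way to confirm the server is answering
    print(conn.execute(text("SELECT version()")).scalar())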
Two datasets were used:
Green taxi trips (Parquet)
NYC taxi zones (CSV)
Instead of manually downloading, I wrote Python scripts that:
Download the dataset if it doesn’t exist (requests + Path)
Load the data into pandas
Write directly to Postgres (sqlalchemy)
Support CLI options via click for reusable, parameterized runs
Show progress for large files using tqdm
This workflow allows me to re-run the scripts at any time without overwriting or duplicating data unnecessarily.
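To make that concrete, here is a minimal sketch of what such an ingestion script can look like. The URL handling, table name, and connection string are illustrative assumptions rather than the exact code in the repo.

# ingest_green_taxi.py (sketch): download a Parquet file if missing, then load it into Postgres
from pathlib import Path

import click
import pandas as pd
import requests
from sqlalchemy import create_engine
from tqdm import tqdm


@click.command()
@click.option("--url", required=True, help="Parquet file to download")
@click.option("--table", default="green_taxi_trips", help="Target Postgres table")
@click.option("--pg-dsn", default="postgresql://root:root@localhost:5432/ny_taxi")
def ingest(url: str, table: str, pg_dsn: str) -> None:
    """Download the dataset (if it isn't already on disk) and write it to Postgres."""
    local_file = Path(url.split("/")[-1])

    # Skip the download when the file already exists; otherwise stream it with a progress bar
    if not local_file.exists():
        resp = requests.get(url, stream=True, timeout=60)
        resp.raise_for_status()
        total = int(resp.headers.get("content-length", 0))
        with local_file.open("wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
            for chunk in resp.iter_content(chunk_size=1024 * 1024):
                f.write(chunk)
                bar.update(len(chunk))

    # Read the Parquet file via pyarrow and write it to Postgres;
    # replacing the table keeps re-runs from piling up duplicate rows
    df = pd.read_parquet(local_file, engine="pyarrow")
    engine = create_engine(pg_dsn)
    df.to_sql(table, engine, if_exists="replace", index=False, chunksize=100_000)
    click.echo(f"Loaded {len(df):,} rows into {table}")


if __name__ == "__main__":
    ingest()

The zones CSV follows the same pattern, just with pd.read_csv instead of pd.read_parquet, and everything runs through uv (for example: uv run python ingest_green_taxi.py --url <parquet-url>) so the environment stays consistent.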
Once the data was in Postgres, I could run queries like:
Count trips less than 1 mile in November 2025
Find the day with the longest total trip distance
Identify the pickup zone with the highest revenue on a specific date
Find the largest tip from a given pickup zone
The workflow ensured that the database was query-ready before any analysis, and everything was reproducible from scratch.
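As an illustration of the first question, here is roughly how it can be run from Python against the same database. The table name matches the ingestion sketch above and the column names follow the public NYC green taxi schema, so treat both as assumptions.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Count trips shorter than one mile that were picked up in November 2025
query = """
    SELECT COUNT(*) AS short_trips
    FROM green_taxi_trips
    WHERE trip_distance < 1
      AND lpep_pickup_datetime >= '2025-11-01'
      AND lpep_pickup_datetime <  '2025-12-01'
"""

print(pd.read_sql(query, engine))

The other questions follow the same pattern, swapping in different aggregations (GROUP BY pickup day or zone, ORDER BY the metric, LIMIT 1).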
I also extended this setup to the cloud, using Terraform to create:
Google Cloud Storage buckets
BigQuery datasets
Provisioning and tearing down comes down to the standard Terraform commands:
terraform init
terraform apply -auto-approve
terraform destroy
This allowed me to automate cloud infrastructure, making it easy to spin up or tear down resources while keeping my local environment consistent.
Docker + pgAdmin = reproducible database environment
uv = modern, deterministic Python environment management
Python scripts = automated ingestion and CLI-friendly
Terraform = infrastructure as code for cloud resources
The workflow is reusable, scalable, and production-ready
Here’s my GitHub repo with all the scripts and Docker setup:
🔗 https://github.com/stephandoh/zoomcamp159879
Following along with this amazing free course — who else is learning data engineering?
Sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/