Data engineering often starts with batch processing: reading static datasets, transforming them, and saving the results for analytics. In this article, we build a batch pipeline for NYC Yellow Taxi trip data using Apache Spark and PySpark. The exercise is based on Module 6 of the Data Engineering Zoomcamp and demonstrates practical batch-processing techniques.
Spark 4.x requires Java 17, so the first step was to install the Adoptium JDK 17:
wget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.18%2B8/OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip
unzip OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip -d /c/tools/
We then configured the environment variables in Bash:
export JAVA_HOME="/c/tools/jdk-17.0.18+8"
export PATH="${JAVA_HOME}/bin:${PATH}"
Verification:
java --version
Next, we installed PySpark, which provides the Python API for Spark:
pip install pyspark
This installation includes a bundled Spark distribution, so no separate Spark download is required.
A Spark session is necessary to work with Spark DataFrames:
This confirms that Spark is ready to process data locally.
For this assignment, we used the Yellow Taxi trip data for November 2025 and loaded the Parquet file into a Spark DataFrame:
The dataset contains key fields such as tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, and fare_amount.
Batch processing is typically done on a static dataset. We repartitioned the data into 4 partitions to optimize parallel processing and then saved it as Parquet files:
Repartitioning lets Spark write one output file per partition in parallel, improving I/O throughput.
To count trips on November 15, 2025:
Calculating the Longest Trip
We added a new column to compute trip duration in hours:
Enriching Data with Taxi Zones
To find the least frequent pickup location, we used the NYC Taxi zone lookup file:
Answering further questions with Taxi Zones
Least frequent pickup location zone:
Spark User Interface
While the batch pipeline is running, Spark provides a UI dashboard on port 4040 to monitor jobs, stages, and tasks. This is useful for debugging and performance analysis.
Marimo Notebook
Marimo is a reactive Python notebook, similar in spirit to Jupyter. To open a new notebook, use the 'edit' command, e.g. marimo edit notebook.py
Key Takeaways
Batch processing works on static datasets and is suitable for scheduled analytics or ETL tasks.
Repartitioning improves performance by parallelizing tasks and output file writing.
Data enrichment can be done by joining against small lookup tables, which Spark can broadcast to avoid shuffles.
Spark SQL allows powerful aggregation and filtering operations without writing complex loops.
Spark’s UI dashboard provides visibility into job execution and resource usage.
Here’s my homework solution: https://github.com/stephandoh/zoomcamp642497
You can sign up for this amazing course here: https://github.com/DataTalksClub/data-engineering-zoomcamp/