Data engineering often starts with batch processing: reading static datasets, transforming them, and saving the results for analytics. In this article, we build a batch pipeline for NYC Yellow Taxi trip data using Apache Spark and PySpark. The exercise is based on Module 6 of the Data Engineering Zoomcamp and demonstrates practical batch-processing techniques.
Spark 4.x requires Java 17, so the first step was to install the Adoptium JDK 17:
wget https://github.com/adoptium/temurin17-binaries/releases/download/jdk-17.0.18%2B8/OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip
unzip OpenJDK17U-jdk_x64_windows_hotspot_17.0.18_8.zip -d /c/tools/
We then configured the environment variables in Bash:
export JAVA_HOME="/c/tools/jdk-17.0.18+8"
export PATH="${JAVA_HOME}/bin:${PATH}"
Verification:
java --version
Next, we installed PySpark, which provides the Python API for Spark:
pip install pyspark
This installation includes a bundled Spark distribution, so no separate Spark download is required.
A Spark session is necessary to work with Spark DataFrames:
This confirms that Spark is ready to process data locally.
For this assignment, we used the Yellow Taxi trip data for November 2025 and loaded the Parquet file into a Spark DataFrame:
The dataset contains key fields such as tpep_pickup_datetime, tpep_dropoff_datetime, trip_distance, and fare_amount.
Batch processing is typically done on a static dataset. We repartitioned the data into 4 partitions to optimize parallel processing and then saved it as Parquet files:
Repartitioning lets Spark write one output file per partition in parallel, improving I/O throughput.
To count trips on November 15, 2025:
Calculating the Longest Trip
We added a new column to compute trip duration in hours:
Enriching Data with Taxi Zones
To find the least frequent pickup location, we used the NYC Taxi zone lookup file:
Answering further questions with Taxi Zones
Least frequent pickup location zone:
Spark User Interface
While the batch pipeline is running, Spark provides a UI dashboard on port 4040 to monitor jobs, stages, and tasks. This is useful for debugging and performance analysis.
Marimo Notebook
Marimo is a reactive Python notebook, similar in spirit to Jupyter. To open a new notebook, use the 'edit' command, e.g. marimo edit notebook.py
Key Takeaways
Batch processing works on static datasets and is suitable for scheduled analytics or ETL tasks.
Repartitioning improves performance by parallelizing tasks and output file writing.
Data enrichment can be done by joining against small lookup tables, which Spark can broadcast to avoid shuffles.
Spark SQL allows powerful aggregation and filtering operations without writing complex loops.
Spark’s UI dashboard provides visibility into job execution and resource usage.
Here’s my homework solution: https://github.com/stephandoh/zoomcamp642497
You can sign up for this amazing course here: https://github.com/DataTalksClub/data-engineering-zoomcamp/