In this project, I worked with NYC Yellow Taxi trip data (Jan–Jun 2024) to explore how BigQuery storage design decisions directly affect query performance and cost.
The focus wasn’t just on writing SQL; it was on understanding how BigQuery thinks: columnar storage, external vs native tables, partitioning, clustering, and metadata optimizations.
BigQuery Setup
-- Create an external table using the Yellow Taxi Trip Records.
CREATE OR REPLACE EXTERNAL TABLE `project.homework_3.yellow_taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/yellow_tripdata_2024-*.parquet']
);
-- Create a regular (native) table in BigQuery from the Yellow Taxi Trip Records
-- (do not partition or cluster this table).
CREATE OR REPLACE TABLE `project.homework_3.yellow_taxi_native` AS
SELECT *
FROM `project.homework_3.yellow_taxi_external`;
Total records (Jan–Jun 2024):
-- What is count of records for the 2024 Yellow Taxi Data?
SELECT COUNT(*)
FROM `yellow_taxi_native`;
Answer: 20,332,093 rows
This immediately puts us in a large-scale analytics context where storage design matters.
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the external table and the materialized table?
SELECT DISTINCT PULocationID
FROM `yellow_taxi_native`;
SELECT DISTINCT PULocationID
FROM `yellow_taxi_external`;
Estimated bytes processed:
External Table: 0 MB
Materialized Table: 155.12 MB
Why this matters:
For the native table, BigQuery keeps column-level metadata, so it can estimate the scan (155.12 MB for PULocationID) before running anything. For the external table, the data lives in GCS files outside BigQuery's own storage, so the estimator has no statistics to work with and reports 0 B up front; the actual bytes read are only known after the query executes.
Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery.
SELECT PULocationID FROM yellow_taxi_native;
-- This query will process 155.12 MB when run.
Now write a query to retrieve the PULocationID and DOLocationID on the same table.
SELECT PULocationID, DOLocationID FROM yellow_taxi_native;
-- This query will process 310.24 MB when run.
Why more bytes in the second query?
BigQuery is columnar:
Query 1 → reads 1 column
Query 2 → reads 2 columns
More columns requested = more data scanned = higher cost.
This is why selecting only needed fields is a best practice.
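The two estimates line up with simple per-column math: PULocationID and DOLocationID are stored as INT64 (8 bytes per value), and BigQuery reports sizes in binary megabytes. A quick sanity check, assuming both are INT64 columns and using the row count from earlier:

```python
# Back-of-the-envelope check of BigQuery's byte estimates,
# assuming PULocationID and DOLocationID are INT64 (8 bytes per value).
ROWS = 20_332_093          # total records, Jan-Jun 2024
BYTES_PER_INT64 = 8

def scanned_mib(num_columns: int) -> float:
    """Estimated MiB scanned when selecting `num_columns` INT64 columns."""
    return ROWS * BYTES_PER_INT64 * num_columns / 2**20

print(f"1 column:  {scanned_mib(1):.2f} MiB")   # matches the 155.12 MB estimate
print(f"2 columns: {scanned_mib(2):.2f} MiB")   # matches the 310.24 MB estimate
```

The match down to two decimal places is a nice confirmation that the scan estimate is literally rows × bytes-per-value × columns, with nothing else involved.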
How many records have a fare_amount of 0?
SELECT COUNT(*)
FROM yellow_taxi_native
WHERE fare_amount = 0;
Answer: 8,333 trips
Zero-fare trips can indicate promotions, system issues, or data quality anomalies. Small queries like this often reveal business-relevant insights.
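As a share of the full dataset, zero-fare trips are a tiny sliver, which is easy to confirm from the two counts above:

```python
# Share of zero-fare trips in the full dataset (counts from the queries above).
TOTAL_TRIPS = 20_332_093
ZERO_FARE_TRIPS = 8_333

share = ZERO_FARE_TRIPS / TOTAL_TRIPS
print(f"{share:.4%} of trips have fare_amount = 0")  # roughly 0.04%
```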
What is the best strategy to make an optimized table in BigQuery if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)
CREATE OR REPLACE TABLE `yellow_taxi_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT *
FROM `yellow_taxi_native`;
Why this works
Partitioning by DATE(tpep_dropoff_datetime) lets BigQuery prune every daily partition outside the filter range, while clustering by VendorID sorts rows within each partition, making filters, grouping, and ordering on that column cheaper.
Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive). Use the materialized table you created earlier in your FROM clause and note the estimated bytes. Now change the table in the FROM clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values?
-- Materialized (non-partitioned) table:
SELECT DISTINCT VendorID
FROM `project.homework_3.yellow_taxi_native`
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-03-16';  -- inclusive of Mar 15
-- Estimated: 310.24 MB (full scan of both columns)

-- Partitioned table:
SELECT DISTINCT VendorID
FROM `project.homework_3.yellow_taxi_optimized`
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-03-16';  -- inclusive of Mar 15
-- Estimated: 26.84 MB (only the Mar 1-15 partitions)
That's more than a 10× reduction.
Partitioning enables partition pruning—BigQuery skips data outside the date range.
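The measured numbers are consistent with pruning down to roughly 15 daily partitions out of six months of data. A quick check of the reduction factor, using the two estimates reported above:

```python
# Reduction from partition pruning, using the byte estimates reported above.
FULL_SCAN_MIB = 310.24    # native table: scans both columns for all 6 months
PRUNED_SCAN_MIB = 26.84   # partitioned table: only the Mar 1-15 partitions

reduction = FULL_SCAN_MIB / PRUNED_SCAN_MIB
print(f"{reduction:.1f}x less data scanned")  # about 11.6x
```

On on-demand pricing, where you pay per byte scanned, that ratio translates directly into an 11.6× cheaper query.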
Where is the data stored in the External Table you created?
Answer: The data is stored in Google Cloud Storage, not in BigQuery.
The external table is just a schema definition that points to files in the data lake.
Is it best practice in BigQuery to always cluster your data?
Answer: No.
Clustering is powerful, but:
It helps when queries frequently filter or aggregate on the clustered columns
It adds little for small tables (roughly under 1 GB, per Google's guidance) or rarely queried datasets
Optimization should match query patterns, not be applied blindly.
Write a SELECT count(*) query FROM the materialized table you created. How many bytes does it estimate will be read? Why?
SELECT COUNT(*)
FROM yellow_taxi_native;
Estimated bytes processed: 0 B
BigQuery stores the row count in table metadata, so COUNT(*) is answered without scanning any column data. This shortcut doesn't apply to COUNT(column_name), which must read the column to skip NULLs.
Key takeaways
Storage design impacts cost as much as SQL does
Columnar databases reward selecting fewer columns
Partitioning is critical for time-based data
Clustering improves performance but is situational
External tables = data lake integration
BigQuery uses metadata shortcuts for some operations
Here’s my homework solution: https://github.com/stephandoh/zoomcamp37899
Following along with this amazing free course — who else is learning data engineering?
You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/