In this project, I worked with NYC Yellow Taxi trip data (Jan–Jun 2024) to explore how BigQuery storage design decisions directly affect query performance and cost.
The focus wasn’t just on writing SQL; it was on understanding how BigQuery thinks: columnar storage, external vs native tables, partitioning, clustering, and metadata optimizations.
BigQuery Setup
-- Create an external table using the Yellow Taxi Trip Records.
CREATE OR REPLACE EXTERNAL TABLE `project.homework_3.yellow_taxi_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://bucket/yellow_tripdata_2024-*.parquet']
);
-- Create a regular (native) table in BigQuery from the Yellow Taxi Trip Records
-- (do not partition or cluster this table).
CREATE OR REPLACE TABLE `project.homework_3.yellow_taxi_native` AS
SELECT *
FROM `project.homework_3.yellow_taxi_external`;
Total records (Jan–Jun 2024):
-- What is count of records for the 2024 Yellow Taxi Data?
SELECT COUNT(*)
FROM `yellow_taxi_native`;
Answer: 20,332,093 rows
This immediately puts us in a large-scale analytics context where storage design matters.
Write a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.
What is the estimated amount of data that will be read when this query is executed on the external table and the materialized table?
SELECT DISTINCT PULocationID
FROM `yellow_taxi_native`;
SELECT DISTINCT PULocationID
FROM `yellow_taxi_external`;
Estimated bytes processed:
External Table: 0 MB
Materialized Table: 155.12 MB
Why this matters:
For the native table, BigQuery keeps column-level metadata, so it can estimate the scan (155.12 MB for PULocationID) before running anything. For the external table, the data lives in GCS files outside BigQuery's own storage, so the estimator has no statistics to work with and reports 0 B up front; the actual bytes read are only known after the query executes.
Write a query to retrieve the PULocationID from the table (not the external table) in BigQuery.
SELECT PULocationID FROM yellow_taxi_native;
-- This query will process 155.12 MB when run.
Now write a query to retrieve the PULocationID and DOLocationID on the same table.
SELECT PULocationID, DOLocationID FROM yellow_taxi_native;
-- This query will process 310.24 MB when run.
Why more bytes in the second query?
BigQuery is columnar:
Query 1 → reads 1 column
Query 2 → reads 2 columns
More columns requested = more data scanned = higher cost.
This is why selecting only needed fields is a best practice.
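The two estimates line up with simple per-column math: PULocationID and DOLocationID are stored as INT64 (8 bytes per value), and BigQuery reports sizes in binary megabytes. A quick sanity check, assuming both are INT64 columns and using the row count from earlier:

```python
# Back-of-the-envelope check of BigQuery's byte estimates,
# assuming PULocationID and DOLocationID are INT64 (8 bytes per value).
ROWS = 20_332_093          # total records, Jan-Jun 2024
BYTES_PER_INT64 = 8

def scanned_mib(num_columns: int) -> float:
    """Estimated MiB scanned when selecting `num_columns` INT64 columns."""
    return ROWS * BYTES_PER_INT64 * num_columns / 2**20

print(f"1 column:  {scanned_mib(1):.2f} MiB")   # matches the 155.12 MB estimate
print(f"2 columns: {scanned_mib(2):.2f} MiB")   # matches the 310.24 MB estimate
```

The match down to two decimal places is a nice confirmation that the scan estimate is literally rows × bytes-per-value × columns, with nothing else involved.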
How many records have a fare_amount of 0?
SELECT COUNT(*)
FROM yellow_taxi_native
WHERE fare_amount = 0;
Answer: 8,333 trips
Zero-fare trips can indicate promotions, system issues, or data quality anomalies. Small queries like this often reveal business-relevant insights.
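As a share of the full dataset, zero-fare trips are a tiny sliver, which is easy to confirm from the two counts above:

```python
# Share of zero-fare trips in the full dataset (counts from the queries above).
TOTAL_TRIPS = 20_332_093
ZERO_FARE_TRIPS = 8_333

share = ZERO_FARE_TRIPS / TOTAL_TRIPS
print(f"{share:.4%} of trips have fare_amount = 0")  # roughly 0.04%
```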
What is the best strategy to make an optimized table in BigQuery if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID? (Create a new table with this strategy.)
CREATE OR REPLACE TABLE `yellow_taxi_optimized`
PARTITION BY DATE(tpep_dropoff_datetime)
CLUSTER BY VendorID AS
SELECT *
FROM `yellow_taxi_native`;
Why this works
Partitioning by DATE(tpep_dropoff_datetime) lets BigQuery prune every daily partition outside the filter range, while clustering by VendorID sorts rows within each partition, making filters, grouping, and ordering on that column cheaper.
Write a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive). Use the materialized table you created earlier in your FROM clause and note the estimated bytes. Now change the table in the FROM clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values?
-- Materialized (non-partitioned) table:
SELECT DISTINCT VendorID
FROM `project.homework_3.yellow_taxi_native`
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-03-16';  -- inclusive of Mar 15
-- Estimated: 310.24 MB (full scan of both columns)

-- Partitioned table:
SELECT DISTINCT VendorID
FROM `project.homework_3.yellow_taxi_optimized`
WHERE tpep_dropoff_datetime >= '2024-03-01'
  AND tpep_dropoff_datetime < '2024-03-16';  -- inclusive of Mar 15
-- Estimated: 26.84 MB (only the Mar 1-15 partitions)
That's more than a 10× reduction.
Partitioning enables partition pruning—BigQuery skips data outside the date range.
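The measured numbers are consistent with pruning down to roughly 15 daily partitions out of six months of data. A quick check of the reduction factor, using the two estimates reported above:

```python
# Reduction from partition pruning, using the byte estimates reported above.
FULL_SCAN_MIB = 310.24    # native table: scans both columns for all 6 months
PRUNED_SCAN_MIB = 26.84   # partitioned table: only the Mar 1-15 partitions

reduction = FULL_SCAN_MIB / PRUNED_SCAN_MIB
print(f"{reduction:.1f}x less data scanned")  # about 11.6x
```

On on-demand pricing, where you pay per byte scanned, that ratio translates directly into an 11.6× cheaper query.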
Where is the data stored in the External Table you created?
Answer: The data is stored in Google Cloud Storage, not in BigQuery.
The external table is just a schema definition that points to files in the data lake.
Is it best practice in BigQuery to always cluster your data?
Answer: No.
Clustering is powerful, but:
It helps when queries frequently filter or aggregate on the clustered columns
It adds little for small tables (roughly under 1 GB, per Google's guidance) or rarely queried datasets
Optimization should match query patterns, not be applied blindly.
Write a SELECT count(*) query FROM the materialized table you created. How many bytes does it estimate will be read? Why?
SELECT COUNT(*)
FROM yellow_taxi_native;
Estimated bytes processed: 0 B
BigQuery stores the row count in table metadata, so COUNT(*) is answered without scanning any column data. This shortcut doesn't apply to COUNT(column_name), which must read the column to skip NULLs.
Key takeaways
Storage design impacts cost as much as SQL does
Columnar databases reward selecting fewer columns
Partitioning is critical for time-based data
Clustering improves performance but is situational
External tables = data lake integration
BigQuery uses metadata shortcuts for some operations
Here’s my homework solution: https://github.com/stephandoh/zoomcamp37899
Following along with this amazing free course — who else is learning data engineering?
You can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/