I just wrapped up Homework 4 of the Data Engineering Zoomcamp by @DataTalksClub, focused on Analytics Engineering with dbt. Here’s a breakdown of my workflow, the challenges I faced, and the solutions I implemented.
The first step was to download NYC Taxi data for 2019 and 2020 from the NYC TLC repository. I wrote a Python script to:
Download CSV files for Yellow, Green, and FHV taxis.
Upload them to my Google Cloud Storage bucket.
Prepare them for BigQuery by creating external tables.
Example: creating an external table for Green taxi 2019 and 2020 data:
CREATE OR REPLACE EXTERNAL TABLE `solar-router-483810-s0.homework4.yellow_taxi`
OPTIONS (
format = 'PARQUET',
uris = [
'gs://dezoomcamp_hw4_2026/yellow_tripdata_2019-*.parquet',
'gs://dezoomcamp_hw4_2026/yellow_tripdata_2020-*.parquet'
]
);
I set up dbt Core in VSCode using the BigQuery adapter. After configuring profiles.yml to point to my project and dataset, I could use dbt to:
Build staging, intermediate, and fact models
Test data quality
Manage model dependencies and lineage
To clean and standardize the data, I created staging models:
Example: Yellow taxi staging (stg_yellow_tripdata)
SELECT
-- identifiers
CAST(VendorID AS INT64) AS vendor_id,
CAST(RatecodeID AS FLOAT64) AS ratecode_id,
CAST(payment_type AS FLOAT64) AS payment_type,
CAST(NULL AS FLOAT64) AS trip_type, -- missing in Yellow Taxi
-- timestamps
CAST(tpep_pickup_datetime AS TIMESTAMP) AS pickup_datetime,
CAST(tpep_dropoff_datetime AS TIMESTAMP) AS dropoff_datetime,
-- locations
CAST(PULocationID AS INT64) AS pickup_location_id,
CAST(DOLocationID AS INT64) AS dropoff_location_id,
-- trip info
CAST(store_and_fwd_flag AS STRING) AS store_and_fwd_flag,
CAST(passenger_count AS FLOAT64) AS passenger_count,
CAST(trip_distance AS FLOAT64) AS trip_distance,
-- payment info
CAST(fare_amount AS FLOAT64) AS fare_amount,
CAST(extra AS FLOAT64) AS extra,
CAST(mta_tax AS FLOAT64) AS mta_tax,
CAST(tip_amount AS FLOAT64) AS tip_amount,
CAST(tolls_amount AS FLOAT64) AS tolls_amount,
CAST(improvement_surcharge AS FLOAT64) AS improvement_surcharge,
CAST(congestion_surcharge AS FLOAT64) AS congestion_surcharge,
CAST(total_amount AS FLOAT64) AS total_amount,
'yellow' AS taxi_color
FROM {{ source('raw_data', 'yellow_taxi') }}
WHERE VendorID IS NOT NULL
Added a taxi_color column to differentiate Yellow vs Green taxis.
Ensured all columns match the project naming convention.
FHV Taxi Staging (stg_fhv_tripdata) followed a similar approach:
SELECT
dispatching_base_num,
pickup_datetime,
dropOff_datetime AS dropoff_datetime,
PUlocationID AS pickup_location_id,
DOlocationID AS dropoff_location_id,
SR_Flag AS sr_flag,
Affiliated_base_number AS affiliated_base_number
FROM {{ source('raw_data', 'fhv_tripdata_2019') }}
WHERE dispatching_base_num IS NOT NULL
I built int_trips_unioned to union all taxis:
WITH green_tripdata AS (
SELECT * FROM {{ ref('stg_green_tripdata') }}
),
yellow_tripdata AS (
SELECT * FROM {{ ref('stg_yellow_tripdata') }}
),
trips_unioned AS (
SELECT * FROM green_tripdata
UNION ALL
SELECT * FROM yellow_tripdata
)
SELECT * FROM trips_unioned
Then I created fact tables like fct_trips and fct_monthly_zone_revenue to aggregate data for analytics:
fct_trips: all taxi trips
fct_monthly_zone_revenue: monthly total revenue per zone, calculated from total_amount grouped by pickup zone and month
Example query to create fct_monthly_zone_revenue:
WITH trips AS (
SELECT
pickup_location_id,
EXTRACT(YEAR FROM pickup_datetime) AS year,
EXTRACT(MONTH FROM pickup_datetime) AS month,
taxi_color,
total_amount
FROM {{ ref('fct_trips') }}
)
SELECT
t.pickup_location_id,
z.zone AS pickup_zone_name,
t.year,
t.month,
t.taxi_color,
COUNT(*) AS total_monthly_trips,
SUM(total_amount) AS revenue_monthly_total_amount
FROM trips t
LEFT JOIN {{ ref('dim_zones') }} z
ON t.pickup_location_id = z.location_id
GROUP BY 1, 2, 3, 4, 5
ORDER BY revenue_monthly_total_amount DESC
SELECT COUNT(*) AS count_fct_monthly_zone_revenue
FROM {{ ref('fct_monthly_zone_revenue') }};
SELECT dz.zone AS pickup_zone,
SUM(fm.revenue_monthly_total_amount) AS total_revenue
FROM {{ ref('fct_monthly_zone_revenue') }} AS fm
JOIN {{ ref('dim_zones') }} AS dz
ON fm.pickup_location_id = dz.location_id
WHERE fm.taxi_color = 'green'
AND fm.year = 2020
GROUP BY dz.zone
ORDER BY total_revenue DESC
LIMIT 1;
SELECT
SUM(total_monthly_trips) AS total_green_taxi_trips_oct2019
FROM {{ ref('fct_monthly_zone_revenue') }}
WHERE taxi_color = 'green'
AND year = 2019
AND month = 10;
-- stg_fhv_tripdata.sql
SELECT
dispatching_base_num,
pickup_datetime,
dropOff_datetime,
PUlocationID AS pickup_location_id,
DOlocationID AS dropoff_location_id,
SR_Flag,
Affiliated_base_number
FROM {{ source('raw_data', 'fhv_tripdata_2019') }}
WHERE dispatching_base_num IS NOT NULL;
Count query:
SELECT COUNT(*) AS count_stg_fhv_tripdata
FROM {{ ref('stg_fhv_tripdata') }};
Learned to stage raw data before creating fact tables
Managed column inconsistencies (like missing fields and adding taxi_color)
Built intermediate and fact tables for analytics-ready data
Used dbt to manage dependencies, run tests, and materialize models
My solution: https://github.com/stephandoh/zoomcamp4135556
Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/