analytics engineering with dbt

Data Engineering Zoomcamp: Homework 4: Analytics Engineering with dbt

I just wrapped up Homework 4 of the Data Engineering Zoomcamp by @DataTalksClub, focused on Analytics Engineering with dbt. Here’s a breakdown of my workflow, the challenges I faced, and the solutions I implemented.

Loading NYC Taxi Data into BigQuery

The first step was to download NYC Taxi data for 2019 and 2020 from the NYC TLC repository. I wrote a Python script to:

Download CSV files for Yellow, Green, and FHV taxis.
Upload them to my Google Cloud Storage bucket.
Prepare them for BigQuery by creating external tables.

Example: creating an external table for Green taxi 2019 and 2020 data:

CREATE OR REPLACE EXTERNAL TABLE `solar-router-483810-s0.homework4.yellow_taxi`

OPTIONS (

format = 'PARQUET',

uris = [

'gs://dezoomcamp_hw4_2026/yellow_tripdata_2019-*.parquet',

'gs://dezoomcamp_hw4_2026/yellow_tripdata_2020-*.parquet'

]

);

Connecting dbt Core to BigQuery

I set up dbt Core in VSCode using the BigQuery adapter. After configuring profiles.yml to point to my project and dataset, I could use dbt to:

Build staging, intermediate, and fact models
Test data quality
Manage model dependencies and lineage

Staging Models

To clean and standardize the data, I created staging models:

Example: Yellow taxi staging (stg_yellow_tripdata)

SELECT

-- identifiers

CAST(VendorID AS INT64) AS vendor_id,

CAST(RatecodeID AS FLOAT64) AS ratecode_id,

CAST(payment_type AS FLOAT64) AS payment_type,

CAST(NULL AS FLOAT64) AS trip_type, -- missing in Yellow Taxi

-- timestamps

CAST(tpep_pickup_datetime AS TIMESTAMP) AS pickup_datetime,

CAST(tpep_dropoff_datetime AS TIMESTAMP) AS dropoff_datetime,

-- locations

CAST(PULocationID AS INT64) AS pickup_location_id,

CAST(DOLocationID AS INT64) AS dropoff_location_id,

-- trip info

CAST(store_and_fwd_flag AS STRING) AS store_and_fwd_flag,

CAST(passenger_count AS FLOAT64) AS passenger_count,

CAST(trip_distance AS FLOAT64) AS trip_distance,

-- payment info

CAST(fare_amount AS FLOAT64) AS fare_amount,

CAST(extra AS FLOAT64) AS extra,

CAST(mta_tax AS FLOAT64) AS mta_tax,

CAST(tip_amount AS FLOAT64) AS tip_amount,

CAST(tolls_amount AS FLOAT64) AS tolls_amount,

CAST(improvement_surcharge AS FLOAT64) AS improvement_surcharge,

CAST(congestion_surcharge AS FLOAT64) AS congestion_surcharge,

CAST(total_amount AS FLOAT64) AS total_amount,

'yellow' AS taxi_color

FROM {{ source('raw_data', 'yellow_taxi') }}

WHERE VendorID IS NOT NULL

Added a taxi_color column to differentiate Yellow vs Green taxis.
Ensured all columns match the project naming convention.

FHV Taxi Staging (stg_fhv_tripdata) followed a similar approach:

SELECT

dispatching_base_num,

pickup_datetime,

dropOff_datetime AS dropoff_datetime,

PUlocationID AS pickup_location_id,

DOlocationID AS dropoff_location_id,

SR_Flag AS sr_flag,

Affiliated_base_number AS affiliated_base_number

FROM {{ source('raw_data', 'fhv_tripdata_2019') }}

WHERE dispatching_base_num IS NOT NULL

Intermediate and Fact Models

I built int_trips_unioned to union all taxis:

WITH green_tripdata AS (

SELECT * FROM {{ ref('stg_green_tripdata') }}

yellow_tripdata AS (

SELECT * FROM {{ ref('stg_yellow_tripdata') }}

trips_unioned AS (

SELECT * FROM green_tripdata

UNION ALL

SELECT * FROM yellow_tripdata

)

SELECT * FROM trips_unioned

Then I created fact tables like fct_trips and fct_monthly_zone_revenue to aggregate data for analytics:

fct_trips: all taxi trips
fct_monthly_zone_revenue: monthly total revenue per zone, calculated from total_amount grouped by pickup zone and month

Example query to create fct_monthly_zone_revenue:

WITH trips AS (

SELECT

pickup_location_id,

EXTRACT(YEAR FROM pickup_datetime) AS year,

EXTRACT(MONTH FROM pickup_datetime) AS month,

taxi_color,

total_amount

FROM {{ ref('fct_trips') }}

)

SELECT

t.pickup_location_id,

z.zone AS pickup_zone_name,

t.year,

t.month,

t.taxi_color,

COUNT(*) AS total_monthly_trips,

SUM(total_amount) AS revenue_monthly_total_amount

FROM trips t

LEFT JOIN {{ ref('dim_zones') }} z

ON t.pickup_location_id = z.location_id

GROUP BY 1, 2, 3, 4, 5

ORDER BY revenue_monthly_total_amount DESC

Question 3: Count all records in fct_monthly_zone_revenue

SELECT COUNT(*) AS count_fct_monthly_zone_revenue

FROM {{ ref('fct_monthly_zone_revenue') }};

Question 4: Best Performing Zone for Green Taxis (2020)

SELECT dz.zone AS pickup_zone,

SUM(fm.revenue_monthly_total_amount) AS total_revenue

FROM {{ ref('fct_monthly_zone_revenue') }} AS fm

JOIN {{ ref('dim_zones') }} AS dz

ON fm.pickup_location_id = dz.location_id

WHERE fm.taxi_color = 'green'

AND fm.year = 2020

GROUP BY dz.zone

ORDER BY total_revenue DESC

LIMIT 1;

Question 5: Total Green Taxi Trips in October 2019

SELECT

SUM(total_monthly_trips) AS total_green_taxi_trips_oct2019

FROM {{ ref('fct_monthly_zone_revenue') }}

WHERE taxi_color = 'green'

AND year = 2019

AND month = 10;

Question 6: Staging Model for FHV Trips (2019)

-- stg_fhv_tripdata.sql

SELECT

dispatching_base_num,

pickup_datetime,

dropOff_datetime,

PUlocationID AS pickup_location_id,

DOlocationID AS dropoff_location_id,

SR_Flag,

Affiliated_base_number

FROM {{ source('raw_data', 'fhv_tripdata_2019') }}

WHERE dispatching_base_num IS NOT NULL;

Count query:

SELECT COUNT(*) AS count_stg_fhv_tripdata

FROM {{ ref('stg_fhv_tripdata') }};

5 Key Learnings

Learned to stage raw data before creating fact tables
Managed column inconsistencies (like missing fields and adding taxi_color)
Built intermediate and fact tables for analytics-ready data
Used dbt to manage dependencies, run tests, and materialize models

My solution: https://github.com/stephandoh/zoomcamp4135556

Free course by @DataTalksClub: https://github.com/DataTalksClub/data-engineering-zoomcamp/

Page updated

Google Sites

Report abuse