In this project, I explored the full lifecycle of building a custom DLT pipeline from scratch, loading real-world NYC taxi trip data, and performing analysis using Python and DuckDB. The exercise demonstrated how modern data engineering tools, APIs, and AI assistants can work together to simplify the process of building robust data pipelines.
The goal was to load NYC Yellow Taxi trip data from a custom REST API into a DuckDB database using a DLT pipeline and then answer analysis questions and visualize insights. Since this API had no pre-existing scaffold in DLT, I worked with the AI assistant (via the MCP server in VS Code Copilot) to build the pipeline from scratch.
1. Create a New Project
mkdir taxi-pipeline
cd taxi-pipeline
2. Set Up the dlt MCP Server (If Not Already Done)
In my case I use VS Code, so I created .vscode/mcp.json in my project folder:
{
  "servers": {
    "dlt": {
      "command": "uv",
      "args": [
        "run",
        "--with",
        "dlt[duckdb]",
        "--with",
        "dlt-mcp[search]",
        "python",
        "-m",
        "dlt_mcp"
      ]
    }
  }
}
3. Install dlt
pip install "dlt[workspace]"
4. Initialize the Project
dlt init dlthub:taxi_pipeline duckdb
5. Prompt the Agent
Since there was no pre-built scaffold, I used my AI assistant (VS Code Copilot) to build the pipeline, providing the API details in my prompt:
Build a REST API source for NYC taxi data.
API details:
- Base URL: https://us-central1-dlthub-analytics.cloudfunctions.net/data_engineering_zoomcamp_api
- Data format: Paginated JSON (1,000 records per page)
- Pagination: Stop when an empty page is returned
Place the code in taxi_pipeline.py and name the pipeline taxi_pipeline.
Use @dlt rest api as a tutorial.
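The key behavior the prompt asks for is page-based fetching that stops as soon as the API returns an empty page. Here is a minimal, hedged sketch of that loop in plain Python (the "page" query parameter name is an assumption; the agent-generated code uses dlt's REST API source rather than this hand-rolled version):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

BASE_URL = (
    "https://us-central1-dlthub-analytics.cloudfunctions.net/"
    "data_engineering_zoomcamp_api"
)

def fetch_page(page):
    # Fetch one page of records; the "page" query parameter is an assumption.
    with urlopen(f"{BASE_URL}?{urlencode({'page': page})}") as resp:
        return json.load(resp)

def paginate(fetch=fetch_page, start=1):
    # Yield records page by page, stopping when a page comes back empty.
    page = start
    while True:
        records = fetch(page)
        if not records:  # an empty page signals the end of the data
            return
        yield from records
        page += 1
```

Passing the fetch function in as a parameter makes the stop-on-empty logic easy to verify without hitting the network.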
6. Run and debug
python taxi_pipeline.py
7. View the dlt Dashboard
dlt pipeline taxi_pipeline show
8. dlt MCP Server: Ask the agent questions about your pipeline
These were the main questions the assignment required, though many other questions can be asked.
After running the pipeline and analyzing the data, several interesting findings were obtained:
Dataset Range: The taxi trips spanned 2009-01-01 to 2009-01-31, with over 10,000 trips recorded.
Payment Methods: Approximately 26.66% of trips were paid with credit cards, highlighting a mix of cash and digital payments.
Tips Analysis: The total tips collected amounted to $6,063.41, with a few trips having exceptionally high tips.
Passenger Trends: On average, there were 2.09 passengers per trip, showing typical ride occupancy.
This project reinforced several important concepts in modern data engineering:
Building pipelines from scratch is greatly accelerated with AI agents and MCP servers.
DLT provides a structured framework for managing datasets and pipeline metadata, making analysis reproducible.
Choosing a lightweight analytical database like DuckDB allows fast, in-process querying without complex infrastructure.
Clear data architecture diagrams help communicate the flow and responsibilities of each component.
By building a DLT pipeline for NYC taxi data, I successfully integrated a custom API with DuckDB, explored and analyzed the data, and generated visual insights. This workflow demonstrates the power of automated data pipelines combined with AI-assisted development, providing a foundation for future data engineering projects.
Here’s my homework solution: https://github.com/stephandoh/zoomcamp_DE_DLT_2026
I followed along with this excellent free course; you can sign up here: https://github.com/DataTalksClub/data-engineering-zoomcamp/