I led and executed the migration of data workloads from the legacy environment to Databricks. I was responsible not only for the technical migration, but also for designing and implementing the new data architecture.
The project involved modernizing pipelines that had previously been built exclusively on Azure Data Factory and SQL Server: processing was migrated to Databricks jobs, while ADF was kept only for native integrations with systems such as ServiceNow and SAP HANA.
I redesigned the Data Lake architecture, organizing data into containers by purpose: structured data, unstructured data, documents, images, processed data, and data ready for processing. For structured data, I implemented a folder hierarchy organized by business area and project, partitioned by extraction date (year/month/day), which optimized incremental ingestion and processing.
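To make the convention concrete, here is a minimal sketch of the path layout for the structured-data container; the storage account, container, and folder names are placeholders rather than the actual environment values.

```python
from datetime import date

# Hypothetical account and container names, used only to illustrate the layout.
STORAGE_ACCOUNT = "datalakeaccount"
STRUCTURED_CONTAINER = "structured"

def landing_path(business_area: str, project: str, table: str, extraction_date: date) -> str:
    """Build the landing path, partitioned by extraction date (year/month/day)."""
    return (
        f"abfss://{STRUCTURED_CONTAINER}@{STORAGE_ACCOUNT}.dfs.core.windows.net/"
        f"{business_area}/{project}/{table}/"
        f"year={extraction_date:%Y}/month={extraction_date:%m}/day={extraction_date:%d}/"
    )

# Example: finance business area, contracts project, extracted on 2024-05-10.
print(landing_path("finance", "contracts", "invoices", date(2024, 5, 10)))
```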
I implemented a Lakehouse architecture with Bronze, Silver, and Gold layers. In Bronze, data is ingested with control by extraction date. In Silver, we perform deduplication by business key and Delta merges for standardization. In Gold, business rules are applied and analytical models (Star Schema and Snowflake) are delivered, ready for consumption.
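As an illustration of the Silver-layer pattern, the sketch below deduplicates by business key and upserts into a Delta table. It assumes a Databricks/PySpark environment where `spark` is predefined, and the table and column names (`bronze.contracts`, `contract_id`, `extraction_ts`) are hypothetical.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F, Window

# Read the raw Bronze data (hypothetical table name).
bronze_df = spark.read.table("bronze.contracts")

# Keep the latest record per business key, based on the extraction timestamp.
w = Window.partitionBy("contract_id").orderBy(F.col("extraction_ts").desc())
deduped_df = (
    bronze_df
    .withColumn("rn", F.row_number().over(w))
    .filter("rn = 1")
    .drop("rn")
)

# Delta merge into Silver: update existing keys, insert new ones.
silver = DeltaTable.forName(spark, "silver.contracts")
(
    silver.alias("t")
    .merge(deduped_df.alias("s"), "t.contract_id = s.contract_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)
```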
I was also responsible for developing the migration validation stage, creating automated data quality tests, including volume checks, primary key consistency, duplicate detection, and cross-validation between source and migrated tables, ensuring full data integrity.
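The snippet below sketches the kind of checks involved, assuming PySpark and hypothetical table names; the real validation suite covered many more tables and rules.

```python
from pyspark.sql import functions as F

# Illustrative validation; table and column names are placeholders.
source_df = spark.read.table("legacy_mirror.contracts")   # snapshot of the legacy source
target_df = spark.read.table("gold.contracts")            # migrated table

# 1. Volume check: row counts must match.
assert source_df.count() == target_df.count(), "Row count mismatch between source and target"

# 2. Primary key consistency: no null keys and no duplicates.
assert target_df.filter(F.col("contract_id").isNull()).count() == 0, "Null primary keys found"
dup_count = target_df.groupBy("contract_id").count().filter("count > 1").count()
assert dup_count == 0, f"{dup_count} duplicated primary keys found"

# 3. Cross-validation: rows present in the source but missing after migration.
missing = source_df.select("contract_id").subtract(target_df.select("contract_id"))
assert missing.count() == 0, f"{missing.count()} source rows missing in the migrated table"
```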
I also contributed to the evolution of the QA process and code governance, structuring the workflow around develop, QA, and main branches. Changes are tested in develop and promoted to production only after the QA workflow has run stably for at least seven days, increasing delivery reliability.
The solution was fully governed, with cataloging, group-based access control, and sensitive data encryption using AES-256, ensuring security and compliance in data consumption.
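As a sketch of the pattern rather than the exact implementation, the example below combines Spark's built-in `aes_encrypt` with a key retrieved from a Databricks secret scope and a group-based `GRANT`; the scope, key, table, column, and group names are all placeholders.

```python
# Placeholder secret scope, key, table, column, and group names.
encryption_key = dbutils.secrets.get(scope="data-platform", key="aes256-key")  # 32-byte key => AES-256

df = spark.read.table("silver.contracts")
encrypted_df = df.selectExpr(
    "contract_id",
    # For brevity the key is inlined as a SQL literal; in production it should be
    # handled through a secret-backed mechanism rather than embedded in query text.
    f"aes_encrypt(supplier_tax_id, '{encryption_key}') AS supplier_tax_id_enc",
)
encrypted_df.write.mode("overwrite").saveAsTable("gold.contracts_secure")

# Group-based access control on the curated table.
spark.sql("GRANT SELECT ON TABLE gold.contracts_secure TO `data-consumers`")
```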
I was responsible for designing and implementing the entire data pipeline that feeds the agent with both structured contract metadata and unstructured documents. The data is sourced from enterprise contract management systems, specifically SAP Ariba and Coupa, consumed via APIs. We ingest contract attributes such as contract name, value, supplier, dates, and status.
The pipeline continuously checks for new contract IDs. When new contracts are detected, their metadata is ingested and promoted through the Lakehouse layers up to the Gold layer. In parallel, the pipeline downloads the attached contract documents from Ariba and Coupa and stores them in cloud storage under an unstructured data container, since these are PDF and document files.
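A simplified sketch of that detection-and-download step is shown below; the API base URL, authentication, response fields, and storage path are hypothetical stand-ins for the actual Ariba/Coupa integrations.

```python
import requests
from pyspark.sql import Row

# Hypothetical endpoint, auth header, and response fields; the real APIs differ.
API_BASE = "https://api.example.com/contracts"
HEADERS = {"Authorization": "Bearer <token>"}

# Contract IDs already ingested into Bronze.
known_ids = {r.contract_id for r in spark.read.table("bronze.contracts").select("contract_id").collect()}

# Contract IDs currently exposed by the source system.
source_ids = {c["id"] for c in requests.get(API_BASE, headers=HEADERS, timeout=30).json()}

for contract_id in sorted(source_ids - known_ids):
    # Ingest the contract metadata into the Bronze layer.
    metadata = requests.get(f"{API_BASE}/{contract_id}", headers=HEADERS, timeout=30).json()
    spark.createDataFrame([Row(**metadata)]).write.mode("append").saveAsTable("bronze.contracts")

    # Download attached documents (PDFs etc.) into the unstructured-data container.
    for doc in metadata.get("attachments", []):
        content = requests.get(doc["url"], headers=HEADERS, timeout=60).content
        target = f"/Volumes/unstructured/contracts/{contract_id}/{doc['name']}"  # placeholder path
        with open(target, "wb") as fh:
            fh.write(content)
```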
To handle updates, the pipeline also performs daily checks for contracts with a last modified date greater than or equal to D-1. This allows us to capture both new and updated contracts. Updated records are merged into the Gold tables using Delta Lake merge logic.
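In PySpark terms, the daily update step looks roughly like this; table and column names are illustrative.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Pick up contracts modified on or after D-1 and merge them into Gold.
cutoff = F.date_sub(F.current_date(), 1)
changed_df = (
    spark.read.table("silver.contracts")
    .filter(F.col("last_modified_date") >= cutoff)
)

gold = DeltaTable.forName(spark, "gold.contracts")
(
    gold.alias("t")
    .merge(changed_df.alias("s"), "t.contract_id = s.contract_id")
    .whenMatchedUpdateAll()       # updated contracts
    .whenNotMatchedInsertAll()    # newly created contracts
    .execute()
)
```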
For contracts that were modified, the pipeline re-downloads the updated documents and flags the most recent document version in the Gold layer. In the Silver layer, I designed a Slowly Changing Dimension table that keeps the full version history of contract documents per contract, enabling traceability and auditability.
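The sketch below shows one way such a version-history (SCD Type 2) table can be maintained, assuming the incoming DataFrame contains only the documents re-downloaded in the current run; table and column names are illustrative.

```python
from delta.tables import DeltaTable
from pyspark.sql import functions as F

# Documents re-downloaded in this run (illustrative filter on ingestion date),
# stamped as the new current version.
new_versions_df = (
    spark.read.table("bronze.contract_documents")
    .filter(F.col("ingestion_date") == F.current_date())
    .withColumn("valid_from", F.current_timestamp())
    .withColumn("valid_to", F.lit(None).cast("timestamp"))
    .withColumn("is_current", F.lit(True))
)

history = DeltaTable.forName(spark, "silver.contract_document_history")

# Close out the previously current version for contracts that received a new document...
(
    history.alias("t")
    .merge(new_versions_df.alias("s"), "t.contract_id = s.contract_id AND t.is_current = true")
    .whenMatchedUpdate(set={
        "is_current": F.lit(False),
        "valid_to": F.current_timestamp(),
    })
    .execute()
)

# ...then append the new versions as the current records.
new_versions_df.write.mode("append").saveAsTable("silver.contract_document_history")
```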
These curated Gold tables and document references are consumed by the AI Contracts agent, which performs risk analysis over both the structured attributes and the contract content.
On top of that, I also developed a Power BI report that allows business users to visualize the AI agent’s risk analysis results and download the associated contract documents directly, closing the loop between data engineering, AI, and business consumption.
This project required careful handling of data freshness, versioning, and unstructured data management, and it clearly shows how I design pipelines to reliably serve AI-driven applications.