This diagram shows a modern data platform in which AWS serves as the data lake and Snowflake as the data warehouse, with dbt and Airflow handling transformation and orchestration. The platform covers:
Ingesting data from multiple sources
Storing data in an AWS Lakehouse (S3 + Iceberg)
Governing data using AWS Glue & Lake Formation
Transforming and modeling data using dbt
Orchestrating pipelines using Apache Airflow
Serving analytics through Snowflake and BI tools
This is the Lakehouse + ELT + Analytics Warehouse pattern.
Source systems: Operational files and databases send data into the platform via PySpark jobs or Snowpipe-style ingestion.
AWS S3 + Iceberg tables: Raw data lands in S3, then is organized into layers (staging → raw → integrated → access) stored as Iceberg tables for schema evolution and ACID capabilities.
Metadata & governance (Glue + Lake Formation): AWS Glue Catalog stores table metadata, and Lake Formation controls security, permissions, and data access policies across the lake.
Snowflake (warehousing & analytics): Curated data from S3/Iceberg is loaded into Snowflake, where business-friendly models are built using dbt SQL models.
dbt + Airflow: dbt implements transformations, tests, and documentation; Apache Airflow DAGs schedule and orchestrate both Snowflake/dbt jobs and any legacy Datastage/Autosys/scripts.
BI & applications: Tools like Tableau, Power BI, SAS, and other apps consume data from Snowflake for reporting and analytics.
Source Systems & Ingestion

Components shown
Files
Databases
PySpark / Snowpipe
What happens here
These are operational systems such as:
RDS (Oracle, PostgreSQL, SQL Server)
MongoDB
Flat files (CSV, JSON, Parquet)
Data is ingested either via:
Batch (PySpark, Glue jobs)
Near‑real‑time (Snowpipe, streaming jobs)
✅ Goal: Extract raw data from business systems
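As a concrete illustration of the batch path, here is a sketch of how an ingestion job might derive the staging location for a landed file. The bucket layout, source names, and `dt=` partition convention are illustrative assumptions, not something the diagram prescribes:

```python
from datetime import date

# Hypothetical key convention: layer/source/table/dt=YYYY-MM-DD/file.
# A real pipeline would agree this layout with the Glue partitioning scheme.
def landing_key(source: str, table: str, run_date: date, fmt: str = "parquet") -> str:
    """Build a date-partitioned object key for the staging layer."""
    return f"staging/{source}/{table}/dt={run_date.isoformat()}/part-000.{fmt}"
```

A PySpark or Glue batch job would write its output under such a key, while a Snowpipe pipe would watch the same prefix for newly arrived files.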
This is the core data foundation.
1. Storage Layer – Amazon S3 + Iceberg Tables
Layers shown
Staging Layer
Raw Layer
Integrated Layer
Access Layer
Explanation
S3 is the underlying object storage.
Apache Iceberg tables provide:
Transactional consistency
Schema evolution
Time travel
Data follows a multi‑layered structure:
Layer – Purpose
Staging – Temporary landing area
Raw – Source‑aligned, minimal transformation
Integrated – Business‑ready, conformed data
Access – Optimized for analytics & consumption
✅ This enforces data quality, traceability, and auditability.
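The staging → raw → integrated → access progression can be sketched as an ordered promotion path. This is a toy model for illustration; the actual promotion is performed by Glue and dbt jobs:

```python
# The four lake layers, in promotion order.
LAYERS = ["staging", "raw", "integrated", "access"]

def next_layer(current: str):
    """Return the layer a dataset is promoted into next, or None at the end.

    Skipping layers is impossible by construction, which is one way the
    layered design enforces traceability.
    """
    i = LAYERS.index(current)
    return LAYERS[i + 1] if i + 1 < len(LAYERS) else None
```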
2. Metadata & Governance
Components
AWS Glue Catalog
AWS Lake Formation
Roles
Glue Catalog: central metadata repository; table schemas and partitions
Lake Formation: fine‑grained access control (column and row level), centralized governance, secure cross‑account access
✅ Ensures security, compliance, and discoverability
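Lake Formation's column-level permissions can be pictured as a small policy check. The roles, columns, and in-memory policy store below are invented for illustration; the real service evaluates grants centrally:

```python
# Hypothetical policy table: role -> set of columns it may read.
POLICIES = {
    "analyst": {"order_id", "order_date", "amount"},
    "support": {"order_id", "order_date"},
}

def allowed_columns(role, requested):
    """Return the subset of requested columns the role is permitted to read."""
    return [c for c in requested if c in POLICIES.get(role, set())]
```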
3. Snowflake – Analytics Warehouse

Role
Serves as the analytics warehouse
Receives curated data from the S3/Iceberg layers
Optimized for:
BI dashboards
Ad‑hoc analysis
High concurrency
Key feature
Snowflake separates storage and compute
Works well with ELT workflows
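In an ELT flow, loading curated files from S3 into Snowflake typically comes down to a COPY INTO statement against an external stage. A sketch that renders one (the table and stage names are placeholders, and it assumes a stage over the curated S3 prefix already exists):

```python
def copy_into_sql(table: str, stage_path: str, file_format: str = "PARQUET") -> str:
    """Render a Snowflake COPY INTO statement for loading staged S3 files."""
    return (
        f"COPY INTO {table} "
        f"FROM @{stage_path} "
        f"FILE_FORMAT = (TYPE = {file_format})"
    )
```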
4. dbt – Transformation Layer

What dbt does
SQL‑based transformations inside Snowflake
Creates:
Facts and dimensions
Business metrics
Semantic models
Benefits
Version control (Git)
Documentation
Testing (not null, uniqueness)
Modular transformations
✅ dbt pushes transformation logic closer to analysts.
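dbt's generic not_null and unique tests reduce to simple set checks. A pure-Python sketch of that logic (dbt itself compiles these tests to SQL that runs inside Snowflake):

```python
def failing_not_null(rows, column):
    """Rows that would fail a dbt-style not_null test on `column`."""
    return [r for r in rows if r.get(column) is None]

def duplicate_values(rows, column):
    """Values that would fail a dbt-style unique test on `column`."""
    seen, dupes = set(), set()
    for r in rows:
        v = r.get(column)
        if v in seen:
            dupes.add(v)
        seen.add(v)
    return sorted(dupes)
```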
5. Apache Airflow – Orchestration

Where Airflow is used
Orchestrates:
Glue jobs
dbt runs
Snowflake loads
Manages dependencies and retries
Two usages shown
Main orchestration (dbt, warehouse loads)
Datastage / Autosys / scripts migration to Airflow DAGs
✅ Airflow replaces legacy schedulers with cloud‑native orchestration.
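What an Airflow DAG declares — task dependencies plus a retry policy — can be modeled in a few lines. This is a toy scheduler for illustration, not Airflow's API:

```python
def run_pipeline(tasks, deps, max_retries=2):
    """Run zero-arg callables in dependency order with simple retries.

    tasks: name -> callable; deps: name -> list of upstream task names.
    No cycle detection here -- a real orchestrator handles that.
    """
    done, order = set(), []

    def run(name):
        if name in done:
            return
        for upstream in deps.get(name, []):  # run upstreams first
            run(upstream)
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                break
            except Exception:
                if attempt == max_retries:
                    raise
        done.add(name)
        order.append(name)

    for name in tasks:
        run(name)
    return order
```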
6. BI Tools & Applications

Examples
Power BI
Tableau
SAS
Purpose
Consume curated Snowflake datasets
Provide dashboards and reports to business users
End‑to‑end data flow
1. Data is extracted from source systems
2. Data lands in S3 (Staging → Raw)
3. Metadata is registered in the Glue Catalog
4. Security is enforced via Lake Formation
5. Data is refined into the Integrated and Access layers
6. Curated data is loaded into Snowflake (or queried in place via external tables)
7. dbt transforms the data into analytics models
8. Airflow orchestrates all of the above
9. BI tools consume Snowflake data
Challenges & risks

Multiple catalogs (Glue vs Snowflake)
Duplicate definitions of schemas and logic
Governance consistency across lake and warehouse
⚠ Risk: Metadata drift and access misalignment.
Data exists in two places:
S3 (Iceberg)
Snowflake storage
Synchronization delays can arise between lake and warehouse
⚠ Risk: Users querying stale or inconsistent data.
Snowflake compute costs
Glue job execution costs
S3 storage growth over time
⚠ Risk: Runaway costs without observability.
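A back-of-the-envelope model makes the "runaway costs" risk concrete. The $3-per-credit figure below is only a placeholder; actual Snowflake pricing depends on edition and region:

```python
def monthly_warehouse_cost(credits_per_hour: float, hours_per_day: float,
                           price_per_credit: float = 3.0, days: int = 30) -> float:
    """Rough monthly compute cost for one Snowflake warehouse.

    price_per_credit is a placeholder assumption, not a quoted rate.
    """
    return credits_per_hour * hours_per_day * days * price_per_credit
```

An aggressive auto-suspend policy shrinks hours_per_day directly, which is why it is a standard cost lever.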
Multiple tools:
Glue, Lake Formation, Airflow, dbt, Snowflake
Dependency failures across tools
Monitoring and alerting gaps
⚠ Risk: Slower incident resolution.
Datastage / Autosys logic is hard to translate into Airflow DAGs
Lack of standardized DAG patterns
⚠ Risk: Longer migration timelines and fragile pipelines.
Recommendations

Align Glue Catalog and Snowflake object naming
Use dbt docs as the business metadata layer
Enforce data contracts between lake and warehouse
✅ Result: Clear ownership and reduced confusion.
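One way to check Glue/Snowflake naming alignment in CI is a normalized diff of table names. The normalization rule below (lowercase, hyphens to underscores) is an assumption; each team should codify its own:

```python
def normalize(name: str) -> str:
    """Canonical form used to compare names across catalogs (assumed rule)."""
    return name.strip().lower().replace("-", "_")

def misaligned(glue_tables, snowflake_tables):
    """Tables whose normalized names exist in one catalog but not the other."""
    g = {normalize(t) for t in glue_tables}
    s = {normalize(t) for t in snowflake_tables}
    return sorted(g ^ s)  # symmetric difference: present on one side only
```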
Prefer Iceberg external tables where possible
Use separate Snowflake warehouses:
ELT
BI
Auto‑suspend aggressively
✅ Result: Lower cost and better performance.
Enforce consistent rules per layer:
Raw = immutable
Integrated = business logic
Use CI/CD validation for schema changes
✅ Result: Predictable, reliable pipelines.
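The CI/CD schema validation can be as simple as diffing the old and new column→type maps and failing the build on breaking changes. A minimal sketch; a real check would also cover nullability, ordering, and partition columns:

```python
def breaking_changes(old_schema, new_schema):
    """Detect backwards-incompatible changes between two column->type maps.

    Dropped columns and type changes break downstream consumers;
    added columns are treated as safe.
    """
    issues = []
    for col, typ in old_schema.items():
        if col not in new_schema:
            issues.append(f"dropped column: {col}")
        elif new_schema[col] != typ:
            issues.append(f"type change: {col} {typ} -> {new_schema[col]}")
    return issues
```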
Use reusable DAG templates
Separate orchestration from transformation logic
Centralized logging + alerting (CloudWatch / Datadog)
✅ Result: Easier operations and faster recovery.
dbt tests for freshness, nulls, volume checks
SLA monitoring in Airflow
Anomaly detection on row counts
✅ Result: Faster issue detection before business impact.
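The freshness and volume checks above reduce to small predicates. A sketch under assumed SLAs (24-hour freshness, ±50% volume tolerance — both thresholds are examples, not recommendations):

```python
from datetime import datetime, timedelta

def freshness_ok(last_loaded: datetime, now: datetime, max_age_hours: int = 24) -> bool:
    """dbt-source-freshness-style check: data must be newer than the SLA."""
    return now - last_loaded <= timedelta(hours=max_age_hours)

def volume_anomaly(todays_rows: int, baseline_rows: int, tolerance: float = 0.5) -> bool:
    """Flag a run whose row count deviates from baseline by more than tolerance."""
    if baseline_rows == 0:
        return todays_rows != 0
    return abs(todays_rows - baseline_rows) / baseline_rows > tolerance
```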