We already have an existing data platform that handles various batch and streaming workloads. This legacy framework relies on Perl for data transformation and uses SAP Sybase as the back-end data warehouse.
As part of the Index data team, I was responsible for modernizing this infrastructure. The modern stack is entirely cloud native, running multiple Docker containers managed by AWS Fargate, with the pipelines written in Python.
The data platform so designed is part of a Django-based web application deployed on AWS Elastic Beanstalk.
Snowflake is used as the data warehouse to handle the varying data formats. The resulting platform is fault-tolerant and highly scalable.
Data Lineage:
1. The incoming data from several sources is deposited into a centralized object store (AWS S3 bucket in this case).
2. The landed data is then pre-processed by a Python script that handles some basic standardization. The script is hosted on an AWS ECS Fargate cluster.
3. The resulting semi-clean data is transferred to a Snowflake instance using a storage integration with an external stage.
4. The data movement for steps 1-3 is handled by Apache Airflow DAGs triggered on cron schedules (see the DAG sketch after this list).
5. Once the data lands in the Snowflake warehouse, we perform one more round of transformation. The transformation logic is defined per business requirements and may differ for each vendor.
6. These transformations are carried out using dbt (data build tool) and involve building transformed SQL models that ultimately populate our consumption-level tables.
7. The consumption tables and related views are then shared with other teams.
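A minimal sketch of such a DAG is shown below, assuming the Amazon and Snowflake Airflow provider packages are installed; the cluster name, task definition, connection ID, stage, table, and dbt paths are illustrative placeholders rather than the actual production objects.

```python
# dags/vendor_ingest.py -- illustrative DAG; cluster, task definition,
# connection IDs and object names are placeholders, not the real ones.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="vendor_ingest",
    schedule_interval="0 2 * * *",   # cron expression: daily at 02:00 UTC
    start_date=datetime(2023, 1, 1),
    catchup=False,
) as dag:

    # Step 2: run the standardization script packaged as a Fargate task
    preprocess = EcsRunTaskOperator(
        task_id="preprocess_on_fargate",
        cluster="edp-preprocess-cluster",
        task_definition="vendor-standardize",
        launch_type="FARGATE",
        overrides={"containerOverrides": []},
        network_configuration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-xxxx"],
                "assignPublicIp": "ENABLED",
            }
        },
    )

    # Step 3: pull the semi-clean files from the external stage into Snowflake
    load_to_snowflake = SnowflakeOperator(
        task_id="copy_into_raw",
        snowflake_conn_id="snowflake_default",
        sql="COPY INTO raw.vendor_data FROM @raw_s3_stage "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);",
    )

    # Steps 5-6: run the dbt models that build the consumption tables
    run_dbt = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    preprocess >> load_to_snowflake >> run_dbt
```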
The EDP 3.0 architecture (Proposed)
As part of the Software Engineering team on the project, I was tasked with handling the data migrations between the legacy system and the newly created cloud-based data platform.
I created several workflows for different vendors leveraging the compute resources of AWS lambda and Snowflake.
Snowflake was the platform employed for the ETL jobs, using a combination of streams and tasks, among other Snowflake objects.
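As an illustration of the streams-and-tasks pattern, here is a minimal sketch issued through the Snowflake Python connector; the table, stream, task and warehouse names, the schedule, and the merge logic are assumptions rather than the real production objects.

```python
# Illustrative stream + task setup via the Snowflake Python connector.
# Database, schema, table and warehouse names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="EDP", schema="RAW",
)

statements = [
    # Capture change records on the landing table
    "CREATE STREAM IF NOT EXISTS raw_vendor_stream ON TABLE raw_vendor_data",

    # A task that wakes up on a schedule, checks the stream for new rows,
    # and appends them to the curated table
    """
    CREATE OR REPLACE TASK merge_vendor_task
      WAREHOUSE = ETL_WH
      SCHEDULE = '15 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('RAW_VENDOR_STREAM')
    AS
      INSERT INTO curated.vendor_data
      SELECT * FROM raw_vendor_stream WHERE METADATA$ACTION = 'INSERT'
    """,

    # Tasks are created suspended; resume to start the schedule
    "ALTER TASK merge_vendor_task RESUME",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
```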
The entire data pipeline was orchestrated using Apache Airflow.
Apart from developing the new pipelines, I was involved in maintaining legacy pipelines so as not to affect the day-to-day operations.
The client is a leading supply chain management solution provider. The sources include inventory data from the various operational warehouses. This data arrives in a variety of formats but is always structured in nature.
As the Senior Lead Engineer, I had to enable the team to move towards a more robust and cost-effective data solution.
Since the existing infrastructure was already based on MS Azure, integrating the Snowflake ETL framework was the major task.
The incoming data was structured and landed in ADLS storage; each landing triggered an event that was relayed through Event Hub.
Once the data landed in Snowflake, I was in charge of building the ETL using Snowflake stored procedures and tasks, as sketched below.
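A hedged sketch of the stored-procedure-plus-task pattern follows; the procedure body, object names, and schedule are illustrative only and stand in for the actual business logic.

```python
# Illustrative ETL stored procedure plus a task that schedules it.
# Object names, the merge logic and the cron schedule are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="SCM", schema="CURATED",
)

create_proc = """
CREATE OR REPLACE PROCEDURE transform_inventory()
RETURNS STRING
LANGUAGE SQL
AS
$$
BEGIN
  -- Vendor-specific manipulation; a simple upsert stands in for the real logic
  MERGE INTO curated.inventory AS tgt
  USING staging.inventory AS src
    ON tgt.sku = src.sku AND tgt.warehouse_id = src.warehouse_id
  WHEN MATCHED THEN UPDATE SET tgt.qty = src.qty, tgt.updated_at = CURRENT_TIMESTAMP()
  WHEN NOT MATCHED THEN INSERT (sku, warehouse_id, qty, updated_at)
    VALUES (src.sku, src.warehouse_id, src.qty, CURRENT_TIMESTAMP());
  RETURN 'ok';
END;
$$
"""

create_task = """
CREATE OR REPLACE TASK transform_inventory_task
  WAREHOUSE = ETL_WH
  SCHEDULE = 'USING CRON 0 */2 * * * UTC'
AS
  CALL transform_inventory()
"""

with conn.cursor() as cur:
    cur.execute(create_proc)
    cur.execute(create_task)
    cur.execute("ALTER TASK transform_inventory_task RESUME")
conn.close()
```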
Data Lineage:
1. The data from several sources, such as APIs or OLTP systems, is transferred to a data lake store (ADLS Gen2 here).
2. The storage is connected to an Event Hub, which captures all the landing events and associated metadata.
3. The datasets are fetched one by one, and basic transformations such as date and currency formatting are applied. Because the logic may differ from dataset to dataset, we usually invoke a serverless function selected through an associated waterfall logic.
4. The transformed data is placed back into ADLS storage and pulled into Snowflake using an Azure external stage object and a storage integration (see the sketch after this list).
5. Snowflake stored procedures are used within the warehouse to handle further data manipulation. They are scheduled using Snowflake tasks.
6. The transformed and consumption-ready data is finally pushed to our curated tables and further sent to downstream systems.
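Below is a minimal sketch of the storage integration and external stage referenced in step 4, issued through the Snowflake Python connector; the storage account, container, tenant ID, and target table are placeholders, not the production configuration.

```python
# Illustrative storage integration + external stage + load for step 4.
# The storage account, container, tenant ID and target table are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="SCM", schema="RAW",
)

statements = [
    # One-time setup: let Snowflake authenticate against the ADLS container
    """
    CREATE STORAGE INTEGRATION IF NOT EXISTS adls_int
      TYPE = EXTERNAL_STAGE
      STORAGE_PROVIDER = 'AZURE'
      ENABLED = TRUE
      AZURE_TENANT_ID = '<tenant-id>'
      STORAGE_ALLOWED_LOCATIONS = ('azure://mystorageacct.blob.core.windows.net/cleansed/')
    """,
    # External stage pointing at the cleansed container
    """
    CREATE STAGE IF NOT EXISTS cleansed_stage
      URL = 'azure://mystorageacct.blob.core.windows.net/cleansed/'
      STORAGE_INTEGRATION = adls_int
      FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)
    """,
    # Pull the transformed files into the raw landing table
    "COPY INTO raw.inventory FROM @cleansed_stage PATTERN = '.*inventory.*[.]csv'",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
```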
The client is a leading game provider responsible for titles such as GTA V and Red Dead Redemption, among others. As a growing entity, the company is looking to track its revenue generation across geographies. The input data was segregated into daily, weekly, monthly, and accrual revenues.
My designation was Specialist Data Engineer and I was in charge of creating pipelines specifically for the Asia Pacific Region.
The data wrangling and manipulation scripts were written in Python and hosted in AWS ECS Fargate containers.
The cleansed data was subsequently pushed to another curated bucket.
After the quality of the incoming data was verified with Great Expectations, it was loaded into the Snowflake data warehouse using the Python connector, as sketched below.
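A minimal sketch of this quality gate and load, assuming the legacy Pandas API of Great Expectations and the connector's write_pandas helper; the file path, column names, expectations, and connection details are illustrative.

```python
# Illustrative quality gate and load; the file path, expectations, columns
# and connection details are placeholders for the real vendor checks.
import great_expectations as ge          # legacy Pandas validation API assumed
import pandas as pd
import snowflake.connector
from snowflake.connector.pandas_tools import write_pandas

df = pd.read_parquet("s3://curated-bucket/apac/daily_revenue.parquet")

# Wrap the frame and run a couple of basic expectations
gdf = ge.from_pandas(df)
gdf.expect_column_values_to_not_be_null("transaction_id")
gdf.expect_column_values_to_be_between("revenue_usd", min_value=0)
result = gdf.validate()

if not result.success:
    raise ValueError("Data quality checks failed; aborting the Snowflake load")

# Push the validated frame into Snowflake via the connector's pandas helper
conn = snowflake.connector.connect(
    account="my_account", user="etl_user", password="***",
    warehouse="ETL_WH", database="REVENUE", schema="APAC",
)
write_pandas(conn, df, table_name="DAILY_REVENUE", auto_create_table=True)
conn.close()
```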
There was no further processing of the data from the data engineering standpoint. The ingested data was converted into materialized views and passed over to the visualization team for making interactive dashboards.
Data Lineage:
1. We collect the vendor-specific data into a centralized object store (AWS S3 bucket in this case).
2. The landed data is then pre-processed by a Python script that handles some basic standardization. The script is hosted on an AWS ECS Fargate cluster (a sketch of this step appears after the list).
3. The resulting partially cleansed data is transferred to a separate cleansed AWS S3 bucket.
4. The data flow for the first half of the pipeline is orchestrated by AWS Managed Airflow.
5. The data then lands in Snowflake tables, and the subsequent ETL processes are carried out on Snowflake. We extract the semi-cleaned data, clean it further, and enhance it as per vendor-specific requests. All ETL operations are carried out using SnowSQL.
6. The prepared data is finally sent to downstream systems for analytics.
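A hedged sketch of the standardization job from steps 2 and 3, as it might run inside the Fargate container; the bucket names, prefixes, and column rules are assumptions rather than the actual vendor logic.

```python
# Illustrative standardization job run as a Fargate container (steps 2-3).
# Bucket names, prefixes and column rules are placeholders.
import io
import os

import boto3
import pandas as pd

RAW_BUCKET = os.environ.get("RAW_BUCKET", "vendor-raw-landing")
CLEAN_BUCKET = os.environ.get("CLEAN_BUCKET", "vendor-cleansed")


def standardize(df: pd.DataFrame) -> pd.DataFrame:
    # Normalize column names and basic types before handing off to Snowflake
    df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
    if "report_date" in df.columns:
        df["report_date"] = pd.to_datetime(df["report_date"], errors="coerce")
    return df.drop_duplicates()


def main() -> None:
    s3 = boto3.client("s3")
    listing = s3.list_objects_v2(Bucket=RAW_BUCKET, Prefix="apac/")
    for obj in listing.get("Contents", []):
        body = s3.get_object(Bucket=RAW_BUCKET, Key=obj["Key"])["Body"].read()
        cleaned = standardize(pd.read_csv(io.BytesIO(body)))
        out = io.BytesIO()
        cleaned.to_parquet(out, index=False)
        s3.put_object(
            Bucket=CLEAN_BUCKET,
            Key=obj["Key"].replace(".csv", ".parquet"),
            Body=out.getvalue(),
        )


if __name__ == "__main__":
    main()
```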
The incoming data was segregated according to the providing vendor. The raw data from several sources was collected in a storage location.
The data was then sent to an ECS Fargate cluster running transformation logic written in Python.
For specific vendors, we have additional transformation logic in AWS Lambda serverless functions, as sketched below.
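A minimal sketch of such a vendor-specific Lambda handler, assuming an S3 put-event trigger and a pandas layer available to the function; the bucket name and the transform itself are placeholders.

```python
# Illustrative Lambda handler for vendor-specific transformation logic.
# Assumes an S3 put-event trigger; the transform itself is a placeholder.
import io
import urllib.parse

import boto3
import pandas as pd

s3 = boto3.client("s3")
CLEAN_BUCKET = "vendor-cleansed"


def vendor_specific_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Placeholder for the per-vendor rules applied in this path
    return df.rename(columns=str.lower)


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        df = vendor_specific_transform(pd.read_csv(io.BytesIO(body)))
        out = io.StringIO()
        df.to_csv(out, index=False)
        s3.put_object(Bucket=CLEAN_BUCKET, Key=key, Body=out.getvalue())
    return {"status": "ok", "records": len(event["Records"])}
```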
Once we have the cleansed data, we ingest the same to Snowflake Data Platform.
We perform further ETL on the data and share the tables with downstream systems for further processing.
The entire workflow, from receiving the raw data to landing it in Snowflake, is orchestrated using AWS Managed Airflow.