Projects

Tegisty Hailay Degef

Highlights

Scalable data pipeline and warehouse with Airflow, DBT, and Postgres


A project to create a scalable data warehouse that hosts vehicle trajectory data extracted by analyzing footage taken by swarm drones and static roadside cameras. An ELT framework orchestrated with Airflow was used to set up the transformation workflows for this objective: Airflow schedules the tasks, and DBT performs the transformations and builds the data models used for analytics.

Tools:

  • Orchestration: We need to orchestrate the execution of our pipelines to ensure that the data is available as soon as possible and that the data lifecycle runs smoothly from one component to the next.

  • Data monitoring: To be able to trust our data we’d need to monitor it and make sure that we’re generating accurate insights based on it.

  • Data visualization: This is where we actually get to explore the data and generate value from it in the form of different data products, like dashboards and reports.

  • Metadata management: Most of the features of our platform (like data discovery and data governance) rely on metadata, and so we need to ensure that the metadata is centralized and leveraged throughout the platform.

  • Airflow: In Airflow, data pipelines are defined in Python code as directed acyclic graphs, also known as DAGs.

  • DAG: In Airflow, a DAG is a collection of all the tasks you want to run, organized in a way that reflects their relationships and dependencies.

  • DBT (Data Build Tool): In this project's pipeline, DBT applies a simple transformation to the ingested data using a SQL query, triggered by Airflow (see the sketch after this list).

  • Postgres: PostgreSQL is used as a primary database for many web applications as well as mobile and analytics applications. We have used it to store the raw and transformed data.

  • Redash: Redash is an open source web application used to explore, query, visualize, and share data from our data sources.
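To illustrate how Airflow and DBT fit together in this pipeline, here is a minimal DAG sketch. The DAG id, schedule, loading script, and dbt project directory are hypothetical placeholders, not the project's actual configuration.

```python
# Minimal Airflow DAG sketch: schedule a dbt transformation after raw data lands.
# The DAG id, script path, and dbt project path are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="vehicle_trajectory_elt",   # hypothetical DAG name
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",        # run the pipeline once a day
    catchup=False,
) as dag:
    # Load raw trajectory exports into the Postgres "raw" schema (placeholder command).
    load_raw = BashOperator(
        task_id="load_raw_data",
        bash_command="python /opt/pipeline/load_raw.py",
    )

    # Run the dbt models that build the analytics tables on top of the raw schema.
    dbt_run = BashOperator(
        task_id="dbt_transform",
        bash_command="dbt run --project-dir /opt/dbt/vehicle_warehouse",
    )

    load_raw >> dbt_run  # dbt only runs after the raw load succeeds
```

Chaining the tasks this way means the transformed models are only rebuilt once the raw load has succeeded, which keeps the raw and analytics layers of the warehouse consistent.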

Methodologies

For this project, the ELT (Extract, Load, Transform) technique was applied to build the data warehouse (DW). In ELT, data is first extracted from its source locations and then loaded into the target data warehouse, where it is transformed into actionable business intelligence. The process consists of three steps:

Extract: Raw streams of data from virtual infrastructure, software, and applications are ingested either in their entirety or according to predefined rules.

Load: Rather than loading this mass of raw data onto an interim processing server for transformation, ELT delivers it directly to the target storage location.

Transform: The data warehouse sorts and normalizes the data, keeping part or all of it on hand and accessible for customized reporting.
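To make the Extract and Load steps concrete, the sketch below copies raw trajectory records into a Postgres "raw" schema without any cleaning; the file path, connection string, and table name are hypothetical placeholders.

```python
# Sketch of the Extract and Load steps: raw trajectory rows are copied into
# Postgres untouched, so they can be transformed later inside the warehouse.
# File path, connection string, and table name are hypothetical placeholders.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/warehouse")

# Extract: read the raw drone/roadside-camera trajectory export as-is.
raw = pd.read_csv("data/trajectories.csv")

# Load: append the untransformed rows into the "raw" schema; no cleaning happens here.
raw.to_sql("trajectories", engine, schema="raw", if_exists="append", index=False)
```

The Transform step then happens inside the warehouse itself, where DBT's SQL models build the analytics tables on top of the raw schema.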


Text-to-Speech data collection with Kafka, Airflow, and Spark

A project to produce a tool that can be deployed to post and receive text and audio files to and from a data lake, apply transformations in a distributed manner, and load the results into a warehouse in a format suitable for training a speech-to-text model.

Technologies and tools

  • ReactJS: used for front-end development.

  • Django and Python: used to develop the back end of this project.

  • Apache Kafka: Kafka is used to store the unprocessed audio and text submissions in a fault-tolerant, durable way, chosen for its high throughput, high scalability, low latency, permanent storage, and high availability.

  • Apache Airflow: Apache Airflow is an open-source tool to programmatically author, schedule, and monitor workflows. Airflow uses directed acyclic graphs (DAGs) to control workflow orchestration. In this project, it schedules the Spark jobs as well as the Kafka cluster jobs.

  • Apache Spark: Apache Spark is a fast, flexible, and developer-friendly platform for large-scale SQL, batch processing, and stream processing. In our case, Spark is used with the Airflow scheduler to load, process, and transform streams of audio data (a minimal sketch follows this list).
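The following sketch shows how the Spark side of this pipeline could consume audio submissions from Kafka and land them in the data lake. The topic name, bootstrap server, message schema, and output paths are assumptions for illustration, not the project's actual configuration.

```python
# Sketch of a Spark Structured Streaming job that reads submitted audio records
# from Kafka and writes them to the data lake. Topic, server, schema, and output
# paths are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StringType, StructField, StructType

spark = SparkSession.builder.appName("audio_ingestion").getOrCreate()

# Each Kafka message is assumed to be a JSON record describing one audio clip.
schema = StructType([
    StructField("clip_id", StringType()),
    StructField("text_prompt", StringType()),
    StructField("audio_path", StringType()),
])

stream = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "audio-submissions")
    .load()
    # Kafka values arrive as bytes; decode and parse the JSON payload.
    .select(from_json(col("value").cast("string"), schema).alias("record"))
    .select("record.*")
)

# Append the parsed records to the data lake in Parquet for later transformation.
query = (
    stream.writeStream
    .format("parquet")
    .option("path", "s3a://data-lake/raw/audio")
    .option("checkpointLocation", "s3a://data-lake/checkpoints/audio")
    .start()
)
query.awaitTermination()
```

Structured Streaming provides the distributed, fault-tolerant processing described above while keeping the transformation logic in plain PySpark.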

Approach:

To develop this project, we applied the ETL (Extract, Transform, and Load) approach. ETL is a data integration process that combines data from multiple sources into a single, consistent data store, which is loaded into a data warehouse or other target system.

  • During data extraction, raw data is copied or exported from source locations to a staging area. Data management teams can extract data from a variety of sources, which can be structured or unstructured. In the staging area, the raw data is then transformed and consolidated for its intended analytical use case. During the load phase, the transformed data is moved from the staging area into the target data warehouse. Typically, this involves an initial load of all data, followed by periodic loading of incremental data changes and, less often, full refreshes that erase and replace the data in the warehouse (a minimal sketch of the incremental load step follows).
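As a small illustration of the periodic incremental loading described above, the sketch below upserts one batch of transformed records into a warehouse table. The table, columns, and connection string are hypothetical, and clip_id is assumed to be the table's primary key.

```python
# Sketch of the Load phase with incremental updates: new or changed records are
# upserted into the warehouse table, so periodic runs only apply the delta
# rather than reloading everything. Table and column names are hypothetical.
import psycopg2

UPSERT_SQL = """
    INSERT INTO speech_corpus (clip_id, text_prompt, audio_path, duration_sec)
    VALUES (%s, %s, %s, %s)
    ON CONFLICT (clip_id) DO UPDATE
    SET text_prompt  = EXCLUDED.text_prompt,
        audio_path   = EXCLUDED.audio_path,
        duration_sec = EXCLUDED.duration_sec;
"""

def load_increment(rows):
    """Apply one batch of transformed records to the warehouse."""
    with psycopg2.connect("dbname=warehouse user=etl") as conn:
        with conn.cursor() as cur:
            cur.executemany(UPSERT_SQL, rows)
    # The connection context manager commits on success and rolls back on error.

# Example batch produced by the transformation step (values are illustrative).
load_increment([
    ("clip-001", "hello world", "s3://data-lake/raw/audio/clip-001.wav", 2.4),
])
```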



Computer Vision for Creative Optimization: KPI maximization through image analysis

The main task of this project is to apply deep learning based computer vision techniques for creative optimization in mobile advertising. Different deep learning based image analysis (feature extraction and segmentation) techniques were used and a random forest regression algorithm was applied to predict the KPI performance.

Medium

Technologies and tools

  • OpenCV: a software toolkit for processing real-time images and video, with analytics and machine learning capabilities.

  • Pytesseract: Pytesseract, or Python-tesseract, is an Optical Character Recognition (OCR) tool for Python. It reads and recognizes text in images, license plates, etc.

  • Extcolors: a command-line tool to extract colors from an image. The result is presented in two formats: text and image.

  • SSD: The Single Shot Detector (SSD) is a deep learning model for object detection in images.

  • Selenium: Selenium WebDriver is a browser automation framework that permits you to execute cross-browser tests.

  • Deepface: a lightweight face recognition and facial attribute analysis (age, gender, emotion, and race) framework for Python.

  • Random forest regression: an ensemble learning technique that performs both classification and regression using an ensemble of decision trees. In our case, it is used to predict the KPI (a minimal sketch follows this list).
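The sketch below shows how the image features extracted with the tools above could be combined to train the random forest regressor and inspect its score and feature importances. The feature names, KPI column, and CSV path are hypothetical placeholders rather than the project's actual feature set.

```python
# Sketch of the regression step: image-derived features (text area, colour counts,
# face/emotion flags, ...) are combined into a feature table and used to predict
# the KPI with a random forest. Column names and the CSV path are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# One row per creative: extracted image features plus the observed KPI.
data = pd.read_csv("data/creative_features.csv")
features = ["text_area_ratio", "num_colors", "has_face", "has_cta_button", "logo_area_ratio"]
X, y = data[features], data["kpi"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Model score on held-out creatives.
print("R^2 on test set:", r2_score(y_test, model.predict(X_test)))

# Feature importances show which visual elements drive the predicted KPI.
for name, importance in sorted(zip(features, model.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {importance:.3f}")
```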

Model Score and Feature Importance