Data Engineering & Machine Learning
Here are some of my Data Engineering and Machine Learning projects, showcasing practical experience in ML modeling, data analytics, pipeline development, programming, database management, deployment strategies, and report writing. These projects highlight my ability to conceptualize, implement, and deliver robust solutions across the data lifecycle, from data acquisition and preprocessing to model training, deployment, and generating actionable insights.
This project leverages advanced data analytics and machine learning techniques to deliver accurate sales forecasting and store-level promotion analysis for Rossmann Pharmaceuticals' retail network. By integrating historical sales data, store characteristics, promotional activities, competitor information, and external factors, this project provides actionable insights for strategic decision-making and resource optimization.
The solution includes the following key components:
Data Integration and Preprocessing: Collecting and cleaning data from various sources to create a comprehensive dataset that captures key sales drivers, such as store-specific promotions, competition, and holidays.
Exploratory Data Analysis: Identifying patterns in customer behavior, the impact of store-level promotional activities, and seasonal trends.
Feature Engineering: Developing predictive features to represent relationships between store promotions, sales trends, and external influences.
Sales Forecasting Model: Building and deploying machine learning models to predict sales six weeks in advance, enabling better inventory management and marketing planning.
Promotion Impact Analysis: Evaluating the effectiveness of store-level promotional strategies, including short-term discounts and long-term campaigns, to optimize promotional efforts and maximize sales impact across the store network.
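As a rough illustration of the promotion impact analysis described above, the sketch below compares average sales on promotion days against non-promotion days for each store. It uses only the Python standard library and is not the project's actual implementation; the function name `promo_uplift`, the `(store, promo, sales)` row layout, and the sample figures are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import fmean


def promo_uplift(rows):
    """Return {store: relative uplift of mean sales on promo vs. non-promo days}.

    `rows` is an iterable of (store_id, promo_flag, sales) tuples.
    This row layout is hypothetical, not the project's real schema.
    """
    by_store = defaultdict(lambda: {True: [], False: []})
    for store, promo, sales in rows:
        by_store[store][bool(promo)].append(sales)

    uplift = {}
    for store, groups in by_store.items():
        # Skip stores that lack either promo or non-promo observations.
        if not groups[True] or not groups[False]:
            continue
        baseline = fmean(groups[False])
        if baseline > 0:
            uplift[store] = fmean(groups[True]) / baseline - 1.0
    return uplift


# Made-up sample data: (store_id, promo_flag, daily_sales)
rows = [
    (1, 0, 100.0), (1, 0, 120.0), (1, 1, 165.0),
    (2, 0, 200.0), (2, 1, 210.0), (2, 1, 230.0),
]
uplift = promo_uplift(rows)
for store, value in sorted(uplift.items()):
    print(f"store {store}: {value:+.1%}")
```

A raw comparison like this ignores seasonality, holidays, and competitor effects, so it is best read as a first descriptive check before any modeling.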
Outcome:
This project empowers Rossmann Pharmaceuticals to shift from intuition-based decisions to data-driven strategies, improving the effectiveness of store-specific promotions, customer engagement, and overall profitability. The insights from the promotion analysis further help refine the allocation of resources for tailored and impactful marketing strategies.
Technologies/Tools Used: Python (Pandas, NumPy, Scikit-learn); Machine Learning (regression models, time series forecasting models, ensemble methods); Tools (Jupyter Notebooks, Git/GitHub).
The Ethiopian Medical Business Data Warehouse & Analytics Platform aims to enhance the efficiency of Ethiopia's healthcare sector by creating a robust data warehouse. The project will extract data and images from public Telegram channels related to Ethiopian medical businesses, perform object detection on the images, and clean, transform, and store the extracted data in the warehouse. The main goal is to provide a unified solution for data analysis, supporting informed decision-making and driving strategic advancements in healthcare.
Technologies/Tools Used: Python, DBT, SQL, ETL, PostgreSQL, FastAPI, Pandas, Pytest, SQLAlchemy, YOLOv5, Postman, CI/CD, Jupyter Notebook, Git.
This project aims to develop advanced machine learning models for credit risk assessment and loan optimization in the context of a buy-now-pay-later service. The key objectives of this project are:
Customer Segmentation: Segment customers using RFMS scores to classify them into high-risk and low-risk groups, enabling tailored BNPL or loan services.
Credit Scoring Model: Create a machine learning model that can accurately predict the credit risk and default probability of new customers applying for the BNPL service.
Loan Optimization Model: Develop a machine learning model that can determine the optimal loan amount, repayment period, and other terms for each applicant based on their credit profile and other relevant factors.
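Weight of Evidence (WoE), listed among this project's tools, is a common way to encode a categorical feature such as a risk segment before fitting a credit scoring model. The sketch below is a minimal standard-library version under stated assumptions, not the project's code: the segment labels, the sample data, the additive `smoothing` term, and the sign convention (log of the share of non-defaulters over the share of defaulters) are all illustrative choices.

```python
import math
from collections import Counter


def weight_of_evidence(segments, defaulted, smoothing=0.5):
    """Return {segment: WoE}, with WoE = ln(share of goods / share of bads).

    `smoothing` is added to every count so that a segment with no
    defaulters (or no non-defaulters) does not produce log(0).
    """
    good, bad = Counter(), Counter()
    for segment, is_default in zip(segments, defaulted):
        if is_default:
            bad[segment] += 1
        else:
            good[segment] += 1

    labels = sorted(set(good) | set(bad))
    total_good = sum(good.values()) + smoothing * len(labels)
    total_bad = sum(bad.values()) + smoothing * len(labels)

    woe = {}
    for label in labels:
        good_share = (good[label] + smoothing) / total_good
        bad_share = (bad[label] + smoothing) / total_bad
        woe[label] = math.log(good_share / bad_share)
    return woe


# Made-up sample: 1 = customer defaulted, 0 = customer repaid.
segments = ["high_rfms"] * 5 + ["low_rfms"] * 5
defaulted = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]
woe = weight_of_evidence(segments, defaulted)
```

Under this convention a positive WoE marks a segment with relatively fewer defaulters; some references use the opposite sign, so the convention should be fixed before comparing values.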
Technologies/Tools Used: Python, Pandas, NumPy, Feature Engineering, WoE, Scikit-learn, Matplotlib, Jupyter Notebook, FICO Scoring, EDA (Exploratory Data Analysis), CI/CD, Git.
This project is dedicated to advancing risk and predictive analytics for car insurance planning and marketing. The primary objectives are to optimize insurance processes, improve risk assessment, and enhance customer experience through predictive modeling and data analytics.
Technologies/Tools Used: Python, Data Version Control (DVC), Pandas, NumPy, Feature Engineering, Scikit-learn, Matplotlib, A/B Testing, EDA (Exploratory Data Analysis), Jupyter Notebook, CI/CD, Git.
The primary objective of this project is to analyze how significant events such as political decisions, conflicts in oil-producing regions, global economic sanctions, and changes in OPEC policies impact the price of Brent oil. This project will:
Identify Key Events: Pinpoint the major events over the past decade that have significantly influenced Brent oil prices.
Measure Impact: Assess the degree to which these events contribute to price fluctuations.
Provide Actionable Insights: Deliver clear, actionable insights that will assist investors, policymakers, and energy companies in understanding and responding to these price changes effectively.
By tackling this issue, Birhan Energies aims to empower its clients to make informed decisions, manage risks more efficiently, and optimize strategies for investment, policy development, and operational planning within the energy sector.
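One simple way to approach the "Measure Impact" step is to locate the point where the average price level shifts and compare the means on either side. The sketch below does this with a least-squares split using only the standard library. It is a deliberately simpler stand-in and not the Bayesian approach the project's tool list suggests; the function names and the price series are hypothetical.

```python
from statistics import fmean


def squared_error(values):
    """Sum of squared deviations of `values` from their own mean."""
    m = fmean(values)
    return sum((v - m) ** 2 for v in values)


def best_single_change_point(prices):
    """Return (k, mean_before, mean_after) for the best single split.

    k is the index that splits `prices` into prices[:k] and prices[k:]
    with the smallest total within-segment squared error.
    """
    if len(prices) < 2:
        raise ValueError("need at least two observations")
    best_k = min(
        range(1, len(prices)),
        key=lambda k: squared_error(prices[:k]) + squared_error(prices[k:]),
    )
    return best_k, fmean(prices[:best_k]), fmean(prices[best_k:])


# Made-up price series with a level shift after the fourth observation.
prices = [60.0, 61.0, 59.0, 60.0, 80.0, 81.0, 79.0, 80.0]
k, before, after = best_single_change_point(prices)
print(f"shift at index {k}: {before:.2f} -> {after:.2f}")
```

A detected shift can then be compared against the dates of known events; the split by itself shows only that the level changed, not what caused it.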
Technologies/Tools Used: Python, Pandas, NumPy, Matplotlib, Plotly, Jupyter Notebook, Scikit-learn, PyMC3, LSTM, ARIMA, CI/CD, Git.
Statistical Techniques: Bayesian Inference, Probability Distributions, Statistical Modeling, Bayesian Modeling
The Fraud Detection project for E-commerce and Banking Transactions aims to significantly improve the identification of fraudulent activities within these sectors. It focuses on developing advanced machine learning models that analyze transaction data, employ feature engineering techniques, and implement real-time monitoring systems to achieve high accuracy in fraud detection.
Technologies/Tools Used: Python, Flask, API, Model Explainability (LIME & SHAP), Pandas, NumPy, MLflow, Scikit-learn, Matplotlib, EDA (Exploratory Data Analysis), Jupyter Notebook, CI/CD, Git.
This project developed a comprehensive sentiment analysis and marketing dashboard solution for an Ethiopian bank, leveraging natural language processing (NLP) techniques. By integrating data from the bank's mobile app, social media, and other customer channels, the dashboard provided the marketing and sales teams with real-time insights on customer sentiment, ad performance, app usage, and subscriber growth.
Technologies/Tools Used: Python, NLP, ETL, Apache Superset, Kedro pipeline, Pandas, NumPy, Docker, EDA (Exploratory Data Analysis), CI/CD, Jupyter Notebook, unittest, Git.
This project focuses on the detailed analysis of a large corpus of financial news data to discover correlations between news sentiment and stock market movements. The objective is to enhance the predictive analytics capabilities of Nova Financial Solutions, a financial company, to significantly boost its financial forecasting accuracy and operational efficiency through advanced data analysis.
The project consists of two main tasks:
1. Sentiment Analysis: Use NLP to quantify sentiment in financial news headlines, linking sentiment scores to corresponding stock symbols to understand news tone around specific stocks.
2. Correlation Analysis: Analyze statistical correlations between news sentiment and stock price movements by examining stock price changes around the publication date to assess sentiment impact on performance.
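The two tasks above can be sketched end to end in a few lines. This is a toy illustration under stated assumptions, not the project's pipeline: the tiny word lists stand in for a real NLP sentiment model, and the headlines, the ticker name, and the return figures are invented for the example.

```python
import math

# Hypothetical mini-lexicon; a real project would use an NLP library.
POSITIVE = {"beats", "surges", "upgrade", "record"}
NEGATIVE = {"misses", "plunges", "downgrade", "lawsuit"}


def headline_score(headline):
    """Count positive words minus negative words in a headline."""
    words = [w.strip(".,!?:;") for w in headline.lower().split()]
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Made-up headlines for one ticker, paired with made-up daily returns.
headlines = [
    "ACME beats estimates, stock surges",
    "ACME misses revenue targets",
    "ACME faces lawsuit and analyst downgrade",
    "ACME posts record quarter",
]
returns = [0.03, -0.01, -0.04, 0.02]

scores = [headline_score(h) for h in headlines]
r = pearson(scores, returns)
print(f"sentiment scores: {scores}, correlation with returns: {r:.3f}")
```

With real data, the scores would first be aggregated per stock symbol and trading day, then aligned with returns around the publication date before computing the correlation.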
Technologies/Tools Used: Python, NLP, ETL, Apache Superset, Kedro pipeline, NumPy, Pandas, Docker, EDA (Exploratory Data Analysis), CI/CD, Jupyter Notebook, unittest, Git.
This project aims to enhance MoonLight Energy Solutions' operational efficiency and sustainability through data-driven analysis. By identifying key environmental trends and high-potential regions for solar installations, it aligns with the company’s long-term goals. A dynamic dashboard is developed to provide real-time insights.
Technologies/Tools Used: Python, Pandas, Streamlit Dashboard, EDA (Exploratory Data Analysis), CI/CD, Jupyter Notebook, Pytest, Git.