Tadesse Kebede
Before investing in a business, it is essential to understand the field well. This project analyses TellCo's users to determine whether the company is worth buying or selling. The analysis uses a month of aggregated xDR (data session) records. In this work, data cleaning, transformation, exploration, and analysis tasks are performed; finally, user overview, engagement, experience, and satisfaction analytics are carried out and TellCo's productivity is predicted. The data contains 150,001 rows and 55 columns, with 12.72% null values. Tasks such as checking the null percentage, cleaning the data, examining the data distribution (skewness), dropping rows from columns with only a small share of nulls, and filling object columns with the mode were performed. 1,495 rows and 11 columns containing nulls were removed, reducing the data to 148,506 rows and 44 columns; the null percentage after cleaning is 0.0%.
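Below is a minimal pandas sketch of these cleaning steps; the file name and drop thresholds are assumptions, not the exact values used in the project.

```python
import pandas as pd

# Hypothetical file name for the month of aggregated xDR records.
df = pd.read_csv("telco_xdr.csv")

# Null percentage before cleaning.
print(f"nulls before: {df.isna().sum().sum() / df.size:.2%}")

# Drop columns that are mostly null (the 70% threshold is an assumption).
df = df.dropna(axis=1, thresh=int(0.7 * len(df)))

# Drop rows with nulls in columns that have only a small share of nulls.
sparse_null_cols = df.columns[(df.isna().mean() > 0) & (df.isna().mean() < 0.05)]
df = df.dropna(subset=list(sparse_null_cols))

# Fill the remaining object (categorical) columns with the mode.
for col in df.select_dtypes(include="object"):
    df[col] = df[col].fillna(df[col].mode()[0])

print(f"nulls after: {df.isna().sum().sum() / df.size:.2%}")
```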
APPROACHES
Understand the dataset and identify missing values and outliers, if any, using visual and quantitative methods to get a sense of the story the data tells.
Identifying the top 10 handsets used by the customers, then the top 3 handset manufacturers, and finally the top 5 handsets per top-3 manufacturer.
Aggregating, per user: the number of xDR sessions, the session duration, the total download (DL) and upload (UL) data, and the total data volume (in Bytes) per application during the session.
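A minimal sketch of this aggregation, continuing with the cleaned df from the sketch above; the xDR column names here are assumptions about the schema.

```python
# Column names below are assumptions about the xDR schema.
per_user = df.groupby("MSISDN/Number").agg(
    xdr_sessions=("Bearer Id", "count"),
    session_duration=("Dur. (ms)", "sum"),
    total_dl=("Total DL (Bytes)", "sum"),
    total_ul=("Total UL (Bytes)", "sum"),
)
per_user["total_volume"] = per_user["total_dl"] + per_user["total_ul"]
```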
Analysis
Univariate, Bivariate, and Multivariate analysis
Correlation Analysis
User engagement Analysis
Using the k-means clustering algorithm, grouping users into k engagement clusters based on the engagement metrics.
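A minimal sklearn sketch of this clustering step, reusing the hypothetical per_user table from above; k=3 is an assumption (in practice k would be chosen with, e.g., the elbow method).

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Engagement metrics per user (column names as in the aggregation sketch).
metrics = per_user[["xdr_sessions", "session_duration", "total_volume"]]

# Scale first so no single metric dominates the distance computation.
scaled = StandardScaler().fit_transform(metrics)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
per_user["engagement_cluster"] = kmeans.fit_predict(scaled)
```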
Experience and Satisfaction Analytics.
Using A/B testing to test whether the ads the advertising company ran produced a significant lift in brand awareness. Comparing machine learning models with A/B testing gave me insight into which approach suits which problem.
Invariant metrics: used to ensure that the experiment (the way we presented a change to part of the population) is not inherently flawed, e.g. the number of users in both groups.
Evaluation metrics: metrics we expect to change and that are relevant to the goals we aim to achieve, e.g. brand awareness.
Hypothesis testing for A/B testing
We use hypothesis testing to test two hypotheses. Null hypothesis: there is no difference in brand awareness between the exposed and control groups. Alternative hypothesis: there is a difference in brand awareness between the exposed and control groups.
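One standard way to run this test is a two-proportion z-test; the counts below are made-up illustrations, not the experiment's numbers.

```python
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical data: "yes" responses and group sizes for exposed vs control.
yes_counts = [310, 264]
group_sizes = [4000, 4000]

stat, p_value = proportions_ztest(count=yes_counts, nobs=group_sizes)
print(f"z = {stat:.3f}, p = {p_value:.4f}")
# Reject the null hypothesis of equal brand awareness if p_value < 0.05.
```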
Carried out 3 types of classification analysis to predict whether a user responds yes to brand awareness, namely: Logistic Regression, Decision Trees, and XGBoost; then compared the different classification models to assess the best performing one(s).
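A minimal sketch of such a comparison via cross-validation; the synthetic data and hyperparameters are assumptions standing in for the real features and labels.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Synthetic stand-in for the real features and "responded yes" labels.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=5),
    "xgboost": XGBClassifier(eval_metric="logloss"),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```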
Sales forecasting is the process of estimating future revenue by predicting the amount of product or services a sales unit will sell in the next week, month, quarter, or year. Machine learning is used in business forecasting to increase the efficiency of the business. Rossmann Pharmaceuticals needs to forecast sales in all of its stores across several cities six weeks ahead of time. Managers in individual stores rely on their years of experience and personal judgment to forecast sales, but sometimes these assumptions negatively affect the business. The objective of this project is to build and serve an end-to-end product that delivers this prediction to analysts in the finance team.
APPROACHES
Feature engineering: derived features for better insights and prediction.
Pipeline: scaled features and used sklearn’s pipeline.
Machine learning: used multiple features and applied linear regression, random forest regressor, XGBoost, and LightGBM models to predict sales (a pipeline sketch follows this list).
Deep learning: used an LSTM recurrent neural network that takes six weeks of historical sales data and makes predictions for future sales.
Built and deployed a prediction dashboard using Streamlit and Heroku.
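A minimal sketch of the scaling-plus-model pipeline with one of the listed models; the file name, feature set, and hyperparameters are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical engineered-feature table with a "Sales" target column.
data = pd.read_csv("rossmann_features.csv")
X = data.drop(columns=["Sales"])
y = data["Sales"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=100, random_state=42)),
])
pipe.fit(X_train, y_train)
print(f"R^2 on held-out data: {pipe.score(X_test, y_test):.3f}")
```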
Technologies used:
python, sklearn, tensorflow, keras, librosa, streamlit, heroku, dvc, mlflow, git
We created a web app that receives sound recordings of people reading a given text displayed on our front end. We saw that the amount of data was a crucial factor in the performance of our deep learning model for Amharic speech-to-text conversion, so we collected data using Apache tools, implementing Apache Spark, Airflow, and Kafka concepts.
APPROACHES
Combined the implementations of Apache Kafka, Airflow, and Spark for better data collection.
Having seen the data shortage in the Amharic language, we tried to collect a large corpus of the language to build a more robust data pipeline.
For this project, we use Kafka as our broker, Airflow as our event listener and initiator, and Spark for the data transformation and cleaning (a minimal producer sketch follows).
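A minimal kafka-python sketch of how a recording submitted through the web app might be published to the broker; the server address, topic name, and payload fields are assumptions.

```python
import json
from kafka import KafkaProducer

# Broker address and topic name are assumptions.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Hypothetical payload: metadata for one recording submitted via the web app.
producer.send("amharic-audio", {"text_id": 42, "audio_path": "records/42.wav"})
producer.flush()
```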
METRICS:
We used WER (Word Error Rate) and implemented different data pipelines to achieve the smallest WER on our machine learning model.
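For reference, a minimal sketch of WER as word-level edit distance divided by the number of reference words.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```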
In this project, I retrieved data through the API provided by USGS 3DEP (United States Geological Survey 3D Elevation Program). AgriTech is a company working on maize farms, and this project studies water flow across maize farms in different geographical areas. Extraction, visualization, and transformation of the data were achieved in this project.
APPROACHES
Features:
Download point cloud data from the EPT resource on AWS cloud storage (see the PDAL sketch after this list).
Terrain visualization
Data transformation
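A minimal PDAL sketch of the download step; the 3DEP dataset name and the bounds (in EPSG:3857) are placeholder assumptions.

```python
import json
import pdal

# Dataset name and bounds are assumptions; usgs-lidar-public is the 3DEP EPT bucket.
pipeline_json = json.dumps({
    "pipeline": [
        {
            "type": "readers.ept",
            "filename": "https://s3-us-west-2.amazonaws.com/usgs-lidar-public/IA_FullState/ept.json",
            "bounds": "([-10425171, -10423171], [5164494, 5166494])",
        },
        {"type": "filters.range", "limits": "Classification[2:2]"},  # ground returns only
        {"type": "writers.las", "filename": "farm_region.las"},
    ]
})

pipeline = pdal.Pipeline(pipeline_json)
pipeline.execute()
points = pipeline.arrays[0]  # structured numpy array with X, Y, Z, ...
```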
Python, PDAL, Laspy, Geopandas, Pydocs, and Heroku
We analyse the Wisconsin Diagnostic Breast Cancer (WDBC) data using machine learning techniques. Because the WDBC data is class-labelled, this is a classification problem. The data consists of 32 attributes, or features, and each record is labelled with one of two classes (B = Benign, M = Malignant). According to all of the tests we conducted using causal inference, the "mean" features, and a low radius mean in particular, had a causal effect on the detection of breast cancer.
APPROACHES
Pandas, Numpy, Matplotlib, Seaborn, and other Python libraries: before starting the machine learning and causal inference, the first thing I did was analyse the data and create some visualizations. I used these libraries for the exploratory data analysis part.
Causalnex library: after finishing my data analysis, the next step was to find the causal relationships between my features and the target variable. To do that I used the widely used causalnex library, with which I was able to draw causal graphs and infer the relations between different fractions of my data.
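A minimal causalnex sketch of this structure-learning step; the file name and the pruning threshold are assumptions.

```python
import pandas as pd
from causalnex.structure.notears import from_pandas

# Hypothetical numeric frame: WDBC features plus an encoded diagnosis column.
df = pd.read_csv("wdbc_numeric.csv")

sm = from_pandas(df)                  # learn a structure model with NOTEARS
sm.remove_edges_below_threshold(0.8)  # prune weak edges (threshold is an assumption)
print(list(sm.edges))                 # candidate causal relations
```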
Jaccard index of similarity: after plotting my ground-truth causal graph, the next step was to compare the graphs learned from different fractions of my data against the ground truth. I used the Jaccard index for this comparison.
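A minimal sketch of the Jaccard index over two graphs' edge sets; the variable names are illustrative.

```python
def jaccard_index(edges_a, edges_b) -> float:
    """Similarity of two causal graphs as the overlap of their edge sets."""
    a, b = set(edges_a), set(edges_b)
    return len(a & b) / len(a | b) if a | b else 1.0

# e.g. compare a graph learned on half the data to the ground-truth graph:
# jaccard_index(sm_half.edges, sm_ground_truth.edges)
```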
The data is downloaded from Downloads – pNEUMA | open-traffic (epfl.ch). pNEUMA is an open large-scale dataset of naturalistic trajectories of half a million vehicles, collected in a one-of-a-kind experiment by a swarm of drones over the congested downtown area of Athens, Greece. Each file for a single (area, date, time) is ~87 MB. After downloading the data, a DAG in Airflow that uses the Python operator is created to load the data files into a MySQL database. Then a program that connects dbt to the data warehouse and runs the data transformations, executed via the Bash or Python operator in Airflow, is developed. Finally, the data is visualized using a dbt view and deployed.
APPROACHES
Set up Airflow with DAG scripts that fetch and store data from CSV files to MySQL and load raw data to dbt (a minimal DAG sketch follows this list).
Extracted data from CSV files and stored it in MySQL.
Applied necessary transformations using dbt.
Visualized transformed data in Redash dashboard.
Dockerized everything.
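A minimal Airflow DAG sketch for the load step; the file path, connection string, and table name are assumptions.

```python
from datetime import datetime

import pandas as pd
from airflow import DAG
from airflow.operators.python import PythonOperator
from sqlalchemy import create_engine

def load_csv_to_mysql():
    # File path, connection string, and table name are assumptions.
    df = pd.read_csv("/data/pneuma_trajectories.csv")
    engine = create_engine("mysql+pymysql://user:password@mysql:3306/traffic")
    df.to_sql("raw_trajectories", engine, if_exists="replace", index=False)

with DAG(
    dag_id="pneuma_elt",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@once",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_csv_to_mysql", python_callable=load_csv_to_mysql)
```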
Technologies used:
python, shell script, MySQL, Apache Airflow, Redash, Docker