Sample Projects

Blaise Papa

Highlights

USGS Lidar Data Engineering


The project aimed to create a data pipeline that lets researchers easily access the newly released United States Geological Survey (USGS) lidar data.

Background


The USGS 3DEP project (United States Geological Survey 3D Elevation Program) responds to the growing need for high-quality topographic data and a wide range of 3D representations of the country's features. You can read about the full project here. The data is stored in a public repository on an Amazon server containing geospatial data for over 1,000 geographical regions. Each dataset is stored in the Entwine Point Tile (EPT) format, a simple tree-based scheme for point cloud data described by a JSON metadata file. To enable processing, the EPT metadata exposes crucial keys such as the dataset bounds and point schema. The raw datasets are, however, complicated to work with directly.

Output:

The project developed the USG3 package, which makes geospatial analysis of this data easier. The Python package is currently in its first version, and future improvements are in the works.

Achievements:

  • Created a function to ingest and process EPT files, giving the user greater flexibility in how they manipulate the EPT file by changing various aspects of the pipeline.

  • Generated raster files from the EPT pipeline; from these, the user can render a raster image of the land overlay.

  • Generated shapefiles containing elevation and shape data; this information can in turn be used to calculate elevation, the topographic wetness index, and gradient.

  • Visualized land overlays from lidar data in both 2D and 3D.
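The ingest-and-process step above can be sketched as a PDAL-style pipeline definition. This is a minimal illustration, not the actual USG3 API: the helper name, region URL, and bounds below are placeholder assumptions.

```python
import json

def build_ept_pipeline(ept_url, bounds, out_laz):
    """Assemble a PDAL-style pipeline that reads a spatial window of an
    EPT dataset and writes the points out as a compressed LAZ file."""
    return json.dumps({
        "pipeline": [
            # readers.ept streams only the tiles intersecting `bounds`
            {"type": "readers.ept", "filename": ept_url, "bounds": bounds},
            # writers.las persists the filtered points for later rasterization
            {"type": "writers.las", "filename": out_laz, "compression": "laszip"},
        ]
    })

# placeholder region and window on the public USGS lidar bucket
pipeline_json = build_ept_pipeline(
    "https://s3-us-west-2.amazonaws.com/usgs-lidar-public/IA_FullState/ept.json",
    "([-10436887, -10425171], [5148706, 5158391])",
    "region.laz",
)
```

Executing the definition would then be a one-liner with the PDAL Python bindings (`pdal.Pipeline(pipeline_json).execute()`), assuming PDAL is installed.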


You can read more about the initial USG3 package created through this project in the GitHub repo and the official package documentation.
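The topographic wetness index mentioned in the achievements above is conventionally defined as ln(a / tan β), where a is the upslope contributing area per unit contour width and β is the local slope. A minimal NumPy sketch, with made-up sample values:

```python
import numpy as np

def topographic_wetness_index(upslope_area, slope_deg):
    """TWI = ln(a / tan(beta)): high values flag flat, water-accumulating cells."""
    slope_rad = np.deg2rad(slope_deg)
    # clip the slope away from zero so flat cells don't divide by zero
    return np.log(upslope_area / np.tan(np.clip(slope_rad, 1e-6, None)))

# sample per-cell values: contributing area (m^2 per metre) and slope (degrees)
area = np.array([50.0, 500.0, 5000.0])
slope = np.array([30.0, 10.0, 1.0])
twi = topographic_wetness_index(area, slope)  # increases as terrain flattens
```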

Visualizations


Alzheimer's Prediction using MRI Scans

The project employed image classifiers to improve the detection of Alzheimer's disease from MRI scans.



Background

Alzheimer’s disease is an irreversible, progressive disorder that slowly destroys memory and thinking skills, eventually preventing the person from carrying out even the simplest everyday tasks. Symptoms mainly appear when the person is in their mid-60s, and the disease is the most common cause of dementia among older adults.

There is currently no cure for the disease, and the best way to curb deaths from it is early treatment, which helps manage symptoms and increases the person's life expectancy. With over 3.48 million people in sub-Saharan Africa suffering from Alzheimer's disease, and an estimated 7.62 million projected to have it by 2030, there is a pressing need to reduce this number.

Medical research has revealed the cause of Alzheimer’s disease is an abnormal build-up of proteins in and around brain cells. Although it’s not known exactly what causes the process to begin, scientists now know that it begins many years before the symptoms appear.

The project seeks to use MRI scans together with deep learning techniques to predict the early onset of Alzheimer's disease. Early prediction will trigger an early response by medical practitioners, allowing proactive treatment that improves and ultimately increases the individual's life expectancy.




Machine Learning

  • The project modeled a 9-layer deep convolutional neural network built with TensorFlow, comprising five convolutional layers and four dense layers.

  • The project also implemented transfer-learning models for image classification. The model used was Inception-v3, a widely used image-recognition model that evolved from the GoogLeNet convolutional neural network. It is praised for achieving high accuracy and is easily adapted to new image-recognition tasks.

  • The model achieved an accuracy of 86% with an F1 score of 85.4%.
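A minimal sketch of how such a 9-layer network (five convolutional plus four dense layers) might be laid out in tf.keras. The filter counts, input size, and four output classes are illustrative assumptions, not the project's exact configuration:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn(input_shape=(128, 128, 1), n_classes=4):
    """9 trainable layers: 5 Conv2D + 4 Dense (all sizes are assumptions)."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Conv2D(128, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(128, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        # one softmax unit per diagnosis class
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()
```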

Visualizations


Pharmtec Sales Prediction

The project employed deep learning models to predict customer sales across several stores, using a simple recurrent neural network and Facebook's Prophet algorithm.

Background

A Pharmtec company wants to forecast sales in its stores across several cities six weeks ahead of time. We use a range of models, from decision trees to deep learning models (LSTM) and Facebook's Prophet algorithm. The project extracts features from the given data and uses them to predict sales across the stores.



Pharmtec Sales Analysis

Machine Learning

  • We employed deep learning models to predict sales, specifically the LSTM algorithm. LSTM was picked for its memory cells, which let it retain relevant aspects of the data across time steps and so perform better on this task. It was benchmarked against a random forest regressor and outperformed it.

  • We integrated Facebook's Prophet algorithm, which accounts for seasonality and can predict sales up to six weeks ahead of time. Prophet proved efficient at learning seasonal trends and hence optimizing predictions.

  • The model was deployed using Flask on Heroku, enabling real-time interaction with the system.
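Before an LSTM can be trained on store sales, the daily series has to be cut into fixed-length windows of past days paired with the next day's target. A minimal NumPy sketch; the window length and toy series are illustrative, not the project's actual preprocessing:

```python
import numpy as np

def make_windows(series, window=7):
    """Turn a 1-D sales series into (samples, window) inputs and next-day targets."""
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array([series[i + window] for i in range(len(series) - window)])
    return X, y

sales = np.arange(30, dtype=float)   # stand-in for 30 days of sales
X, y = make_windows(sales, window=7)
# X has shape (23, 7); an LSTM expects a trailing feature axis, so reshape:
X_lstm = X[..., np.newaxis]          # shape (23, 7, 1)
```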

Visualizations


Swahili speech-to-text


This group project aimed to build a speech-to-text system capable of learning from speech audio and transcribing the predicted text with high accuracy and robustness against background noise.

The system employed an LSTM model, which is favored for speech and audio analysis, along with CTC (Connectionist Temporal Classification) decoding built on a simple RNN.

Background

The project is inspired by the use of speech recognition systems in everyday life. Systems such as Siri, Alexa, and Google Assistant convert speech into action, automating tasks and making work easier.

These speech recognition models are, however, limited to a select few languages, creating a huge language barrier. The project aims to integrate the Swahili language into a speech-to-text system that lets the user speak commands in Swahili, which are then converted into text. The speech-to-text system will be integrated into a mobile app for food purchases.


Architecture:

  • Connectionist Temporal Classification (CTC): a loss formulation well suited to speech recognition. The CTC model consumes the softmax outputs of an RNN (recurrent neural network). It starts with a series of T data frames that are aligned to a target transcript Y of length L, where T must be at least as long as L. A softmax layer maps every data frame to a distribution over output symbols (including a blank), so at every step the model generates a log probability for each class given the new input.
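The frame-to-symbol collapsing described above can be illustrated with a greedy (best-path) CTC decoder: take the argmax class per frame, merge consecutive repeats, and drop blanks, so T frames shrink to a shorter label sequence. The frame sequence below is a made-up toy example:

```python
import numpy as np

BLANK = 0  # index conventionally reserved for the CTC blank symbol

def ctc_greedy_decode(probs):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = np.argmax(probs, axis=1)                # length-T best path
    merged = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return [int(k) for k in merged if k != BLANK]  # output length L <= T

# toy softmax output: T=6 frames over 4 classes (class 0 = blank)
frames = np.eye(4)[[1, 1, 0, 2, 2, 3]]   # one-hot rows standing in for softmax
decoded = ctc_greedy_decode(frames)       # -> [1, 2, 3]
```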

Metrics:

  • WER (Word Error Rate): the standard metric for speech recognition. It is computed as (S + D + I) / N, where S, D, and I are the word-level substitutions, deletions, and insertions needed to turn the predicted transcript into the reference, and N is the number of words in the reference.
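The metric above is just word-level edit distance normalized by reference length; a compact dynamic-programming sketch (the Swahili phrases in the comment are toy inputs):

```python
def wer(reference, hypothesis):
    """Word Error Rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1  # substitution cost
            dp[i][j] = min(dp[i - 1][j] + 1,      # deletion
                           dp[i][j - 1] + 1,      # insertion
                           dp[i - 1][j - 1] + cost)
    return dp[-1][-1] / len(ref)

# e.g. wer("habari ya asubuhi", "habari za asubuhi") -> one substitution in three words
```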


Visualizations


Telecom Data Analysis

This project used Telecom data to analyze the market share of a mobile company.

It employed a clustering algorithm (k-means) to group telecom customers according to behavior, and through this determined the best KPIs for investment.

Background

An investor is interested in buying a telecom company, TellCo. We analyze and provide a report describing the opportunities for growth and recommend whether TellCo is worth buying or selling. We achieve this by analyzing a telecommunications dataset containing useful information about customers and their activities on the network, delivering the insights we extracted to the employer through an easy-to-use web-based dashboard and a written report.



Architecture:

  • The project used clustering and tree algorithms to group users into different segments according to their usage behavior.
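The grouping step can be sketched with a minimal k-means loop in NumPy. The two engagement features (session count and total duration) and the synthetic user blobs below are made-up stand-ins for the real telecom metrics:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means: deterministic spread-out init, then assign/update loops."""
    # pick k points spread across the dataset as starting centers
    # (a production run would use k-means++ initialization instead)
    centers = X[np.linspace(0, len(X) - 1, k).astype(int)].copy()
    for _ in range(iters):
        # assign every point to its nearest center (squared Euclidean distance)
        labels = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1).argmin(1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

rng = np.random.default_rng(0)
# two synthetic user groups: light users vs heavy users (sessions, duration)
light = rng.normal([5.0, 100.0], 1.0, size=(20, 2))
heavy = rng.normal([50.0, 1000.0], 1.0, size=(20, 2))
X = np.vstack([light, heavy])
labels, centers = kmeans(X, k=2)
```

Per-cluster averages of the real KPIs would then be compared across the resulting segments to pick the most promising one for investment.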


Visualizations