Projects

Amon Kimutai

Highlights

A/B testing is a user experience research methodology. An A/B test is a randomized experiment with two variants, A and B, which are identical except for one variation that might affect a user's behavior. It applies statistical hypothesis testing, or "two-sample hypothesis testing", as used in the field of statistics. This project focused on evaluating the lift in brand awareness at SmartAd after it introduced an additional service called the Brand Impact Optimiser (BIO), a lightweight questionnaire served with every campaign to determine the impact of the creative.

A/B TESTING

Metric Choice:

  • Invariant metrics - used to verify that the experiment (the way we presented a change to part of the population) is not inherently flawed, e.g. the number of users in both groups.

  • Evaluation metrics - metrics we expect to change and that are relevant to the goals we aim to achieve, e.g. brand awareness.

Hypothesis Testing for A/B Testing:

  • We use hypothesis testing to test two hypotheses (a test sketch follows below). Null Hypothesis: there is no difference in brand awareness between the exposed and control groups. Alternative Hypothesis: there is a difference in brand awareness between the exposed and control groups.
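
As a sketch of how this test could be run in Python, a two-sample proportions z-test from statsmodels compares the share of "Yes" (brand-aware) responses between the groups. The counts below are placeholders, not the experiment's actual numbers:

    import numpy as np
    from statsmodels.stats.proportion import proportions_ztest

    # Placeholder counts of brand-aware ("Yes") responses per group
    yes = np.array([310, 349])    # control, exposed
    n = np.array([4000, 4000])    # respondents per group

    stat, p_value = proportions_ztest(count=yes, nobs=n, alternative="two-sided")
    if p_value < 0.05:
        print(f"p={p_value:.4f}: reject the null hypothesis")
    else:
        print(f"p={p_value:.4f}: fail to reject the null hypothesis")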

Machine Learning

  • Trained a machine learning model with 5-fold cross-validation using three different algorithms (Logistic Regression, Decision Trees, and XGBoost) to identify the features that contribute most to 'Yes' responses and therefore lift brand awareness, given that the new ad has an added feature. A cross-validation sketch follows below.
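
A minimal sketch of the cross-validation setup; X and y stand for the campaign features and the encoded Yes/No responses, and the hyperparameters are assumptions:

    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.linear_model import LogisticRegression
    from sklearn.tree import DecisionTreeClassifier
    from xgboost import XGBClassifier

    # Compare the three algorithms on the same 5-fold split
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    models = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "decision_tree": DecisionTreeClassifier(random_state=42),
        "xgboost": XGBClassifier(eval_metric="logloss"),
    }
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")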


Pharmaceutical sales prediction across multiple stores


Accurate sales prediction is vital for any company because it balances supply and demand, allowing the company to serve its customers well. In this project, I produced a six-week sales forecast for a pharmaceutical company with multiple stores, whose sales are influenced by factors such as promotions, competition, school and state holidays, seasonality, and locality.

APPROACH

Exploration:

  • Performed data exploration to understand the nature of the data for proper modelling. Data exploration also provided prior knowledge of customers' purchasing power.

Predictions:

  • A machine learning model and a deep learning model were used.

Machine Learning - the model was built as a scikit-learn pipeline using the Random Forest Regressor algorithm, and a loss function was used for model evaluation.
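
A minimal sketch of such a pipeline; the scaler, hyperparameters, and RMSE as the loss are assumptions, and X_train, y_train, X_test, y_test stand for the prepared store data:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    # Scale the features, then fit the regressor, all in one pipeline
    pipeline = Pipeline([
        ("scale", StandardScaler()),
        ("model", RandomForestRegressor(n_estimators=200, random_state=42)),
    ])
    pipeline.fit(X_train, y_train)

    # Loss-based evaluation on held-out data
    rmse = np.sqrt(mean_squared_error(y_test, pipeline.predict(X_test)))
    print(f"RMSE: {rmse:.2f}")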

Deep Learning - an RNN model was built with the TensorFlow library to predict sales for the different stores.
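
A sketch of what such an RNN could look like in TensorFlow/Keras, assuming a sliding-window formulation; the window length and layer sizes are illustrative:

    import tensorflow as tf

    WINDOW = 30      # look back 30 days (assumed)
    N_FEATURES = 1   # daily sales only; extra store features could be appended

    # Map a window of past sales to the next day's sales
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(WINDOW, N_FEATURES)),
        tf.keras.layers.LSTM(64),
        tf.keras.layers.Dense(1),  # next-day sales
    ])
    model.compile(optimizer="adam", loss="mse")
    # model.fit(X_windows, y_next, epochs=20, validation_split=0.1)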

Speech recognition system


Speech recognition technology allows for hands-free control of smartphones, speakers, and even vehicles in a wide variety of languages. Companies have moved towards the goal of enabling machines to understand and respond to more and more of our verbalized commands. In this project, we worked in a group of nine to build a system that converts Swahili speech to text. The World Food Program wanted to deploy an intelligent form that collects nutritional information on food bought and sold at markets in two different countries in Africa: Ethiopia and Kenya.

APPROACH

Preprocessing:

  • To feed the audio signal to the model for conversion, the data had to be processed into the right format. All signals were made to have the same number of channels by converting mono channels to stereo. The sampling rate was then standardized, and the clips were padded to the same length. Data augmentation was applied before extracting Mel-Frequency Cepstral Coefficients (MFCCs) as audio features.
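
A minimal preprocessing sketch with librosa (an assumed library choice; note that librosa works with mono arrays, so this version standardizes channels to mono rather than stereo):

    import librosa
    import numpy as np

    TARGET_SR = 16_000           # assumed sampling rate
    TARGET_LEN = TARGET_SR * 5   # pad/trim every clip to 5 seconds (assumed)

    def preprocess(path: str) -> np.ndarray:
        # Load and resample so every clip shares the same sampling rate
        y, sr = librosa.load(path, sr=TARGET_SR, mono=True)
        # Pad (or trim) so every clip has the same length
        if len(y) < TARGET_LEN:
            y = np.pad(y, (0, TARGET_LEN - len(y)))
        else:
            y = y[:TARGET_LEN]
        # Extract MFCC features with shape (n_mfcc, frames)
        return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)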

Modelling:

  • The extracted features were then fed to an acoustic model, which maps the audio signal to the basic units of speech such as phonemes or graphemes.

  • A deep learning model was built to convert speech to text, using the Connectionist Temporal Classification (CTC) algorithm for training and inference.
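
A sketch of a CTC-trained acoustic model in TensorFlow; the layer sizes, grapheme count, and shapes are illustrative assumptions, not the team's actual architecture:

    import tensorflow as tf

    NUM_CLASSES = 30   # assumed Swahili grapheme set + CTC blank
    N_MFCC = 13        # MFCC features per frame

    # Map (time, features) to per-frame grapheme logits
    inputs = tf.keras.Input(shape=(None, N_MFCC))
    x = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
    logits = tf.keras.layers.Dense(NUM_CLASSES)(x)
    model = tf.keras.Model(inputs, logits)

    def ctc_loss(labels, logits, label_len, logit_len):
        # CTC aligns frame-level logits with the shorter label sequence,
        # so no frame-by-frame transcript alignment is needed
        return tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels, logits=logits,
            label_length=label_len, logit_length=logit_len,
            logits_time_major=False, blank_index=-1))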

Metrics:

  • The Word Error Rate (WER) was used for model evaluation.
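
WER is the word-level edit distance between the reference transcript and the hypothesis, normalized by the reference length. A self-contained sketch:

    def wer(reference: str, hypothesis: str) -> float:
        # WER = (substitutions + deletions + insertions) / reference words
        ref, hyp = reference.split(), hypothesis.split()
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[len(ref)][len(hyp)] / len(ref)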

Industry - causality


A common frustration in industry, especially when it comes to getting business insights from tabular data, is that the most interesting questions (from the business perspective) are often not answerable with observational data alone. Causal graphs help in drawing cause-to-effect relationships, including hidden features that may be inferred to cause the outcome or act as an immediate treatment. In this project, I used features computed from a digitized image of a fine needle aspirate (FNA) of a breast mass for diagnosis (benign or malignant).

APPROACH

Data Exploration:

  • The outcome variable (diagnosis) had the following class distribution: 357 benign (not cancer), 212 malignant (cancer).

Feature Selection:

  • Features of greater importance were selected using a Random Forest; out of 32 features, 10 were selected. A selection sketch follows below.
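
A sketch of the selection step with scikit-learn; the estimator settings are assumptions, and X and y stand for the FNA features and diagnosis labels:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.feature_selection import SelectFromModel

    # Rank features by Random Forest importance and keep exactly the top 10
    selector = SelectFromModel(
        RandomForestClassifier(n_estimators=500, random_state=42),
        max_features=10, threshold=-float("inf"),
    )
    X_selected = selector.fit_transform(X, y)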

Causal Graph:

  • The causal graph was created using CausalNex. Graphs were learned on increasing fractions of the data, and the graph learned on the full dataset proved to be the most stable.
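
A sketch of the stability check with CausalNex's NOTEARS structure learner; the edge threshold and fractions are assumptions, and df holds the selected features plus the diagnosis:

    from causalnex.structure.notears import from_pandas

    graphs = {}
    for frac in (0.25, 0.5, 0.75, 1.0):
        # Learn a candidate graph on a growing fraction of the data
        sm = from_pandas(df.sample(frac=frac, random_state=42))
        sm.remove_edges_below_threshold(0.8)  # prune weak edges (assumed threshold)
        graphs[frac] = set(sm.edges)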

Modelling:

  • The Bayesian network (BN), a probabilistic graphical model for representing knowledge about an uncertain domain, was used.

  • The model predicted the outcome (benign or malignant).
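
A sketch of the fit-and-predict step with CausalNex; the estimator settings are assumptions, and train and test are discretised splits of the data:

    from causalnex.network import BayesianNetwork

    bn = BayesianNetwork(sm)                 # sm: the structure learned above
    bn = bn.fit_node_states(df_discretised)  # register each node's possible states
    bn = bn.fit_cpds(train, method="BayesianEstimator", bayes_prior="K2")
    predictions = bn.predict(test, "diagnosis")  # benign vs malignant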

Metrics:

  • The Jaccard similarity index was used to measure the intersection over union of the graph edges (see the sketch after this list).

  • A classification report showing precision, recall, and accuracy was used.

  • The Area Under the Curve (AUC) was also used to assess the accuracy of the model.
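
The edge-overlap measure from the first bullet reduces to a few lines; edge lists are assumed to be tuples of node names:

    def jaccard_edges(g1_edges, g2_edges) -> float:
        # Jaccard similarity = |intersection| / |union| of the two edge sets
        a, b = set(g1_edges), set(g2_edges)
        return len(a & b) / len(a | b)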

Speech-to-text data collection


Streaming data is helpful for improving a model and reducing model drift. Kafka, Airflow, and Spark help automate the process of using streaming data to enhance the model.

In a team of eight members, we designed and built a Kafka cluster that can be used to post a sentence and receive an audio file. The deployed tool handled posting and receiving text and audio files from and into a data lake. To build a robust tool, every member was assigned a part to work on. I worked on Airflow, particularly writing the DAG script to orchestrate the storage of events, that is, triggering Spark and fetching data from the Kafka cluster to an S3 bucket.

APPROACH

Creation of a Kafka cluster:

  • The cluster was set up on the assigned AWS machine. It was responsible for creating an S3 bucket where Spark stored the transformed streaming data from users reading the texts.

Javascript tag:

  • The tag was used in front-end applications to communicate with the Kafka cluster: presenting a sentence to be read by a user and sending the audio and other necessary metadata back to the Kafka cluster.

Spark:

  • Code was written to transform (process) and load the data from the data lake (an S3 bucket). Kafka served as the input source for Spark Structured Streaming, with Delta Lake as the storage layer.
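
A sketch of the streaming job with PySpark; the broker address, topic name, and S3 paths are placeholders:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("stt-stream").getOrCreate()

    # Read audio-metadata events from Kafka as a streaming DataFrame
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "broker:9092")
              .option("subscribe", "stt-audio")
              .load())

    # Kafka delivers key/value as binary; decode before transforming
    decoded = events.selectExpr("CAST(value AS STRING) AS payload")

    # Write the transformed stream to Delta Lake on S3
    query = (decoded.writeStream
             .format("delta")
             .option("checkpointLocation", "s3a://bucket/checkpoints/")
             .start("s3a://bucket/delta/stt-events/"))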

Airflow:

  • Since the main purpose of this pipeline was to facilitate proper utilization of the streaming data, all of these tasks had to be scheduled and ordered for robust performance; Airflow was therefore an appropriate tool for task scheduling. The DAG script orchestrated the storage of events: fetching data from the Kafka cluster to the S3 bucket and triggering Spark for data transformation and loading. In our case, the process was scheduled to run every hour.
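
A sketch of what such a DAG could look like in Airflow; the task names and the scripts they call are hypothetical placeholders:

    from datetime import datetime, timedelta
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    with DAG(
        dag_id="stt_streaming_pipeline",
        start_date=datetime(2022, 1, 1),
        schedule_interval=timedelta(hours=1),  # run every hour, as in the project
        catchup=False,
    ) as dag:
        fetch_from_kafka = BashOperator(
            task_id="fetch_from_kafka",
            bash_command="python fetch_kafka_to_s3.py",  # hypothetical script
        )
        trigger_spark = BashOperator(
            task_id="trigger_spark_transform",
            bash_command="spark-submit transform_and_load.py",  # hypothetical script
        )
        # Fetch first, then transform and load
        fetch_from_kafka >> trigger_spark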