XGBoost forecast versus actual weekly sales for The Alchemist.
XGBoost forecast versus actual weekly sales for The Very Hungry Caterpillar.
Context:
This project explored demand forecasting for two popular book titles using historical sales data from Nielsen. Accurate weekly and monthly forecasts are vital for publishers managing print runs, inventory, and marketing. The project compares classical, machine learning, deep learning, and hybrid approaches to identify the best-performing method.
My role:
I led the full end-to-end modelling pipeline: from data cleaning and exploratory analysis through to implementing and comparing SARIMA, XGBoost, LSTM, and hybrid models. I also structured the evaluation framework and produced visual forecasts with confidence intervals.
Process:
Sales data for The Alchemist and The Very Hungry Caterpillar (from 2012 onwards) were resampled to fill missing weeks and split into training and testing periods. After SARIMA modelling and feature engineering, I built XGBoost pipelines using lag-based features, designed and tuned LSTM models, and tested SARIMA–LSTM hybrid strategies. Monthly aggregation models were also implemented to support broader planning cycles. Each approach was assessed using MAE and MAPE over 32-week and 8-month test windows.
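The lag-feature forecasting step above can be sketched as follows. This is a minimal illustration on synthetic weekly data (the real Nielsen series is not reproduced here), using scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost; the lag choices and the 32-week test window mirror the setup described, but are otherwise assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic weekly sales with a yearly cycle (stand-in for the Nielsen data)
rng = np.random.default_rng(0)
idx = pd.date_range("2012-01-01", periods=400, freq="W")
sales = 500 + 100 * np.sin(2 * np.pi * np.arange(400) / 52) + rng.normal(0, 20, 400)
df = pd.DataFrame({"sales": sales}, index=idx)

# Lag-based features: previous week, two weeks back, and the same week last year
for lag in (1, 2, 52):
    df[f"lag_{lag}"] = df["sales"].shift(lag)
df = df.dropna()

# Hold out the final 32 weeks, matching the weekly evaluation window
train, test = df.iloc[:-32], df.iloc[-32:]
X_cols = [c for c in df.columns if c.startswith("lag_")]
model = GradientBoostingRegressor(random_state=0)
model.fit(train[X_cols], train["sales"])
pred = model.predict(test[X_cols])

# MAE and MAPE, the two metrics used to compare models
mae = np.mean(np.abs(test["sales"] - pred))
mape = np.mean(np.abs((test["sales"] - pred) / test["sales"])) * 100
```

The lag-52 feature is what lets a tree ensemble pick up yearly seasonality without an explicit seasonal model.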
Outcome:
XGBoost outperformed all other models across both books and both forecast frequencies, achieving <10% MAPE in weekly predictions. SARIMA performed moderately well but struggled with irregular monthly patterns. LSTM models consistently underfitted the data, and hybrids showed no advantage over standalone SARIMA. The project confirms that a well-structured XGBoost pipeline offers the most reliable and interpretable results for sales forecasting.
Final 100 weeks of actual vs forecast sales using a SARIMA(1,0,1)(1,1,1,52) model. The shaded band highlights forecast uncertainty, with moderate alignment to true values but limited ability to capture post-peak dynamics.
Boxplots showing key features that distinguish filled from unfilled orders during Buy (top row) and Sell (bottom row) weeks. Filled orders tend to show stronger intra-week movement in the trend direction, confirming the model's ability to assess execution risk.
Context:
This project developed a two-stage trading model to support weekly ETF investing under real-world constraints — where all decisions must be made before Monday market open. It predicts both the upcoming price trend and whether a limit order would realistically be executed within the week.
My role:
I led the entire project from design to deployment. This included technical feature engineering (EMAs and MACD), trend labelling logic, classification model development, and the creation of a fully modular Python pipeline. I also produced a visual summary report for portfolio use.
Process:
Using only lagged weekly OHLC data, I built two decision tree classifiers. The first predicts the next week's trend direction (Buy / Sell / Sideways) using engineered indicators. The second estimates whether a directional limit order would be filled based on intra-week price movement. The project was implemented as a clean, script-based repo with a run_project.py controller and full modularity.
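A minimal sketch of the trend-classification stage, under simplifying assumptions: synthetic closes stand in for the weekly OHLC data, the label set is reduced to binary up/down rather than Buy / Sell / Sideways, and the indicators are the standard EMA-12 / EMA-26 / MACD trio computed from lagged prices only.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic weekly close prices (random walk stand-in for real ETF data)
rng = np.random.default_rng(1)
close = pd.Series(100 + np.cumsum(rng.normal(0, 1, 300)))

# Technical indicators; shifting by one week ensures no look-ahead
ema12 = close.ewm(span=12, adjust=False).mean()
ema26 = close.ewm(span=26, adjust=False).mean()
macd = ema12 - ema26
X = pd.DataFrame({"ema12": ema12, "ema26": ema26, "macd": macd}).shift(1).dropna()

# Binary label: did this week's close rise relative to last week?
y = (close.diff() > 0).astype(int).loc[X.index]

# Shallow tree, matching the truncated depth-2 structure shown in the figure
clf = DecisionTreeClassifier(max_depth=2, random_state=0)
clf.fit(X.iloc[:-52], y.iloc[:-52])
acc = clf.score(X.iloc[-52:], y.iloc[-52:])
```

The one-week shift is the important detail: every feature is known before Monday's open, so the classifier respects the decision-timing constraint described above.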
Outcome:
The trend classifier showed moderate accuracy, but the fill detection models performed strongly — with over 85% accuracy and 96% recall for filled orders. The final system outputs both directional signals and realistic execution guidance, supporting cautious, rules-based trading decisions. The pipeline is generalisable to other ETFs or stocks and ready for future upgrades (e.g. XGBoost, cross-validation).
Truncated decision tree structure (maximum depth = 2) used for weekly trend classification.
t-SNE visualisation showing separation between customer segments.
Context:
A global e-commerce company wants to group customers into segments based on behaviour and value, to improve marketing and retention.
My role:
I performed the full analysis: data cleaning, feature engineering, model selection, and result interpretation.
Process:
I created five behavioural and value features (e.g. Frequency, CLV), explored candidate values of k using the Elbow method, Silhouette scores, and hierarchical clustering, and chose k = 4. I then applied K-means and visualised the resulting segments with boxplots and t-SNE.
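The model-selection step can be sketched as below, on synthetic stand-in data with four planted groups rather than the real customer features; selecting k by Silhouette score is shown, with the Elbow and hierarchical checks omitted for brevity.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for five behavioural/value features, four planted groups
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 5)) for c in (0, 3, 6, 9)])
X = StandardScaler().fit_transform(X)  # scale features before K-means

# Compare candidate k by Silhouette score, then fit the chosen model
scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
segments = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
```

Standardising first matters because K-means is distance-based: an unscaled CLV column would otherwise dominate the clustering.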
Outcome:
Identified four clear customer groups — from high-value loyal buyers to disengaged churn risks. This enabled strategic marketing recommendations.
Boxplots showing distribution of key features by customer segment.
Confusion matrix showing test performance of the neural network model, with accurate predictions across both dropout and completion classes.
Context:
This project explored predictive modelling to identify students at risk of dropping out of an online course. The goal was to support early intervention and improve retention outcomes.
My role:
I led all stages of the machine learning pipeline: from data preparation and feature engineering, through to model development, tuning, evaluation, and results interpretation.
Process:
I trained and evaluated multiple classification models, including XGBoost and a neural network. XGBoost was tuned using cross-validation to improve dropout prediction. The neural network architecture was refined and trained using standardised features and validation curves to monitor performance.
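The cross-validated tuning step can be illustrated as follows. This is a sketch on synthetic data, using scikit-learn's GradientBoostingClassifier in place of XGBoost and a deliberately small parameter grid; the real grid and student features are not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the student dataset (minority class = dropout)
X, y = make_classification(n_samples=600, n_features=10, weights=[0.7, 0.3],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Cross-validated hyperparameter search, scored on AUC as in the evaluation
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    {"max_depth": [2, 3], "learning_rate": [0.05, 0.1]},
    cv=3, scoring="roc_auc",
)
grid.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
```

Scoring the search on AUC rather than accuracy keeps the tuning sensitive to the minority dropout class, which is the class that matters for early intervention.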
Outcome:
Tuned XGBoost emerged as the strongest model, achieving 94.9% accuracy and 0.958 AUC. It outperformed the neural network on all key metrics, with slightly better dropout recall and fewer false negatives. The comparison confirmed that XGBoost was both accurate and interpretable, making it suitable for practical deployment.
Confusion matrices comparing the performance of XGBoost before (left) and after (right) hyperparameter tuning.
Top Topic Keywords
Context:
Understanding customer dissatisfaction is key to improving service and retention. This project analysed over 6,000 low-rated reviews from Google and Trustpilot for a major UK gym chain, using NLP techniques to extract key emotional and topical insights.
My role:
I led the end-to-end process: data cleaning, text preprocessing, sentiment filtering, emotion detection, topic modelling (LDA and BERTopic), and results interpretation across platforms.
Process:
I began by isolating negative reviews and identifying dominant emotions using a pre-trained NLP emotion classifier. I then applied topic modelling to uncover common complaint themes. The LDA model revealed 16 coherent topics, while BERTopic with LLM-enhanced phrases gave additional insight into key complaint clusters.
Outcome:
Emotion analysis showed that anger was the most common emotion across both platforms, followed by sadness and joy. Topic modelling uncovered recurring concerns such as equipment issues, cancellation policies, overcrowding, and staff attitude. These insights provided actionable guidance on service areas needing improvement.
Distribution of dominant emotions in negative reviews from Google (left) and Trustpilot (right).
Anomalies detected using Isolation Forest, visualised in 2D with PCA (3% contamination).
Context:
Monitoring ship engine data helps prevent equipment failure and costly downtime. This project used sensor readings from six engine-related features to detect abnormal behaviour.
My role:
I performed exploratory analysis, applied statistical and machine learning methods, and evaluated model performance to recommend a reliable anomaly detection approach.
Process:
I began with visual analysis and statistical outlier detection using the IQR method. Then I applied One-Class SVM and Isolation Forest, tuning each model to detect 1–5% anomalies. Visualisations with PCA helped interpret results.
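The Isolation Forest step can be sketched as follows, on synthetic six-feature readings with a few injected faults standing in for the real sensor data; the 3% contamination setting matches the final model, while everything else is illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import IsolationForest

# Synthetic six-feature "engine sensor" readings with 30 injected faults
rng = np.random.default_rng(3)
normal = rng.normal(0, 1, size=(970, 6))
faults = rng.normal(6, 1, size=(30, 6))
X = np.vstack([normal, faults])

# Contamination sets the expected anomaly share (3%, as in the final model)
iso = IsolationForest(contamination=0.03, random_state=0)
flags = iso.fit_predict(X)          # -1 = anomaly, 1 = normal
n_anomalies = int((flags == -1).sum())

# Project to 2D with PCA purely for visualisation, as in the figure above
coords = PCA(n_components=2).fit_transform(X)
```

The contamination parameter is what makes the method easy to tune: it directly controls the fraction of points flagged, so the 1–5% sweep described above is a single-parameter search.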
Outcome:
Isolation Forest was the preferred method — efficient, interpretable, and easy to tune using the contamination parameter. The final model detected 3% anomalies, supporting early fault detection without excessive false alarms.