I am interested in extracting useful knowledge and information from the ocean of data, be it in the form of text, audio, equations or images. By combining data, the scientific method and analytical thinking, I aspire to help people make well-informed decisions in all aspects of life. By helping advance data science and artificial intelligence, and advocating for their responsible and equitable use, I hope to help raise the quality of life for all.
See the CV page for my relevant skill set.
(Naturally, I have done much more on the job since then.)
FastAPI, Streamlit
GitHub repo: here.
We provide an end-to-end example based on the 'Disaster Tweet Classification' task, which classifies a tweet (a text string) as describing a real disaster or not. Previously, we performed an exploratory data analysis and ran model architecture and hyperparameter searches in a separate repo.
In this repo, we implement a) a training pipeline, b) a FastAPI inference backend, and c) an example Streamlit app that queries the FastAPI backend.
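A minimal sketch of how the two pieces fit together (the route name, payload schema, model path and port are illustrative assumptions, not the repo's actual code):

```python
# backend.py -- minimal FastAPI inference endpoint (illustrative sketch)
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical serialised text-classification pipeline

class Tweet(BaseModel):
    text: str

@app.post("/predict")
def predict(tweet: Tweet):
    # the pipeline is assumed to map raw text to a 0/1 disaster label
    label = int(model.predict([tweet.text])[0])
    return {"is_disaster": bool(label)}
```

```python
# app.py -- Streamlit front end querying the FastAPI backend (URL assumed)
import requests
import streamlit as st

st.title("Disaster Tweet Classifier")
text = st.text_area("Tweet text")
if st.button("Classify") and text:
    resp = requests.post("http://localhost:8000/predict", json={"text": text})
    st.write("Disaster" if resp.json()["is_disaster"] else "Not a disaster")
```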
PySpark
GitHub repo: here.
We provide an end-to-end PySpark example based on the 'Backblaze hard drive failure prediction' task, in which we predict from a hard drive's S.M.A.R.T. telemetry whether it has failed. The goal of this exercise is to showcase the use of PySpark for data science tasks on a real dataset: from EDA, data cleaning and preprocessing, to feature engineering and scaling, to ML model training and inference.
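A minimal sketch of such a PySpark pipeline, assuming a handful of raw S.M.A.R.T. columns and a random-forest classifier (the column names and model choice are illustrative, not the repo's exact pipeline):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("backblaze-smart").getOrCreate()

# Raw daily S.M.A.R.T. snapshots; 'failure' is the 0/1 label (paths/columns assumed)
df = spark.read.csv("backblaze/*.csv", header=True, inferSchema=True)
df = df.dropna(subset=["smart_5_raw", "smart_187_raw", "smart_197_raw"])

assembler = VectorAssembler(
    inputCols=["smart_5_raw", "smart_187_raw", "smart_197_raw"],
    outputCol="raw_features",
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
clf = RandomForestClassifier(labelCol="failure", featuresCol="features")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, scaler, clf]).fit(train)
preds = model.transform(test).select("failure", "prediction")
```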
Implemented efficient memory management and workflow optimisation to process a large dataset (55M instances) on a local machine. Engineered new features, e.g. geographical information including boroughs, zip codes and the geodesic distance of each trip. Optimised tree-based and neural network models; attained the best test RMSE of 2.96 with LGBMRegressor. Found the test predictions closely following the training set distribution, with peaks attributed to the flat rates to/from the airports. Performed baseline model evaluation on Google Cloud.
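As a rough illustration of the geodesic distance feature and the gradient-boosted model, a sketch along these lines (column names follow the public NYC taxi fare dataset; the file name, sample size and hyperparameters are assumptions):

```python
import pandas as pd
from geopy.distance import geodesic
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a sample of trips (file name and sample size are placeholders)
df = pd.read_csv("train.csv", nrows=100_000, parse_dates=["pickup_datetime"])

# Geodesic (great-circle) trip distance in kilometres
df["trip_km"] = df.apply(
    lambda r: geodesic(
        (r.pickup_latitude, r.pickup_longitude),
        (r.dropoff_latitude, r.dropoff_longitude),
    ).km,
    axis=1,
)
df["hour"] = df.pickup_datetime.dt.hour

X = df[["trip_km", "hour", "passenger_count"]]
y = df["fare_amount"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LGBMRegressor(n_estimators=500, learning_rate=0.05).fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
```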
Cleaned up the dataset and engineered features, e.g. zip codes and datetime features. Optimised models (k-nearest neighbours, tree-based models and dense neural networks), evaluated by validation AUC; best performance attained by RandomForestClassifier. Predicted higher compliance rates in the validation set than in the training set. Fitted the monthly aggregated compliance rate with an AR(1) model. Found dimensionality reduction detrimental to model performance.
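A small sketch of the AR(1) step on a monthly aggregated rate, using statsmodels; the synthetic ticket-level data here is a stand-in for the real dataset:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Illustrative stand-in: one row per record with an issue date and a 0/1 compliance flag
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
tickets = pd.DataFrame({
    "issue_date": rng.choice(dates, size=50_000),
    "compliant": rng.integers(0, 2, size=50_000),
})

# Monthly aggregated compliance rate, then an AR(1) fit and a one-step forecast
monthly = tickets.set_index("issue_date")["compliant"].resample("M").mean()
ar1 = AutoReg(monthly, lags=1).fit()
forecast = ar1.predict(start=len(monthly), end=len(monthly))
```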
Performed text pre-processing and tokenisation, and engineered new features. Utilised NLP techniques: TfidfVectorizer (scikit-learn), an LSTM with GloVe embedding vectors, and a Hugging Face BERT transformer. Found the optimal LSTM architecture using Keras Tuner. Used both text vectors and other engineered features. Found similar F1 test scores across all approaches, with the best score (0.808) from the transformer network. Deployed the model using Flask.
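For illustration, a TF-IDF baseline of the kind described might look like this (the classifier choice, parameters and the two placeholder tweets are assumptions, not the actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: 1 = disaster, 0 = not a disaster
train_texts = ["Forest fire near La Ronge Sask. Canada", "I love fruits"]
train_labels = [1, 0]

# TF-IDF text vectors fed into a simple linear classifier
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
).fit(train_texts, train_labels)

preds = clf.predict(["There is a flood warning in our area"])
```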
Predicted the sales of ~10k items across different shops. Performed time-aware feature engineering, e.g. lag features and a time-ordered train-validation split. Identified seasonal trends and an annual decline in sales. Trained XGBClassifier, LGBMClassifier and deep neural networks. Attained a best RMSE of 1.51 on the test set.
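A sketch of lag-feature construction with a time-ordered split; the column names mirror a monthly-sales setup but the tiny example frame is purely illustrative:

```python
import pandas as pd

# Monthly sales per (shop, item), indexed by a month counter 'date_block_num'
sales = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "shop_id":        [5, 5, 5, 9, 9, 9],
    "item_id":        [101, 101, 101, 202, 202, 202],
    "item_cnt_month": [3.0, 4.0, 2.0, 1.0, 0.0, 5.0],
})

# Lag features: previous months' sales for the same shop/item pair
base = sales[["date_block_num", "shop_id", "item_id", "item_cnt_month"]]
for lag in (1, 2):
    shifted = base.copy()
    shifted["date_block_num"] += lag
    shifted = shifted.rename(columns={"item_cnt_month": f"cnt_lag_{lag}"})
    sales = sales.merge(shifted, on=["date_block_num", "shop_id", "item_id"], how="left")

# Time-ordered split: the last month is held out for validation
train = sales[sales.date_block_num < sales.date_block_num.max()]
valid = sales[sales.date_block_num == sales.date_block_num.max()]
```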
Analysed historical data quality. Took the first difference to render the time series stationary. Analysed the (partial) autocorrelation functions and fitted an AR(3) model. Also made use of lag features and fitted an LSTM model. Used fixed partitioning and rolling forecasts on both models. Attained a lowest MAE of 0.147.
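A sketch of the differencing, AR(3) fit and rolling one-step forecast using statsmodels, on a synthetic stand-in series (the real series and split point differ):

```python
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(1)
series = pd.Series(np.cumsum(rng.normal(size=300)))  # non-stationary stand-in

diff = series.diff().dropna()   # first difference to render the series stationary
plot_acf(diff)                  # inspect the autocorrelation function
plot_pacf(diff)                 # and the partial autocorrelation function

# Rolling (walk-forward) one-step forecast with an AR(3) model
split = int(len(diff) * 0.8)
history, errors = list(diff[:split]), []
for actual in diff[split:]:
    model = AutoReg(history, lags=3).fit()
    pred = model.predict(start=len(history), end=len(history))[0]
    errors.append(abs(actual - pred))
    history.append(actual)

mae = float(np.mean(errors))
```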