I am interested in extracting useful knowledge and information from the ocean of data, be it in the form of text, audio, equations or images. By combining data, the scientific method and analytical thinking, I aspire to help people make well-informed decisions in all aspects of life. By helping advance data science and artificial intelligence, and advocating for their responsible and equitable use, I hope to help raise the quality of life for all.
See the CV page for my relevant skill set.
(Naturally, I have done much more on the job since then.)
FastAPI, Streamlit
GitHub repo: here.
We provide an end-to-end example based on the 'Disaster Tweet Classification' task, which classifies a tweet (a text string) as describing a real disaster or not. Previously, we performed an exploratory data analysis and ran model architecture and hyperparameter searches in a separate repo.
In this repo, we implement a) a training pipeline, b) a FastAPI inference backend, and c) an example Streamlit app that queries the FastAPI backend.
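A minimal sketch of how the two pieces fit together (the route name, payload schema, model path and port are illustrative assumptions, not the repo's actual code):

```python
# backend.py -- minimal FastAPI inference endpoint (illustrative sketch)
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
model = joblib.load("model.joblib")  # hypothetical serialised text-classification pipeline

class Tweet(BaseModel):
    text: str

@app.post("/predict")
def predict(tweet: Tweet):
    # the pipeline is assumed to map raw text to a 0/1 disaster label
    label = int(model.predict([tweet.text])[0])
    return {"is_disaster": bool(label)}
```

```python
# app.py -- Streamlit front end querying the FastAPI backend (URL assumed)
import requests
import streamlit as st

st.title("Disaster Tweet Classifier")
text = st.text_area("Tweet text")
if st.button("Classify") and text:
    resp = requests.post("http://localhost:8000/predict", json={"text": text})
    st.write("Disaster" if resp.json()["is_disaster"] else "Not a disaster")
```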
PySpark
GitHub repo: here.
We provide an end-to-end PySpark example based on the 'Backblaze hard drive failure prediction' task, in which we predict from a hard drive's S.M.A.R.T. telemetry whether it has failed. The goal of this exercise is to showcase the use of PySpark for data science tasks on a real dataset: from EDA, data cleaning and preprocessing, to feature engineering and scaling, to ML model training and inference.
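A minimal sketch of such a PySpark pipeline, assuming a handful of raw S.M.A.R.T. columns and a random-forest classifier (the column names and model choice are illustrative, not the repo's exact pipeline):

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier

spark = SparkSession.builder.appName("backblaze-smart").getOrCreate()

# Raw daily S.M.A.R.T. snapshots; 'failure' is the 0/1 label (paths/columns assumed)
df = spark.read.csv("backblaze/*.csv", header=True, inferSchema=True)
df = df.dropna(subset=["smart_5_raw", "smart_187_raw", "smart_197_raw"])

assembler = VectorAssembler(
    inputCols=["smart_5_raw", "smart_187_raw", "smart_197_raw"],
    outputCol="raw_features",
)
scaler = StandardScaler(inputCol="raw_features", outputCol="features")
clf = RandomForestClassifier(labelCol="failure", featuresCol="features")

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = Pipeline(stages=[assembler, scaler, clf]).fit(train)
preds = model.transform(test).select("failure", "prediction")
```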
Implemented efficient memory management and workflow optimisation to process a large dataset (55M instances) on a local machine. Engineered new features, e.g. geographical information including boroughs, zip codes and the geodesic distance of each trip. Optimised tree-based and neural network models; attained the best test RMSE of 2.96 with LGBMRegressor. Found the test predictions closely following the training set distribution, with peaks attributed to the flat rates to/from the airports. Performed baseline model evaluation on Google Cloud.
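As a rough illustration of the geodesic distance feature and the gradient-boosted model, a sketch along these lines (column names follow the public NYC taxi fare dataset; the file name, sample size and hyperparameters are assumptions):

```python
import pandas as pd
from geopy.distance import geodesic
from lightgbm import LGBMRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load a sample of trips (file name and sample size are placeholders)
df = pd.read_csv("train.csv", nrows=100_000, parse_dates=["pickup_datetime"])

# Geodesic (great-circle) trip distance in kilometres
df["trip_km"] = df.apply(
    lambda r: geodesic(
        (r.pickup_latitude, r.pickup_longitude),
        (r.dropoff_latitude, r.dropoff_longitude),
    ).km,
    axis=1,
)
df["hour"] = df.pickup_datetime.dt.hour

X = df[["trip_km", "hour", "passenger_count"]]
y = df["fare_amount"]
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LGBMRegressor(n_estimators=500, learning_rate=0.05).fit(X_tr, y_tr)
rmse = mean_squared_error(y_val, model.predict(X_val)) ** 0.5
```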
Cleaned up the dataset and engineered features, e.g. zip codes and datetime features. Optimised models (k-nearest neighbours, tree-based models and dense neural networks), evaluated by validation AUC; best performance attained by RandomForestClassifier. Predicted higher compliance rates in the validation set than in the training set. Fitted the monthly aggregated compliance rate with an AR(1) model. Found dimensionality reduction detrimental to model performance.
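A small sketch of the AR(1) step on a monthly aggregated rate, using statsmodels; the synthetic ticket-level data here is a stand-in for the real dataset:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.ar_model import AutoReg

# Illustrative stand-in: one row per record with an issue date and a 0/1 compliance flag
rng = np.random.default_rng(0)
dates = pd.date_range("2015-01-01", "2017-12-31", freq="D")
tickets = pd.DataFrame({
    "issue_date": rng.choice(dates, size=50_000),
    "compliant": rng.integers(0, 2, size=50_000),
})

# Monthly aggregated compliance rate, then an AR(1) fit and a one-step forecast
monthly = tickets.set_index("issue_date")["compliant"].resample("M").mean()
ar1 = AutoReg(monthly, lags=1).fit()
forecast = ar1.predict(start=len(monthly), end=len(monthly))
```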
Performed text pre-processing and tokenisation, and engineered new features. Utilised NLP techniques: TfidfVectorizer (scikit-learn), an LSTM with GloVe embedding vectors, and a Hugging Face BERT transformer. Found the optimal LSTM architecture using Keras Tuner. Used both text vectors and other engineered features. Found similar F1 test scores across all approaches, with the best score (0.808) from the transformer network. Deployed the model using Flask.
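For illustration, a TF-IDF baseline of the kind described might look like this (the classifier choice, parameters and the two placeholder tweets are assumptions, not the actual setup):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Placeholder training data: 1 = disaster, 0 = not a disaster
train_texts = ["Forest fire near La Ronge Sask. Canada", "I love fruits"]
train_labels = [1, 0]

# TF-IDF text vectors fed into a simple linear classifier
clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1),
    LogisticRegression(max_iter=1000),
).fit(train_texts, train_labels)

preds = clf.predict(["There is a flood warning in our area"])
```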
Predicted the sales of ~10k items across different shops. Performed time-aware feature engineering, e.g. lag features and a time-ordered train-validation split. Identified seasonal trends and an annual decline in sales. Trained XGBClassifier, LGBMClassifier and deep neural networks. Attained a best RMSE of 1.51 on the test set.
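A sketch of lag-feature construction with a time-ordered split; the column names mirror a monthly-sales setup but the tiny example frame is purely illustrative:

```python
import pandas as pd

# Monthly sales per (shop, item), indexed by a month counter 'date_block_num'
sales = pd.DataFrame({
    "date_block_num": [0, 1, 2, 0, 1, 2],
    "shop_id":        [5, 5, 5, 9, 9, 9],
    "item_id":        [101, 101, 101, 202, 202, 202],
    "item_cnt_month": [3.0, 4.0, 2.0, 1.0, 0.0, 5.0],
})

# Lag features: previous months' sales for the same shop/item pair
base = sales[["date_block_num", "shop_id", "item_id", "item_cnt_month"]]
for lag in (1, 2):
    shifted = base.copy()
    shifted["date_block_num"] += lag
    shifted = shifted.rename(columns={"item_cnt_month": f"cnt_lag_{lag}"})
    sales = sales.merge(shifted, on=["date_block_num", "shop_id", "item_id"], how="left")

# Time-ordered split: the last month is held out for validation
train = sales[sales.date_block_num < sales.date_block_num.max()]
valid = sales[sales.date_block_num == sales.date_block_num.max()]
```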
Analysed historical data quality. Took the first difference to render the time series stationary. Analysed the (partial) autocorrelation functions and fitted an AR(3) model. Also made use of lag features and fitted an LSTM model. Used fixed partitioning and rolling forecasts on both models. Attained a lowest MAE of 0.147.
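A sketch of the differencing, AR(3) fit and rolling one-step forecast using statsmodels, on a synthetic stand-in series (the real series and split point differ):

```python
import numpy as np
import pandas as pd
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(1)
series = pd.Series(np.cumsum(rng.normal(size=300)))  # non-stationary stand-in

diff = series.diff().dropna()   # first difference to render the series stationary
plot_acf(diff)                  # inspect the autocorrelation function
plot_pacf(diff)                 # and the partial autocorrelation function

# Rolling (walk-forward) one-step forecast with an AR(3) model
split = int(len(diff) * 0.8)
history, errors = list(diff[:split]), []
for actual in diff[split:]:
    model = AutoReg(history, lags=3).fit()
    pred = model.predict(start=len(history), end=len(history))[0]
    errors.append(abs(actual - pred))
    history.append(actual)

mae = float(np.mean(errors))
```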