This project was completed during the statistical learning course at Queen's University. The task is taken from Kaggle.
Project Intro:
Different countries have different cultures, traditions, and eating habits, and some foods can represent an entire country. Visitors to South Korea may eat traditional Korean food such as kimchi and jajangmyeon; visitors to North America may eat more hamburgers and steaks; visitors to China may eat traditional dishes such as hotpot and Beijing roast duck. Some of the strongest geographic and cultural associations are tied to a region's traditional local foods. In this project, the task is to predict the cuisine category of a dish given the list of ingredients used in it.
Data description:
The training dataset is in JSON format and has three attributes: the recipe ID (int), the cuisine type (string), and the list of ingredients (list of strings).
The test dataset has only two attributes, the recipe ID and the list of ingredients, in the same format as in the training dataset.
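As a minimal sketch, the JSON datasets described above can be loaded with pandas. The file names `train.json` and `test.json` are assumptions based on the usual Kaggle layout; here an inline sample stands in for the real file so the snippet is self-contained.

```python
import pandas as pd

# Inline stand-in for train.json; real rows have the same shape:
# {"id": ..., "cuisine": ..., "ingredients": [...]}
sample = [
    {"id": 10259, "cuisine": "greek",
     "ingredients": ["romaine lettuce", "black olives", "feta cheese"]},
    {"id": 25693, "cuisine": "southern_us",
     "ingredients": ["plain flour", "ground pepper", "salt", "eggs"]},
]
train = pd.DataFrame(sample)
# For the actual Kaggle files (hypothetical paths):
#   train = pd.read_json("train.json")
#   test  = pd.read_json("test.json")   # only id + ingredients
print(train.shape)   # (2, 3)
```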
Data Preparation & Feature Engineering:
Since both the training and test datasets mix strings with numeric values, the string data must somehow be converted into numeric features for modelling.
Remove special characters, numbers, and units (if any) from each ingredient, since these can distort the resulting numeric features.
Standardize words (convert all letters to lower case, remove hyphens, and reduce each word to its base form).
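The cleaning steps above can be sketched roughly as follows. The unit list and the exact regular expressions are illustrative assumptions, not the project's actual rules; the base-form step would use a lemmatizer (e.g. NLTK's WordNetLemmatizer) and is only noted in a comment here.

```python
import re

def clean_ingredient(raw: str) -> str:
    """Lower-case, split hyphens, and strip digits/units/special characters.

    The unit list is an illustrative assumption, not the full set used
    in the project.
    """
    s = raw.lower().replace("-", " ")
    s = re.sub(r"\d+", " ", s)                    # strip numbers
    s = re.sub(r"\b(oz|lb|g|kg|ml)\b", " ", s)    # strip a few common units
    s = re.sub(r"[^a-z ]", " ", s)                # strip special characters
    return re.sub(r"\s+", " ", s).strip()

# Word normalization (e.g. "tomatoes" -> "tomato") would follow, via a
# lemmatizer such as NLTK's WordNetLemmatizer; omitted here.
print(clean_ingredient("(10 oz.) Sun-Dried Tomatoes"))  # -> "sun dried tomatoes"
```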
Apply the TF-IDF algorithm to weight the distinctive ingredients in each recipe and convert them to numeric values; these become the recipe's features. The resulting matrix is sparse.
Since the TF-IDF matrix is sparse and high-dimensional, we use SVD to reduce the dimensionality; specifically, we apply the truncated SVD algorithm.
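A minimal sketch of the TF-IDF + truncated SVD step, using scikit-learn. Joining each recipe's ingredient list into a single "document" is an assumption about the preprocessing; the toy documents and `n_components=3` are illustrative only.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Each recipe's cleaned ingredients joined into one document (assumption).
docs = [
    "romaine lettuce black olives feta cheese",
    "plain flour ground pepper salt eggs",
    "soy sauce sesame oil rice vinegar",
    "tortillas black beans salsa",
]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)        # sparse recipe-by-term matrix
print(X.shape)

svd = TruncatedSVD(n_components=3, random_state=0)
X_reduced = svd.fit_transform(X)     # dense low-dimensional features
print(X_reduced.shape)               # (4, 3)
```

TruncatedSVD works directly on the sparse matrix without densifying it, which is why it suits large TF-IDF matrices.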
Modelling:
Logistic Regression with OVR
Support Vector Machine with OVR and a linear decision boundary (linear chosen due to hardware limitations)
Random Forest
Boosting and voting classifier
Neural network (built in Keras with three hidden layers and an output layer, trained via Google Colab)
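The scikit-learn models in the list above can be instantiated roughly as below. This is a sketch on synthetic stand-in data, not the project's actual configuration; hyper-parameters are placeholders, and because `LinearSVC` has no `predict_proba`, it is left out of the soft-voting ensemble here (an assumption about how the voter was built).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))        # stand-in for the truncated-SVD features
y = rng.integers(0, 3, size=40)     # stand-in for encoded cuisine labels

models = {
    "logreg_ovr": OneVsRestClassifier(LogisticRegression(max_iter=1000)),
    "svm_ovr": OneVsRestClassifier(LinearSVC()),    # linear boundary only
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
    "voting": VotingClassifier(
        [("lr", LogisticRegression(max_iter=1000)),
         ("rf", RandomForestClassifier(n_estimators=100, random_state=0))],
        voting="soft",
    ),
}

for name, model in models.items():
    model.fit(X, y)
    print(name, model.score(X, y))   # training accuracy, for illustration
```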
Hyper-parameter tuning was done via GridSearchCV in Python. Results for Random Forest and the boosting & voting classifiers are not reported due to low accuracy on the validation set.
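A GridSearchCV run looks roughly like this. The grid below (regularization strength `C` for logistic regression) and the synthetic data are illustrative assumptions; the project's actual search space is not documented here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 5))        # stand-in for the truncated-SVD features
y = rng.integers(0, 3, size=60)     # stand-in for encoded cuisine labels

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},   # illustrative grid
    cv=3,
    scoring="accuracy",
)
grid.fit(X, y)
print(grid.best_params_)
```

GridSearchCV refits the best model on the full training data by default, so `grid` can be used directly for prediction afterwards.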
The code for this project can be found here. You can cite it as follows:
@article{mz2021,
title = "Kaggle Competition: What's Cooking",
author = "Zhou, Meng",
year = "2021",
url = "https://sites.google.com/view/mengzhou/project/kaggle-project-1"
}
Thanks for reading!