My team participated in the Kaggle competition "Natural Language Processing with Disaster Tweets." The competition involved building a machine learning model to predict whether a tweet refers to a real disaster or not.
Given the ubiquity of smartphones and the use of Twitter for real-time communication during emergencies, the challenge is to programmatically distinguish between tweets that genuinely indicate a disaster and those that use disaster-related words metaphorically.
My Contributions
Data cleaning: Used several Python libraries (pandas, neattext, nltk) to normalize and clean the text: fixing contractions and removing stopwords, emojis, and links. Also applied lemmatization and accounted for duplicates and retweets.
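Below is a minimal sketch of that cleaning pipeline, assuming the competition's `train.csv` with a `text` column; the exact neattext helpers and the order we chained them may have differed.

```python
import pandas as pd
import neattext.functions as nfx
import nltk
from nltk.stem import WordNetLemmatizer

# One-time NLTK resources for tokenization and lemmatization.
nltk.download("punkt")
nltk.download("wordnet")

df = pd.read_csv("train.csv")  # assumed file/column names for illustration
lemmatizer = WordNetLemmatizer()

def clean_tweet(text: str) -> str:
    text = text.lower()
    text = nfx.remove_urls(text)       # strip links
    text = nfx.remove_emojis(text)     # strip emojis
    text = nfx.remove_stopwords(text)  # drop common stopwords
    # (contraction expansion, e.g. via the `contractions` package, would slot in here)
    tokens = nltk.word_tokenize(text)
    return " ".join(lemmatizer.lemmatize(tok) for tok in tokens)

df["clean_text"] = df["text"].apply(clean_tweet)
df = df.drop_duplicates(subset="clean_text")  # collapse exact duplicates/retweets
```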
Visualizations: Used Matplotlib, Seaborn, and Plotly to create bar plots and histograms during our exploratory data analysis.
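For example, a class-balance bar plot and a tweet-length histogram (the `target` column name comes from the competition data; the rest is illustrative) could look like:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Class balance: disaster (1) vs. non-disaster (0) tweets.
sns.countplot(data=df, x="target")
plt.title("Disaster vs. non-disaster tweets")
plt.show()

# Histogram of tweet lengths in characters.
df["text"].str.len().plot(kind="hist", bins=40, title="Tweet length (characters)")
plt.xlabel("characters")
plt.show()
```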
Natural Language Processing:
Sentiment Analysis: Implemented sentiment analysis with NLTK to classify the emotional tone of each tweet as positive, negative, or neutral.
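A minimal sketch of that step, assuming NLTK's VADER analyzer and the conventional ±0.05 compound-score thresholds:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

def sentiment_label(text: str) -> str:
    score = sia.polarity_scores(text)["compound"]  # -1 (negative) to +1 (positive)
    if score >= 0.05:
        return "positive"
    if score <= -0.05:
        return "negative"
    return "neutral"

df["sentiment"] = df["clean_text"].apply(sentiment_label)
```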
Vectorization: Employed vectorization techniques from scikit-learn to convert the cleaned text into numerical features that machine learning models can consume.
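For instance, a TF-IDF vectorizer from scikit-learn (TF-IDF is shown as one reasonable choice; a plain CountVectorizer plugs in the same way), reusing the `clean_text` column from the cleaning sketch above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Turn the cleaned tweets into a sparse numeric feature matrix.
vectorizer = TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))
X = vectorizer.fit_transform(df["clean_text"])
y = df["target"]

print(X.shape)  # (number of tweets, number of n-gram features)
```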
Machine Learning: Leveraged classical machine learning algorithms from scikit-learn to train classification models on the vectorized text for the disaster/non-disaster prediction task.
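As a sketch of a classical baseline on those features (logistic regression is one representative classifier, not necessarily the exact model we settled on):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

preds = clf.predict(X_val)
print("accuracy:", accuracy_score(y_val, preds))
print("F1:", f1_score(y_val, preds))
```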
Deep Learning: Explored deep learning approaches with BERT-family models, Keras, and TensorFlow, fine-tuning them to capture patterns and nuances in the text and improve the project's overall predictive performance.
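Since the final model was DistilBERT (see Outcome), here is a minimal fine-tuning sketch; the use of the Hugging Face `transformers` library and all hyperparameters are assumptions for illustration:

```python
import numpy as np
from tensorflow.keras.optimizers import Adam
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

texts = df["clean_text"].tolist()   # columns carried over from the cleaning sketch
labels = np.array(df["target"])

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Tokenize all tweets into padded input IDs and attention masks (NumPy arrays).
enc = tokenizer(texts, truncation=True, padding=True, max_length=128, return_tensors="np")

# Recent transformers versions pick an appropriate loss when compile() gets none.
model.compile(optimizer=Adam(learning_rate=5e-5))
model.fit(dict(enc), labels, epochs=3, batch_size=16)
```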
Outcome
Using DistilBERT, we achieved 83.87% accuracy and a 79.38% F1 score, landing in the top 15% of the Kaggle competition.