Sentiment

Team: Debanshi Misra, Massai Morgan, Rachel Zhao, Sai Hruthika Naraharisetti, Vishesh Narayan

Faculty Advisor: Prof. Dinesh Manocha

Graduate Students: Pooja Guhan, Rohan Chandra, Trisha Mittal

AI4ALL Facilitators: Elaine Gao, Harpreet Multani


Project Question:

Given a tweet from a user, can we predict the sentiment expressed in the tweet?

[Figure: word clouds of words that predict a negative sentiment and words that predict a positive sentiment]

Project Overview

In this project, we explore different techniques for analyzing sentiments expressed in Twitter data. An increasing number of people use online social platforms like Twitter, Facebook, and Reddit to express opinions about topics, products, and events. Automatically extracting sentiment from text is useful for many applications: for instance, platforms can personalize users' feeds based on their opinions on a variety of topics, and marketing agencies can use sentiment analysis to gauge the public's view of a company or its products. Throughout this project, we applied standard data-processing techniques to real-world data to develop and train various learning models, ultimately reaching a conclusion about the most effective way to analyze sentiment.

Action Steps


  1. Dataset Exploration

We first spent time understanding our dataset: the Sentiment140 dataset, obtained from Kaggle, which contains information from 1.6 million tweets.

Each data point in the dataset consists of the tweet text and an associated sentiment label: negative, neutral, or positive.


2. Dataset Processing

As part of pre-processing and dataset cleaning, we performed the following four operations:

(a) converted all text to lowercase,

(b) removed stop-words,

(c) performed stemming on the data, and

(d) removed punctuation marks from all tweets in the dataset.

To perform these operations, we explored various functions from two libraries: NLTK (the Natural Language Toolkit) and Python's built-in string module. We came to appreciate the importance of cleaning and processing a dataset before building learning models; a sketch of this cleaning pipeline follows.
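
Below is a minimal sketch of such a cleaning function, assuming NLTK's English stop-word list and Porter stemmer; the exact functions and ordering we used may have differed.

```python
import string

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

# One-time downloads: nltk.download('stopwords'); nltk.download('punkt')
stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean_tweet(text):
    # (a) convert to lowercase
    text = text.lower()
    # (d) strip punctuation marks
    text = text.translate(str.maketrans('', '', string.punctuation))
    # (b) drop stop-words, then (c) stem the remaining tokens
    tokens = word_tokenize(text)
    return ' '.join(stemmer.stem(tok) for tok in tokens if tok not in stop_words)

print(clean_tweet("I absolutely LOVED the new movie!!!"))  # -> "absolut love new movi"
```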

3. Feature Extraction

After cleaning the dataset, the next step is to convert it into a form that can serve as input to the learning models.

For this, we learnt and coded two feature extraction methods.

(a) Bag-Of-Words Model

A bag of words is a representation of text that describes the occurrence of words within a document: we simply keep track of word counts. It is called a "bag" of words because any information about the order or structure of the words is discarded; the model is only concerned with whether known words occur in the document, not where.

(b) Term Frequency-Inverse Document Frequency (TF-IDF) Model


This feature extraction method builds on top of the Bag-of-Words model. We still count the frequency of words in a document, but we now also weigh each word by how informative it is. The underlying idea is to give less importance to words that are common across different documents/tweets and more importance to words that are rarer.
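
In a common formulation (libraries such as sklearn add smoothing terms by default), the score for a term t in a document d is

\[ \text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)} \]

where tf(t, d) is the number of times t appears in d, N is the total number of documents, and df(t) is the number of documents that contain t.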


To implement both of these models, we used the sklearn library, as sketched below.
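
A minimal sketch of both vectorizers on a toy corpus (the example tweets here are illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

tweets = ["love this phone", "hate this phone", "love love love it"]

# Bag-of-Words: raw word counts per tweet
bow = CountVectorizer()
print(bow.fit_transform(tweets).toarray())
print(bow.get_feature_names_out())  # the learned vocabulary

# TF-IDF: counts re-weighted to down-weight words shared across tweets
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(tweets).toarray().round(2))
```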

4. Machine Learning Models

We implemented three different classical machine learning models; a training sketch follows the list below.


(a) Logistic Regression

A linear model that predicts the probability of a data point belonging to each class.


(b) Naive Bayes

Based on Bayes' rule; uses conditional probabilities, with a "naive" assumption that features are independent, to predict the most likely label for each data point.


(c) Support Vector Machines

A linear classification model based on the geometry and structure of the training data: it finds a maximum-margin hyperplane separating the classes.
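
Below is a minimal sketch of training all three classifiers on TF-IDF features, with a toy dataset standing in for the cleaned Sentiment140 tweets:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Toy stand-ins for the cleaned tweets and their sentiment labels
texts = ["love new phone", "worst day ever", "great movi", "hate traffic",
         "best coffe ever", "feel terribl today", "amaz concert", "awful servic"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = positive, 0 = negative

X = TfidfVectorizer().fit_transform(texts)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, random_state=0)

for name, model in [("Logistic Regression", LogisticRegression(max_iter=1000)),
                    ("Naive Bayes", MultinomialNB()),
                    ("Linear SVM", LinearSVC())]:
    model.fit(X_train, y_train)
    print(name, accuracy_score(y_test, model.predict(X_test)))
```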

5. Deep Learning Models

Finally, we familiarised ourselves with the deep learning environment PyTorch and built two models.

(a) Recurrent Neural Network (RNN)

This is a sequential neural network, which works well for ordered data like the words in a sentence. We built the model as shown below and trained it for 100 epochs, experimenting with different hyperparameters until we obtained a stable accuracy and a steadily decreasing loss.
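
A minimal sketch of such an RNN classifier in PyTorch (the layer sizes here are illustrative, not our exact configuration):

```python
import torch
import torch.nn as nn

class SentimentRNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.RNN(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):             # token_ids: (batch, seq_len)
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, hidden = self.rnn(embedded)        # final hidden state: (1, batch, hidden_dim)
        return self.fc(hidden.squeeze(0))     # class scores: (batch, num_classes)

model = SentimentRNN(vocab_size=10000)
dummy_batch = torch.randint(0, 10000, (32, 40))  # 32 tweets, 40 tokens each
print(model(dummy_batch).shape)                  # torch.Size([32, 2])
```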

(b) Long Short Term Memory (LSTM) Network

An LSTM is a specialized recurrent network that handles long-term dependencies better: the model learns which parts of its memory to keep and use for prediction, rather than carrying the entire sentence.
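
The same sketch with the recurrent layer swapped for an LSTM (again, sizes are illustrative):

```python
import torch
import torch.nn as nn

class SentimentLSTM(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_classes)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)
        _, (hidden, _) = self.lstm(embedded)  # an LSTM also carries a cell state
        return self.fc(hidden.squeeze(0))

model = SentimentLSTM(vocab_size=10000)
print(model(torch.randint(0, 10000, (32, 40))).shape)  # torch.Size([32, 2])
```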

Results

We summarize our findings here. In the table below, we present our results for the three classical machine learning algorithms that we learnt and coded.

And, finally, in this table we summarize the results of our deep learning models, the RNN and the LSTM, along with the hyperparameters we used to train them.

We obtained accuracies of 41% and 86%, respectively.

Presentation Slides

Sentiment Presentation