Data Analysis

Objective

The objective of the analysis was to classify the emotion of tweets into the following five classes:

['Neutral','Happy','Sad','Love','Anger']

Dataset

The dataset comprised 55,775 tweets with 13 labels:

"empty" # neutral

"sadness" # sad

"enthusiasm" # happy

"neutral" # neutral

"worry" # sad

"surprise" # happy

"love" # love

"fun" # happy

"hate" # anger

"happiness" # happy

"boredom" # neutral

"relief" # happy

"anger" # anger

These 13 labels were merged into the 5 target classes, as indicated by the comment next to each label.
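The merge can be written as a simple lookup table. The sketch below copies the mapping from the list above; the names `LABEL_MAP` and `merge_label` are hypothetical and not taken from the original code.

```python
# Mapping from the 13 original labels to the 5 target classes,
# copied from the label list in this section.
LABEL_MAP = {
    "empty": "Neutral",
    "sadness": "Sad",
    "enthusiasm": "Happy",
    "neutral": "Neutral",
    "worry": "Sad",
    "surprise": "Happy",
    "love": "Love",
    "fun": "Happy",
    "hate": "Anger",
    "happiness": "Happy",
    "boredom": "Neutral",
    "relief": "Happy",
    "anger": "Anger",
}


def merge_label(original_label: str) -> str:
    """Return the merged class for one of the 13 original labels."""
    return LABEL_MAP[original_label.strip().lower()]


print(merge_label("worry"))  # Sad
```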

The data was then split into 44,620 training tweets and 11,155 validation tweets.
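These counts correspond to an 80/20 split of the 55,775 tweets. A minimal sketch of such a split follows; the random shuffle, the seed, and the helper name `split_train_validation` are assumptions, since the original splitting procedure is not described.

```python
import random


def split_train_validation(items, seed=0):
    """Shuffle and split items 80/20 (hypothetical helper; shuffle and seed are assumptions)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 4 // 5  # integer arithmetic avoids float rounding
    return shuffled[:n_train], shuffled[n_train:]


# Stand-in for the 55,775 tweets: only the indices matter here.
train, validation = split_train_validation(range(55_775))
print(len(train), len(validation))  # 44620 11155
```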

Model

We tested the following models:

Model 1: Multinomial Naive Bayes Classifier - Accuracy 38.37%

Model 2: Linear SVM - Accuracy 38.49%

Model 3: Logistic Regression - Accuracy 40.13%

Model 4: Bidirectional LSTMs - Accuracy 62.83%

Based on these results, we chose Bidirectional LSTMs for our objective.

Embedding

Embeddings are numerical vector representations of words that capture the relationships between words.

We used the GloVe Twitter 200D embeddings (1.2GB) with 50k words.
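As an illustration of how pretrained vectors are typically turned into a lookup matrix, the sketch below parses GloVe-style text lines (a word followed by its space-separated values) and fills one matrix row per vocabulary word. It uses toy 3-dimensional vectors rather than the real 200D file, and the helper names, the `word_index` layout, and the zero row for unknown words are assumptions rather than details of the original code.

```python
import io

import numpy as np


def load_glove(lines, dim):
    """Parse GloVe-style text lines ("word v1 ... v_dim") into a dict of vectors."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed lines
        vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors


def build_embedding_matrix(word_index, vectors, dim):
    """Row i holds the vector of the word with index i; unknown words stay zero."""
    # Row 0 is reserved for padding (an assumption of this sketch).
    matrix = np.zeros((len(word_index) + 1, dim), dtype=np.float32)
    for word, idx in word_index.items():
        vec = vectors.get(word)
        if vec is not None:
            matrix[idx] = vec
    return matrix


# Toy 3-dimensional vectors instead of the real 200-dimensional file.
toy_file = io.StringIO("happy 0.1 0.2 0.3\nsad -0.1 -0.2 -0.3\n")
vectors = load_glove(toy_file, dim=3)
word_index = {"happy": 1, "sad": 2, "unseen": 3}  # hypothetical vocabulary
embedding_matrix = build_embedding_matrix(word_index, vectors, dim=3)
print(embedding_matrix.shape)  # (4, 3)
```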

Bidirectional LSTMs

Bidirectional LSTMs capture context from both directions of a sequence, which led to better accuracy in our tests. This method uses two LSTMs: one is trained on the input in its original order, while the other is trained on the reversed input.
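The two-direction idea can be sketched in plain NumPy: one LSTM reads the embedded tweet in order, a second reads it reversed, and their final hidden states are concatenated. This is an illustration of the mechanism only, with random, untrained weights. The hidden size of 32, the 6-token input, and all function names are assumptions, and the classification layer over the five classes is omitted.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random, untrained parameters for one LSTM (illustration only)."""
    return {
        "W": rng.normal(0, 0.1, (4 * hidden_dim, input_dim)),   # input weights
        "U": rng.normal(0, 0.1, (4 * hidden_dim, hidden_dim)),  # recurrent weights
        "b": np.zeros(4 * hidden_dim),
    }


def lstm_last_state(params, sequence):
    """Run one LSTM over a (time, input_dim) sequence; return the final hidden state."""
    hidden_dim = params["U"].shape[1]
    h = np.zeros(hidden_dim)
    c = np.zeros(hidden_dim)
    for x in sequence:
        z = params["W"] @ x + params["U"] @ h + params["b"]
        i, f, o, g = np.split(z, 4)  # input, forget, output gates and candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    return h


def bidirectional_encode(fwd, bwd, sequence):
    """Concatenate the final states of a forward and a backward LSTM."""
    h_forward = lstm_last_state(fwd, sequence)         # normal input order
    h_backward = lstm_last_state(bwd, sequence[::-1])  # reversed input order
    return np.concatenate([h_forward, h_backward])


rng = np.random.default_rng(0)
embedded_tweet = rng.normal(size=(6, 200))  # 6 tokens, 200D embeddings
fwd = init_lstm(200, 32, rng)
bwd = init_lstm(200, 32, rng)
features = bidirectional_encode(fwd, bwd, embedded_tweet)
print(features.shape)  # (64,)
```

In practice a deep learning framework's built-in bidirectional LSTM layer would replace this hand-written loop. The sketch only shows why the output has twice the hidden size: each direction contributes its own state.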

The model was trained for 10 epochs.
