The Emotionizer

Team Speak Up: Sama Karim, Shreya Tangirala, Excellence Herbert, Elaina Hill, Maryam Lim-Baig, Adithi Vardhan

Faculty: Puneet Mathur

AI4ALL Facilitators: Renee Sen, Maryann Vazhapilly

Project overview

Humans are able to detect emotions in another's speech through tone, pitch, and variation. However, most AI products on the market, such as Alexa and Siri, cannot yet do the same. Our project focused on speech emotion recognition: the ability of an AI model to recognize emotions in a speech sample.

Project question

Can students develop a Convolutional Neural Network that predicts the emotion present in a speech sample?

What is speech emotion recognition?

Speech emotion recognition (SER) is the task of identifying human emotions from speech samples. A convolutional neural network extracts features from the input audio, such as tone, pitch, and frequency, and learns patterns that map those features to emotions. An underlying challenge of SER is that emotion is subjective, so some labels, and therefore some predictions, may be inaccurate.

Action steps

We first analyzed, cleaned, normalized, and prepared the audio data using several Python libraries, including Librosa.
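As an illustration of the normalization step, here is a minimal sketch in NumPy. The synthetic sine-wave clip and the `peak_normalize` helper are stand-ins for illustration; in practice we loaded real audio files with `librosa.load`, which returns a floating-point waveform and its sample rate.

```python
import numpy as np

def peak_normalize(waveform: np.ndarray) -> np.ndarray:
    """Scale a waveform so its largest absolute sample is 1.0."""
    peak = np.max(np.abs(waveform))
    if peak == 0:
        return waveform
    return waveform / peak

# Synthetic stand-in for an audio clip (real clips came from librosa.load).
sr = 22050                                   # samples per second
t = np.linspace(0, 1, sr, endpoint=False)    # one second of time stamps
audio = 0.3 * np.sin(2 * np.pi * 440 * t)    # quiet 440 Hz tone

clean = peak_normalize(audio)
print(float(np.max(np.abs(clean))))  # -> 1.0
```

Normalizing every clip to the same peak level keeps loudness differences between recordings from being mistaken for emotional cues.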


As part of data processing, we converted our data to time series (so that we could extract audio features) and then created spectrograms to visualize the audio data.
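The spectrogram step can be sketched with a plain NumPy short-time Fourier transform (Librosa provides this and mel-scaled variants out of the box; this hand-rolled version just shows the idea). The sample rate, FFT size, hop length, and test tone below are illustrative choices, not the project's actual settings.

```python
import numpy as np

def magnitude_spectrogram(x, n_fft=512, hop=256):
    """Slice a signal into overlapping windowed frames and take the
    magnitude of each frame's FFT: one spectrogram column per frame."""
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(x) - n_fft + 1, hop):
        frame = x[start:start + n_fft] * window
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames).T  # shape: (n_fft // 2 + 1, n_frames)

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
signal = np.sin(2 * np.pi * 1000 * t)  # pure 1000 Hz tone

spec = magnitude_spectrogram(signal)
print(spec.shape)  # (257, 61): frequency bins x time frames

# The tone shows up as a bright row: 1000 Hz maps to bin 1000*512/16000 = 32.
print(int(np.argmax(spec[:, 0])))  # -> 32
```

Stacking frames this way turns a 1-D waveform into a 2-D frequency-vs-time image, which is exactly the kind of input a CNN is built to scan for patterns.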

We then created a convolutional neural network (CNN) and trained the model on the training data. Afterwards, we tested our model on the test data and evaluated it using visual representations (graphs and plots) and confusion matrices.
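A CNN along these lines can be sketched in Keras. The input shape, layer sizes, and the choice of eight emotion classes below are assumptions for illustration, not the team's exact architecture.

```python
# Minimal sketch of a CNN emotion classifier for spectrogram input,
# assuming 8 emotion classes and 128x128 single-channel spectrograms
# (both are illustrative assumptions).
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

num_emotions = 8  # assumption, e.g. eight emotion labels

model = keras.Sequential([
    layers.Input(shape=(128, 128, 1)),          # spectrogram "image"
    layers.Conv2D(16, 3, activation="relu"),    # learn local time-frequency patterns
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(num_emotions, activation="softmax"),  # one probability per emotion
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# One forward pass on a dummy batch to check the output shape.
dummy = np.random.rand(2, 128, 128, 1).astype("float32")
probs = model.predict(dummy, verbose=0)
print(probs.shape)  # (2, 8): one probability distribution per sample
```

Training would then be a call like `model.fit(X_train, y_train, batch_size=..., epochs=...)`, with the batch size and epoch count among the knobs available for tuning.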

Results

The model reached an accuracy of 80% on the training set, while accuracy on the test set peaked at 55%. Although the training accuracy was consistent most of the time, the testing accuracy lagged behind it, a gap that suggests some overfitting. We could potentially narrow this gap by altering values or adding more layers in our CNN, and by tuning the batch size and number of epochs to get better results and higher accuracy.


In the loss graph (pictured on the left), the loss decreased steeply for the training set and followed a shallower but still downward trend for the test set. This is another positive indication for our model.


A confusion matrix is used to visualize the accuracy and precision of an algorithm or neural network. Each cell counts data points by their actual and predicted labels, so correct predictions land on the diagonal.
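A toy example with scikit-learn's `confusion_matrix` shows the idea; the three emotion labels and the actual/predicted lists below are made up for illustration.

```python
# Toy confusion matrix for three made-up emotion labels: rows are the
# actual labels, columns are the predicted labels, and correct
# predictions land on the diagonal.
from sklearn.metrics import confusion_matrix

labels = ["calm", "happy", "sad"]
actual    = ["calm", "calm", "happy", "sad", "sad", "sad"]
predicted = ["calm", "sad",  "happy", "sad", "calm", "sad"]

cm = confusion_matrix(actual, predicted, labels=labels)
print(cm)
# [[1 0 1]
#  [0 1 0]
#  [1 0 2]]

# The diagonal sum over the total gives the accuracy: 4 correct of 6.
print(int(cm.trace()), "of", int(cm.sum()))  # -> 4 of 6
```

Off-diagonal cells pinpoint which pairs of emotions the model confuses, which is exactly how we read our own matrix below.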


Since the diagonal of our confusion matrix (pictured on the right) is the darkest region and holds most of the data points, our model made mostly correct predictions.


The confusion matrix also reveals other trends, including that the neural network had trouble differentiating surprised from fearful voices and sad from calm voices.



Learn more about Team Speak Up's presentation here:

AI4ALL SPEECH

Thank you for visiting our site! Please fill out our feedback form if you have any thoughts: