The Spectrograms

Team Members: Taylor Brown, Gaby Gutierrez, Isabela Magnoni, Kevin Si, Whitley Shields, and George Xie

Faculty: Dinesh Manocha, Puneet Mathur

I4C Teaching Assistant: Ricky Li

RAVDESS Dataset Overview

Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS)
24 professional voice actors, 12 male, 12 female.
Each actor made 60 recordings, each portraying one of eight emotions.
Every recording is one of two statements: "Kids are talking by the door" and "Dogs are sitting by the door".

Project Question

Given audio samples of a person speaking, can a neural network identify which of eight emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised) is portrayed in the sample?

Action Steps

Cleaning Data:

In order to make our data more usable, we needed to clean it.

Removing NaN

NaN: "Not a Number", a placeholder that marks a missing or invalid value.
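As a minimal sketch of this step in plain Python (not necessarily the code used in the project), assume each recording has already been turned into a list of numeric features. The `rows` data and the `drop_nan_rows` helper are made up for illustration.

```python
import math

def drop_nan_rows(rows):
    """Keep only the rows in which every value is a real number."""
    return [row for row in rows if not any(math.isnan(v) for v in row)]

# Hypothetical feature rows; the second one contains a NaN.
rows = [[0.5, 1.2], [float("nan"), 0.3], [2.0, 4.1]]
clean_rows = drop_nan_rows(rows)
print(clean_rows)  # [[0.5, 1.2], [2.0, 4.1]]
```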

Standardization
Subtract mean, divide by standard deviation.

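Standardization: Rescaling the data so that it has a mean of 0 and a standard deviation of 1, which puts it in a range that is easier for the network to use.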
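The "subtract mean, divide by standard deviation" rule can be sketched in a few lines of plain Python. The `standardize` helper and the sample values are assumptions for illustration only.

```python
import statistics

def standardize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        # All values are identical, so avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

values = [2.0, 4.0, 6.0]  # made-up feature values
scaled = standardize(values)
print(scaled)
```

After this step the values are centered on 0 and have a standard deviation of 1.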

Train-test split

We set aside part of the data so that we could test the algorithm on samples it had not seen during training.
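A minimal sketch of a train-test split in plain Python follows. The 20% test fraction, the fixed seed, and the `train_test_split` helper name are assumptions for illustration; the split we actually used is not stated here.

```python
import random

def train_test_split(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples, then hold out a fraction of them for testing."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled samples become the test set.
    return shuffled[n_test:], shuffled[:n_test]

samples = list(range(10))  # stand-ins for 10 recordings
train, test = train_test_split(samples)
print(len(train), len(test))  # 8 2
```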

Conversion to a matrix

Matrix: A grid of numbers arranged in rows and columns. Ours are square, so length = width (L = W).
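One simple way to picture the conversion is cutting a long list of numbers into rows of equal width. The sketch below assumes a flat list of 16 feature values and a hypothetical `to_matrix` helper; the real feature sizes are not given here.

```python
def to_matrix(values, width):
    """Cut a flat list into rows of `width` values each."""
    if len(values) % width != 0:
        raise ValueError("the number of values must be a multiple of the width")
    return [values[i:i + width] for i in range(0, len(values), width)]

values = list(range(16))         # made-up feature values
matrix = to_matrix(values, 4)    # 4 rows of 4 values
print(len(matrix), len(matrix[0]))  # 4 4
```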

About Our Neural Network

Neural networks are a type of machine learning algorithm inspired by the way our brains work. Our brains are made up of many neurons, each receiving input from other neurons and using those inputs to compute an output. Similarly, a neural network is made up of layers, each of which is made up of many neurons. Each neuron receives activation strengths from its inputs in the previous layer and computes an output that it then sends to the next layer.

Each connection has a weight, which determines how strongly one neuron affects another. These weights are adjusted during training to optimize the model.
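To make the idea of weights concrete, here is a minimal sketch of a single neuron in plain Python. The sigmoid activation, the bias term, and all the numbers are assumptions chosen for illustration, not details of our model.

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs, passed through a sigmoid activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

inputs = [1.0, 0.5]     # outputs of two neurons in the previous layer
weights = [0.8, -0.4]   # one weight per connection (made-up values)
out = neuron_output(inputs, weights, bias=0.0)
print(out)
```

A larger weight lets that input push the output up or down more strongly; training adjusts these numbers.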

Terminology: The input layer is the layer where the data is fed in. Similarly, the output layer is where the results come out. The layers between them are called hidden layers.

We used 5 types of layers:

Generic: Takes in a list of inputs and returns a list of outputs. This is the most common type of layer.

Dropout: Randomly turns off a certain percentage of neurons from the previous layer. This is used to reduce overfitting.

Convolutional: Given a matrix, returns a list of smaller matrices. Each neuron looks at a certain patch in the input matrix. (In the example above, a 3x3 matrix was passed into 4 neurons, each one looking at its own 2x2 patch.)

Max Pool: Given a matrix, splits it into patches and keeps only the largest value in each patch, producing a smaller matrix (a single patch gives a 1x1 result). Used for simplifying information and reducing overfitting.

Flat: Given a matrix, outputs a list containing the same values. Used to turn matrices into the list format that Generic layers expect.
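The convolutional, max pool, and flat layers can be sketched on a tiny 3x3 matrix in plain Python. This is an illustration under assumptions: the `kernel` weights are made up (a real layer learns them and usually has many such filters), and max pooling is shown for a single patch.

```python
def patches_2x2(matrix):
    """Return every 2x2 patch of a matrix, sliding one step at a time."""
    rows, cols = len(matrix), len(matrix[0])
    return [
        [matrix[r][c:c + 2], matrix[r + 1][c:c + 2]]
        for r in range(rows - 1)
        for c in range(cols - 1)
    ]

def convolve_2x2(matrix, kernel):
    """Each output neuron multiplies its own patch by the kernel and sums."""
    width = len(matrix[0]) - 1
    values = [
        sum(p[i][j] * kernel[i][j] for i in range(2) for j in range(2))
        for p in patches_2x2(matrix)
    ]
    # Regroup the flat list of neuron outputs into rows.
    return [values[k:k + width] for k in range(0, len(values), width)]

def max_pool_patch(patch):
    """Largest value found in one patch."""
    return max(max(row) for row in patch)

def flatten(matrix):
    """Turn a matrix into a plain list, row by row."""
    return [v for row in matrix for v in row]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]  # hypothetical weights

feature_map = convolve_2x2(image, kernel)
print(feature_map)                  # [[6, 8], [12, 14]]
print(max_pool_patch(feature_map))  # 14
print(flatten(feature_map))         # [6, 8, 12, 14]
```

The 3x3 input yields four 2x2 patches, so four neurons each produce one number.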

The diagram below is a simplified view of our model. Dropout percentage = 40%.

Input: A matrix representing the audio.

Output: 8 numbers, one per emotion, each giving the algorithm's certainty that the audio contains that emotion.
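One common way to turn a network's raw outputs into certainties is the softmax function, sketched below in plain Python. Whether our model used softmax is an assumption, the `scores` are made up, and the order of `EMOTIONS` simply follows the list in the project question.

```python
import math

EMOTIONS = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

def softmax(scores):
    """Turn raw scores into positive numbers that sum to 1."""
    largest = max(scores)  # subtracting the max keeps exp() from overflowing
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.1, 0.3, 2.5, 0.2, 0.0, -1.0, 0.4, 1.1]  # made-up raw outputs
certainties = softmax(scores)
predicted = EMOTIONS[certainties.index(max(certainties))]
print(predicted)  # happy
```

The emotion with the highest certainty is taken as the prediction.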

Results

Figure 1 (Accuracy Graph)

Figure 2 (Confusion Matrix)

Figure 3 (Loss Graph)

What Does This Mean?

Figure 1, at the top left, is a Model Evaluation chart. It measures the accuracy of the neural network, which is a series of algorithms that recognizes underlying relationships in a set of data through a process that mimics the way the human brain operates. The numbers at the bottom represent the number of epochs. An epoch is one complete pass of the training dataset through the algorithm. As the model completes more passes of the training data, its accuracy generally increases.

Below it, Figure 3 is a Model Loss chart. It measures how poorly the neural network recognizes the test data, which in this case means how poorly the machine recognizes the emotion a person is expressing based on how they say the phrases "Kids are talking by the door" and "Dogs are sitting by the door". As the number of epochs increases, the model loss decreases, which means the model is improving.
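To give a feel for what a loss value means, here is a sketch of cross-entropy, a loss commonly used for classification. The loss function our model actually used is not stated here, so treat this as an assumption; the certainties are made up.

```python
import math

def cross_entropy(certainties, true_index):
    """Small when the model is confident in the correct emotion."""
    return -math.log(certainties[true_index])

# Three made-up predictions for a sample whose true emotion is index 0.
confident = cross_entropy([0.9, 0.1], true_index=0)
unsure = cross_entropy([0.5, 0.5], true_index=0)
wrong = cross_entropy([0.1, 0.9], true_index=0)
print(confident, unsure, wrong)
```

The loss grows as the certainty given to the correct emotion shrinks, so a falling loss curve means better predictions.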

Figure 2, the big blue chart on the right, is a confusion matrix. A confusion matrix visualizes and summarizes the performance of a classification algorithm. In this chart it summarizes the relationship between what the model predicted and the actual results. Each number in the chart counts how many samples with a given actual emotion received a given predicted emotion. Higher numbers on the diagonal, where the prediction matches the actual emotion, mean the machine is more accurate; higher numbers off the diagonal mean it confuses those two emotions more often.
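A confusion matrix can be built by counting (actual, predicted) pairs. The sketch below uses three emotions and five made-up samples for brevity; the `confusion_matrix` helper is written here for illustration and is not from any library.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual emotions, columns are predicted emotions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

labels = ["happy", "sad", "angry"]
actual = ["happy", "happy", "sad", "angry", "sad"]     # made-up true emotions
predicted = ["happy", "sad", "sad", "angry", "sad"]    # made-up model guesses

matrix = confusion_matrix(actual, predicted, labels)
print(matrix)  # [[1, 1, 0], [0, 2, 0], [0, 0, 1]]

# Accuracy is the share of samples that land on the diagonal.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
print(accuracy)  # 0.8
```

Here one "happy" sample was mistaken for "sad", which shows up as the 1 in the first row, second column.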

Future Applications

- Improvements to current speech recognition AI:
  - Introducing more datasets with actors who have accents other than a North American accent
  - Including datasets with a variety of phrases to better train the model

- Future applications of this neural network:
  - Helping people who struggle with differentiating emotions
  - Having voice applications such as Alexa and Siri adapt based on the emotion a user displays

AI4ALL Speech Emotion Recognition Presentation 2022

Need More Info?

Check out our slides for our contact information and additional insight.
