Emotion Detection From Audio with Neural Nets

Video

Presentation Slides

AML Final Project Video Slides

Project Details

  • CREMA-D is a dataset of 7442 clips

  • 91 actors

  • 48 male, 43 female

  • between the ages of 20 and 74

  • of varying races and ethnicities

  • each actor spoke from a selection of 12 sentences

  • each sentence was spoken with one of 6 emotions

  • and at four different emotion levels

Feature Extraction

  • Before processing the audio files, we tuned the pitch of each recording so that it stayed consistent across actors

  • We extracted the Mel-frequency cepstral coefficients (MFCCs) from each audio file using librosa. MFCCs represent audio on the mel scale, which approximates how the human auditory system perceives sound, making them well suited for this task.

  • We took the average of each MFCC array over time, giving a 40-value vector that represents each audio file (see the sketch after this list)

  • Here is the link to the array of MFCC data for all 7442 audio files
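As a reference for this extraction step, here is a minimal sketch; the file list audio_paths is a hypothetical name, and the pitch tuning above is assumed to have already been applied:

import numpy as np
import librosa

def extract_mfcc(path, n_mfcc=40):
    y, sr = librosa.load(path) # load audio at librosa's default sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc) # shape: (n_mfcc, n_frames)
    return np.mean(mfcc, axis=1) # average over time, giving one 40-value vector per file

mfcc_processed = np.array([extract_mfcc(p) for p in audio_paths]) # audio_paths is hypothetical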

Building the Neural Networks

  • We trained 3 separate neural networks, one per emotional indicator, on different sets of the data to classify each indicator accurately

  • Each network's code was tailored to the label values available for its indicator

  • Each network uses categorical cross-entropy loss, so its softmax output is a probability distribution from which we derive one single expected value

    • We do this because we want the neural net to detect patterns from every emotion type, so middle-ground measurements are still represented (illustrated in the sketch below)
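As a concrete illustration of deriving one expected value from a softmax distribution (the probabilities here are made up):

import numpy as np

probs = np.array([0.01, 0.02, 0.05, 0.10, 0.30, 0.30, 0.15, 0.05, 0.02]) # hypothetical softmax output
values = np.arange(-4, 5) # the nine possible arousal values, -4 through 4
expected = float(np.dot(values, probs)) # single expected arousal value, here 0.53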

Arousal Neural Net

# define baseline model
import keras
from keras.models import Sequential
from keras.layers import Conv1D, BatchNormalization, Activation, Dropout, MaxPooling1D, Flatten, Dense

# create model
model = Sequential()
model.add(Conv1D(256, 8, padding='same', input_shape=(X.shape[1], 1))) # Conv1D suits one-dimensional audio features
model.add(BatchNormalization()) # batch normalization for faster convergence
model.add(Activation('relu')) # ReLU as the activation function
model.add(Conv1D(256, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25)) # dropout to reduce overfitting
model.add(MaxPooling1D(pool_size=8)) # max pooling to detect differences in high-arousal states in the audio
model.add(Conv1D(128, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25))
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(Flatten()) # flatten to one dimension for the output layer
model.add(Dense(dummy_y.shape[1])) # output layer sized to the number of possible labels
model.add(Activation('softmax')) # softmax produces the probability vector

# using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)

# compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

We gave this network the most layers because arousal values range from -4 to 4, giving it the most variance to classify of the three indicators.
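For reference, training the model would look something like the following; the batch size, epoch count, and validation split are assumptions for illustration, not our exact settings:

import numpy as np

X_cnn = np.expand_dims(X, axis=2) # (n_samples, 40) -> (n_samples, 40, 1), the channel shape Conv1D expects
model.fit(X_cnn, dummy_y, batch_size=16, epochs=50, validation_split=0.2) # hypothetical hyperparameters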

Valence Neural Net

model = Sequential()
model.add(Conv1D(256, 8, padding='same', input_shape=(X.shape[1], 1))) # Conv1D suits one-dimensional audio features
model.add(BatchNormalization()) # batch normalization for faster convergence
model.add(Activation('relu'))
model.add(Conv1D(256, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25)) # dropout to reduce overfitting
model.add(MaxPooling1D(pool_size=8)) # max pooling to detect differences in valence across the audio
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(Flatten()) # flatten to one dimension for the output layer
model.add(Dense(dummy_y.shape[1])) # output layer sized to the number of possible labels
model.add(Activation('softmax')) # softmax produces the probability vector

# using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)

# compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Valence only had 4 possible values, [-3, -2, 0, 3]. This led us to build a shallower network than the other two, since we wanted to prevent overfitting.

Approach Motivation Neural Net

model = Sequential()
model.add(Conv1D(256, 8, padding='same', input_shape=(X.shape[1], 1))) # Conv1D suits one-dimensional audio features
model.add(BatchNormalization()) # batch normalization for faster convergence
model.add(Activation('relu'))
model.add(Conv1D(256, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv1D(256, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv1D(256, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Conv1D(128, 8, padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(0.25)) # dropout to reduce overfitting
model.add(MaxPooling1D(pool_size=8)) # max pooling to detect differences in approach motivation across the audio
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(BatchNormalization())
model.add(Conv1D(64, 8, padding='same'))
model.add(Activation('relu'))
model.add(Flatten()) # flatten to one dimension for the output layer
model.add(Dense(dummy_y.shape[1])) # output layer sized to the number of possible labels
model.add(Activation('softmax')) # softmax produces the probability vector

# using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)

# compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])

Approach Motivation had 5 possible values, [-3, -2, -1, 0, 3], so we made its depth fall in the middle ground between Arousal and Valence.
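The post-processing below reloads the trained networks as loaded_modelValence and loaded_modelAppMot. One common way to save and restore them (a sketch with hypothetical file names, not necessarily how our notebook does it):

from keras.models import load_model

model.save('valence_model.h5') # after training each network
loaded_modelValence = load_model('valence_model.h5')
loaded_modelAppMot = load_model('appmot_model.h5')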

Data Post Processing

Once we get an array of probabilities from each neural net, we compute the expected value over that probability array to obtain the classification for the emotional indicator:

# arrays of possible values for the indicators that do not span the full -4 to 4 range
valenceArray = [-3, -2, 0, 3]
appMotArray = [-3, -2, -1, 0, 3]

# deriving the expected value for each indicator and appending it to a prediction array
emotion_plots = [] # array of all indicators
index = 0 # index into the prediction array
X = mfcc_processed

#######################
# VALENCE PREDICTIONS
#######################
for row in loaded_modelValence.predict(X):
    val = 0 # position in the possible-values array
    totalSum = 0 # running sum for the expected value
    for probability in row:
        predVal = valenceArray[val]
        totalSum += predVal * probability # accumulate value times probability
        val += 1
    emotion_plots.append([totalSum]) # first dimension of the indicator

index = 0 # reset the index to calculate the second dimension

#######################
# APPROACH MOTIVATION PREDICTIONS
#######################
for row in loaded_modelAppMot.predict(X):
    val = 0
    totalSum = 0
    for probability in row:
        predVal = appMotArray[val]
        totalSum += predVal * probability
        val += 1
    emotion_plots[index].append(totalSum) # second dimension of the indicator
    index += 1

index = 0 # reset the index to calculate the third dimension

#######################
# AROUSAL PREDICTIONS
#######################
for row in model.predict(X):
    val = -4 # arousal values range from -4 to 4, so we can simply increment
    totalSum = 0
    for probability in row:
        totalSum += val * probability # accumulate value times probability
        val += 1
    emotion_plots[index].append(totalSum) # third dimension of the indicator
    index += 1
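Each loop above is a dot product between a probability row and the corresponding value array, so an equivalent, more compact NumPy version (a sketch, not what the notebook runs) would be:

import numpy as np

valence = loaded_modelValence.predict(X) @ np.array(valenceArray)
app_mot = loaded_modelAppMot.predict(X) @ np.array(appMotArray)
arousal = model.predict(X) @ np.arange(-4, 5)
emotion_plots = np.column_stack([valence, app_mot, arousal]).tolist() # same (n, 3) prediction array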

We then run knn with n_neighbors = 1 on the prediction array, using the 6 emotion reference values as the training data:

# creating data for knn
X = [[3, 3, 3], [0, 0, 0], [-3, -1, -1], [-3, 3, 3], [-1, -3, -3], [0, -2, 2]] # reference point for each emotion (valence, approach motivation, arousal)
nameArrays = ['happy', 'neutral', 'sad', 'anger', 'disgust', 'fear'] # corresponding emotion names
y = [1, 2, 3, 4, 5, 6] # knn classification needs integer labels

# setting up the knn classifier
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1) # creating the knn classifier
neigh.fit(X, y) # fitting the classifier to the reference points

# predicting emotion with knn
neigh.predict(emotion_plots)
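Since the classifier returns the integer labels from y, mapping predictions back to emotion names takes one more step (a small sketch):

preds = neigh.predict(emotion_plots)
pred_names = [nameArrays[p - 1] for p in preds] # labels are 1-indexed into nameArrays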

Final Data

The final graph shows the processed test values of all the audio files on a single 3-D plot, with one axis per emotional indicator. We set up a visualization of the clusters, coloring each point by its knn-predicted emotion. Below is a video of the visualization; for more detail, here is a link to the notebook cell that produces it.
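A minimal sketch of such a 3-D scatter with matplotlib (our notebook's version, linked below, is more elaborate):

import numpy as np
import matplotlib.pyplot as plt

pts = np.array(emotion_plots)
preds = neigh.predict(emotion_plots)
fig = plt.figure()
ax = fig.add_subplot(projection='3d')
ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=preds, cmap='tab10') # color by predicted emotion
ax.set_xlabel('valence')
ax.set_ylabel('approach motivation')
ax.set_zlabel('arousal')
plt.show()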

Accuracy statistics for the classification:

Average accuracy: 0.816655473472129

anger: 0.91796875

sad: 0.7047244094488189

neutral: 0.7431192660550459

disgust: 0.8854961832061069

fear: 0.8647540983606558

happy: 0.7725490196078432
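For reference, per-emotion accuracy of this kind can be computed as follows; y_true is a hypothetical array of true integer labels, aligned with the predictions:

from collections import defaultdict

correct = defaultdict(int)
total = defaultdict(int)
for pred, true in zip(neigh.predict(emotion_plots), y_true): # y_true is an assumption
    total[true] += 1
    correct[true] += int(pred == true)
for label in sorted(total):
    print(nameArrays[label - 1], correct[label] / total[label])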

2020-12-10 15-10-15.mp4

Code link

Here is the link to our Google Colaboratory work environment. To run some cells, you need to import the neural-net files and some Excel data arrays, available for download below:

Some Notes:

For the data processing step we used this drive as a place to store all of our project-related data. We mounted the drive in Colab and ran through each audio recording until we had an array of MFCCs. This array took a while to construct, so we stored it in an Excel file titled mfccdata1.xlsx, contained in the zip file linked above.

All other cells, except the one that reads from the drive, are runnable; however, the training cell takes a long time to run.