Emotional Detection From Audio with Neural Nets
Video
Presentation Slides
Project Paper
Project Details
CREMA-D is a data set of 7442 clips
91 actors
48 male 43 female
between ages 20 and 74
varying race and ethnicity
Actors spoke from a selection of 12 sentences
Sentences were spoken using a selection of 6 emotions
Four different emotional levels
Feature Extraction
Before processing the audio files, we tuned the pitch of each file to keep it consistent for each actor
We extracted the Mel-frequency cepstral coefficients (MFCCs) from each audio file using librosa. MFCCs describe audio on the mel scale, which approximates how the human auditory system perceives frequency.
We averaged each MFCC coefficient across frames, giving a 40-value array that represents each audio file
Here is the link to the array of MFCC data for all 7442 audio files
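The averaging step can be sketched as follows. This is a minimal illustration rather than the project's actual extraction code: the function names `mean_mfcc` and `extract_features` are hypothetical, `n_mfcc=40` is inferred from the 40-value array described above, and the pitch tuning step is not shown.

```python
import numpy as np


def mean_mfcc(mfcc):
    """Average each MFCC coefficient across frames.

    mfcc has shape (n_mfcc, n_frames); the result has shape (n_mfcc,).
    """
    return np.asarray(mfcc).mean(axis=1)


def extract_features(path, n_mfcc=40):
    """Load one clip and reduce it to a single n_mfcc-value vector."""
    import librosa  # imported here so mean_mfcc works without librosa installed

    signal, sample_rate = librosa.load(path, sr=None)  # sr=None keeps the native rate
    mfcc = librosa.feature.mfcc(y=signal, sr=sample_rate, n_mfcc=n_mfcc)
    return mean_mfcc(mfcc)


# Synthetic stand-in for one clip's MFCC matrix: 40 coefficients x 120 frames.
fake_mfcc = np.random.default_rng(0).normal(size=(40, 120))
features = mean_mfcc(fake_mfcc)
print(features.shape)  # (40,)
```

Stacking the vectors returned by `extract_features` for every clip would give one row of 40 values per audio file.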
Building the Neural Networks
We trained 3 separate neural networks, one per emotional indicator, each on a different set of the data
Each network has its own code, tailored to the data we had for its indicator
Each network uses categorical cross-entropy loss so that it outputs a probability distribution over the possible indicator values, from which we derive a single expected value
We do this because we want each network to pick up patterns from every emotion type, which allows for middle-ground measurements
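To make the loss concrete, here is a small sketch of how an indicator value could be turned into a one-hot target (the role `dummy_y` plays in the code below) and how categorical cross-entropy scores a predicted distribution against it. The helper names and the example probabilities are made up for illustration; the class values are the four valence levels used later in this write-up.

```python
import math

VALENCE_VALUES = [-3, -2, 0, 3]  # possible valence values, one class each


def one_hot(value, classes):
    """Return a one-hot list with a 1.0 at the position of `value`."""
    return [1.0 if c == value else 0.0 for c in classes]


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a one-hot target and predicted probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


target = one_hot(0, VALENCE_VALUES)    # [0.0, 0.0, 1.0, 0.0]
confident = [0.05, 0.05, 0.85, 0.05]   # most mass on the true class
unsure = [0.25, 0.25, 0.25, 0.25]      # uniform guess

loss_confident = categorical_cross_entropy(target, confident)
loss_unsure = categorical_cross_entropy(target, unsure)
print(loss_confident < loss_unsure)  # True
```

The loss only rewards probability placed on the true class, but the network still outputs a full distribution, which is what the expected-value step relies on.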
Arousal Neural Net
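# imports needed by the model definitions in this section
import keras
from keras.models import Sequential
from keras.layers import (Activation, BatchNormalization, Conv1D, Dense,
                          Dropout, Flatten, MaxPooling1D)
# create model
model = Sequential()
model.add(Conv1D(256, 8, padding='same',input_shape=(X.shape[1],1))) #first Conv 1D layer; input is one MFCC vector per audio file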
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Conv1D(256, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Dropout(0.25)) #dropout for overfitting
model.add(MaxPooling1D(pool_size=(8))) #max pooling to detect differences in high arousal states in audio file
model.add(Conv1D(128, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Conv1D(128, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Conv1D(128, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization()) #batch normalization for faster convergence
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Dropout(0.25)) #dropout for overfitting
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu')) #relu is used as a commonly used activation function
model.add(Flatten()) #makes output 1 dimensional for output layer
model.add(Dense(dummy_y.shape[1])) #output layer based on the possible outputs
model.add(Activation('softmax')) #activation function to create probability vector
#using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
We gave this network the most layers because the arousal indicator ranges from -4 to 4, giving it the widest spread of values to classify.
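As a sanity check on the architecture, the sequence length can be traced through the layers by hand. The sketch below assumes the input is the 40-value MFCC vector described under Feature Extraction and uses the default Keras behaviour for `MaxPooling1D` (stride equal to the pool size, no padding). The helper `trace_length` is hypothetical and only models the layers that change the shape.

```python
def trace_length(input_length, layers):
    """Follow (length, channels) through a list of simplified layer specs."""
    length, channels = input_length, 1
    for kind, arg in layers:
        if kind == "conv_same":    # Conv1D(filters, k, padding='same') keeps the length
            channels = arg
        elif kind == "max_pool":   # MaxPooling1D(pool_size) with default stride
            length = (length - arg) // arg + 1
    return length, channels


# Shape-changing layers of the arousal network, in order.
arousal_layers = [
    ("conv_same", 256), ("conv_same", 256),
    ("max_pool", 8),
    ("conv_same", 128), ("conv_same", 128), ("conv_same", 128),
    ("conv_same", 64), ("conv_same", 64),
]

length, channels = trace_length(40, arousal_layers)
flattened = length * channels
print(length, channels, flattened)  # 5 64 320
```

Under these assumptions the Flatten layer hands 320 features to the final Dense layer.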
Valence Neural Net
model = Sequential()
model.add(Conv1D(256, 8, padding='same',input_shape=(X.shape[1],1))) #1
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Conv1D(256, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Dropout(0.25)) #dropout for overfitting
model.add(MaxPooling1D(pool_size=(8)))#max pooling downsamples the sequence by a factor of 8
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Flatten())#makes output 1 dimensional for output layer
model.add(Dense(dummy_y.shape[1])) #output layer based on the possible outputs
model.add(Activation('softmax'))#activation function to create probability vector
#using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
Valence had only 4 possible values: [-3,-2,0,3]. This led us to build a shallower network than the other two, since we wanted to prevent overfitting.
Approach Motivation Neural Net
model = Sequential()
model.add(Conv1D(256, 8, padding='same',input_shape=(X.shape[1],1))) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Conv1D(256, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Conv1D(256, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Conv1D(256, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Conv1D(128, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Dropout(0.25)) #dropout for overfitting
model.add(MaxPooling1D(pool_size=(8)))#max pooling downsamples the sequence by a factor of 8
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(BatchNormalization())#batch normalization for faster convergence
model.add(Conv1D(64, 8, padding='same')) #Conv 1D used for Audio data in one dimension
model.add(Activation('relu'))#relu is used as a commonly used activation function
model.add(Flatten())#makes output 1 dimensional for output layer
model.add(Dense(dummy_y.shape[1])) #output layer based on the possible outputs
model.add(Activation('softmax'))
#using SGD for faster convergence
opt = keras.optimizers.SGD(lr=0.1, momentum=0.0, decay=0.0, nesterov=False)
# Compile model
model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
Approach Motivation had 5 possible values, [-3,-2,-1,0,3], so we set its depth between that of the Arousal and Valence networks.
Data Post Processing
Once each neural net outputs an array of probabilities, we take the expected value over that array to obtain the predicted value for the emotional indicator:
#possible values for the indicators that do not span the full -4 to 4 range
valenceArray = [-3,-2,0,3]
appMotArray = [-3,-2,-1,0,3]
#deriving expected values for each indicator and appending them to a prediction array
emotion_plots = [] #array of all indicators
index = 0 #index of the array
X = mfcc_processed
#######################
#VALENCE PREDICTIONS
#######################
for row in loaded_modelValence.predict(X):
    val = 0 #location in the possible values array
    totalSum = 0 #running sum that becomes the expected value
    for probability in row:
        predVal = valenceArray[val]
        totalSum += predVal*probability #adding each value weighted by its probability
        val += 1
    emotion_plots.append([totalSum]) #appending first dimension of indicator to prediction array
index = 0 #setting index to zero to calculate second dimension of indicator
#######################
#APPROACH MOTIVATION PREDICTIONS
#######################
for row in loaded_modelAppMot.predict(X):
    val = 0 #location in the possible values array
    totalSum = 0 #running sum that becomes the expected value
    for probability in row:
        predVal = appMotArray[val]
        totalSum += predVal*probability #adding each value weighted by its probability
        val += 1
    emotion_plots[index].append(totalSum) #appending second dimension of indicator to prediction array
    index += 1
index = 0 #setting index to zero to calculate third dimension of indicator
#######################
#AROUSAL PREDICTIONS
#######################
for row in model.predict(X):
    val = -4 #value ranges from -4 to 4, so we can just increment
    totalSum = 0 #running sum that becomes the expected value
    for probability in row:
        totalSum += val*probability #adding each value weighted by its probability
        val += 1
    emotion_plots[index].append(totalSum) #appending third dimension of indicator to prediction array
    index += 1
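The three loops above share one pattern: weight each possible indicator value by its predicted probability and sum. A compact sketch of that pattern is shown below. `expected_value` is a hypothetical helper, and the probability rows are invented for illustration rather than taken from the trained models.

```python
def expected_value(values, probabilities):
    """Probability-weighted average of the possible indicator values."""
    return sum(v * p for v, p in zip(values, probabilities))


valence_values = [-3, -2, 0, 3]
app_mot_values = [-3, -2, -1, 0, 3]
arousal_values = list(range(-4, 5))  # -4, -3, ..., 4

# Made-up softmax outputs for a single clip.
valence_probs = [0.0, 0.0, 0.5, 0.5]
app_mot_probs = [0.0, 0.0, 0.0, 0.5, 0.5]
arousal_probs = [0.0] * 8 + [1.0]

point = [
    expected_value(valence_values, valence_probs),
    expected_value(app_mot_values, app_mot_probs),
    expected_value(arousal_values, arousal_probs),
]
print(point)  # [1.5, 1.5, 4.0]
```

Because the result is a weighted average, it can fall between the discrete class values (1.5 here), which is what makes middle-ground measurements possible.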
We then run k-nearest neighbors (k-NN) with k = 1 on the prediction array, using one reference point per emotion (6 in total) as the training data:
#printing emotion predictions
#creating data for knn
X = [[3, 3, 3], [0, 0, 0], [-3, -1, -1], [-3, 3, 3], [-1, -3, -3], [0, -2, 2]] # training data for each indicator
nameArrays = ['happy', 'neutral', 'sad', 'anger', 'disgust', 'fear'] #corresponding emotion
y = [1, 2, 3, 4, 5, 6] #need to use integers for corresponding knn classification
#setting up knn classifiers
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1) #creating knn classifier
neigh.fit(X, y) #fitting classifier to the array of each indicator
#predicting emotion with knn
neigh.predict(emotion_plots)
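With a single neighbor and one reference point per emotion, the classifier reduces to picking the closest reference point by Euclidean distance (the default metric of `KNeighborsClassifier`). The sketch below spells that out without scikit-learn and maps the result straight to an emotion name. `nearest_emotion` is a hypothetical helper, and the axis order (valence, approach motivation, arousal) is assumed from the order in which the prediction array is filled.

```python
import math

REFERENCE_POINTS = {
    "happy": [3, 3, 3],
    "neutral": [0, 0, 0],
    "sad": [-3, -1, -1],
    "anger": [-3, 3, 3],
    "disgust": [-1, -3, -3],
    "fear": [0, -2, 2],
}


def nearest_emotion(point):
    """Return the emotion whose reference point is closest to `point`."""
    return min(
        REFERENCE_POINTS,
        key=lambda name: math.dist(point, REFERENCE_POINTS[name]),
    )


print(nearest_emotion([2.5, 2.8, 3.1]))   # happy
print(nearest_emotion([0.1, -0.2, 0.1]))  # neutral
```

The same mapping can be recovered from the scikit-learn output by indexing `nameArrays` with the predicted label minus one.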
Final Data
The final graph plots the processed test values of the audio files in 3-D, with the three emotional indicators as the axes. The visualization shows all the clusters, with each classified point colored by the emotion predicted with k-NN. Below is a video of the visualization; for more details, here is a link to the cell that produces it.
Some stats for how accurate the classification was:
Average acc: 0.816655473472129
anger: 0.91796875
sad: 0.7047244094488189
neutral: 0.7431192660550459
disgust: 0.8854961832061069
fear: 0.8647540983606558
happy: 0.7725490196078432
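Figures like these can be computed from true and predicted labels as sketched below. The label lists are toy data, not the project's test set, and the helper names are hypothetical. Overall accuracy weights each class by its number of clips, so it generally differs from the unweighted mean of the per-class values. The average reported above (about 0.8167) is slightly higher than the unweighted mean of the six per-class figures (about 0.8148), which would be consistent with it being an overall accuracy.

```python
from collections import Counter


def per_class_accuracy(y_true, y_pred):
    """Fraction of correctly classified clips for each true label."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / totals[label] for label in totals}


def overall_accuracy(y_true, y_pred):
    """Fraction of all clips that were classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy labels for illustration only.
y_true = ["anger", "anger", "sad", "sad", "sad", "happy"]
y_pred = ["anger", "anger", "sad", "neutral", "sad", "fear"]

by_class = per_class_accuracy(y_true, y_pred)
overall = overall_accuracy(y_true, y_pred)
print(by_class)
print(overall)
```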