Phase I
Sound recognition has emerged as an exciting area of innovation in artificial intelligence. Everyday applications such as Shazam, Siri, and Alexa all rely on sound recognition technology. The sub-field of speech recognition generates the most interest from app developers, and annotated speech data for training and testing is widely available.
One area where developers lack high-quality training data is environmental audio events. Environmental sounds include breaking glass, running water, and a crying baby. Recognition of these audio events can inform smart devices, assist the hearing impaired, and support security systems [1].
The goal of this project is to create a deep learning neural network that recognizes environmental sounds with high accuracy. Challenges include the difficulty of designing ensemble neural networks and the lack of high-volume annotated training data. An additional goal is to identify which audio features improve a model's performance. A future goal is to develop a plan for integrating this neural network with internet sound banks like freesound.org, soundly.org, and Splice to build ever-expanding training sets.
Improvements in sound classification technology could further enhance:
Music industry (education, performance, and composition)
Smart device capabilities (home security, Internet of Things (IoT), autonomous vehicles)
Assistance for the hearing impaired
EEG medical analysis
For my undergraduate degree, I studied the art of Jazz Drum Set Performance. I have always been interested in digital audio workstations (DAWs) like Logic Pro X, Pro Tools, and Ableton Live, and how these DAWs convert raw sound into .wav files. This project allows me to combine my two passions of music and data science.
The data source for this project is the UrbanSound8K data set. It contains 8732 labeled sound excerpts of urban sounds from 10 classes: air conditioner, car horn, children playing, dog bark, drilling, engine idling, gun shot, jackhammer, siren, and street music.
The audio files are stored in .wav format. The data set also includes metadata for each sound file.
https://urbansounddataset.weebly.com/urbansound8k.html
As an additional test, I plan to use personal field recordings of environmental events to evaluate the accuracy of the neural network.
In 2015, Karol J. Piczak assessed the performance of convolutional neural networks (CNNs) on sound pattern recognition. Piczak found that CNNs, already strong at identifying image-based patterns, were also highly successful at identifying everyday environmental noises [2].
In 2016, Justin Salamon and Juan Pablo Bello found that the accuracy of a deep convolutional neural network increases with additional training data. They demonstrated this by comparing the model's performance on the original data set against its performance on an augmented data set [3].
In 2017, Muhammed Huzaifah compared different methods of obtaining visual representations of an audio signal. The methods included the short-time Fourier transform (STFT) with linear and Mel scales, the constant-Q transform, and the continuous wavelet transform. Huzaifah found that the Mel-scaled STFT slightly outperformed the other methods, and all methods outperformed Mel Frequency Cepstral Coefficients [4].
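For reference, the Mel-scaled representation discussed in [4] can be computed with Librosa, the Python library used later in this project. This is a minimal sketch; the file path and parameter values are illustrative, not taken from the original study.

import librosa
import numpy as np

# Load an audio clip (path is illustrative; UrbanSound8K clips are .wav files)
y, sr = librosa.load("dog_bark.wav", sr=22050)

# Mel-scaled spectrogram: STFT magnitudes pooled onto a Mel filter bank
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=512, n_mels=128)

# Convert power to decibels for a more perceptually meaningful scale
mel_db = librosa.power_to_db(mel, ref=np.max)
print(mel_db.shape)  # (n_mels, number of frames)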
[1] M. Smales, "Sound Classification using Deep Learning," February 27, 2019. [Online]. Available: https://medium.com/@mikesmales/sound-classification-using-deep-learning-8bc2aa1990b7. [Accessed Sept. 21, 2020].
[2] K. Piczak, "Environmental Sound Classification with Convolutional Neural Networks," Sept 17, 2015. [Online]. Available: https://www.karolpiczak.com/papers/Piczak2015-ESC-ConvNet.pdf. [Accessed Sept. 23, 2020].
[3] J. Salamon and J. P. Bello, "Deep Convolutional Neural Networks and Data Augmentation for Environmental Sound Classification," November 2016. [Online]. Available: https://arxiv.org/pdf/1608.04363.pdf. [Accessed Sept. 23, 2020].
[4] M. Huzaifah, "Comparison of Time-Frequency Representations for Environmental Sound Classification using Convolutional Neural Networks," June 22, 2017. [Online]. Available: https://arxiv.org/pdf/1706.07156.pdf. [Accessed Sept. 20, 2020].
[5] N. Srinivasan, "Building an Audio Classifier using Deep Neural Networks," December 2017. [Online]. Available: https://www.kdnuggets.com/2017/12/audio-classifier-deep-neural-networks.html. [Accessed Sept. 22, 2020].
Phase II
EDA and initial model testing.
Further examination of the data set revealed two broad categories of sound files [6]. Some files contain long, continuous sounds (jackhammer, drilling, engine idling), while others are more sporadic in nature (gun shot, car horn, dog bark). I used Librosa, a Python library, to visualize the sound waves, as sketched after the figures below. Transforming audio files into visual representations such as spectrograms makes feature extraction much easier [7].
(Dog bark: 203356-3-0-1.wav)
(Jackhammer: 33340-7-13-0.wav)
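A minimal sketch of this visualization step, assuming Librosa and Matplotlib; the file names come from the figures above, but the directory layout (fold numbers) is illustrative.

import librosa
import librosa.display
import matplotlib.pyplot as plt

# Load two contrasting clips from UrbanSound8K (fold paths are illustrative)
bark, sr_bark = librosa.load("UrbanSound8K/audio/fold1/203356-3-0-1.wav")
jack, sr_jack = librosa.load("UrbanSound8K/audio/fold1/33340-7-13-0.wav")

# Plot the waveforms: sporadic (dog bark) vs. continuous (jackhammer)
fig, axes = plt.subplots(2, 1, figsize=(10, 6))
librosa.display.waveshow(bark, sr=sr_bark, ax=axes[0])  # waveplot in older Librosa versions
axes[0].set_title("Dog bark (sporadic)")
librosa.display.waveshow(jack, sr=sr_jack, ax=axes[1])
axes[1].set_title("Jackhammer (continuous)")
plt.tight_layout()
plt.show()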
Mel Frequency Cepstral Coefficients (MFCCs) were extracted for all 8732 files through a function that parses the metadata file to locate each audio clip. I chose MFCCs because they consistently perform well as inputs to CNNs [8].
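A minimal sketch of that extraction loop, assuming the standard UrbanSound8K folder layout and metadata columns, and the common approach of averaging each MFCC coefficient over time; n_mfcc=40 is an assumption, not a value stated above.

import os
import librosa
import numpy as np
import pandas as pd

def extract_mfcc(path, n_mfcc=40):
    # Load the clip, compute MFCCs, then average each coefficient over time
    audio, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return np.mean(mfcc, axis=1)

# UrbanSound8K ships a metadata CSV listing each file's fold and class label
meta = pd.read_csv("UrbanSound8K/metadata/UrbanSound8K.csv")

features, labels = [], []
for _, row in meta.iterrows():
    path = os.path.join("UrbanSound8K/audio", f"fold{row['fold']}", row["slice_file_name"])
    features.append(extract_mfcc(path))
    labels.append(row["class"])

X = np.array(features)  # shape: (8732, 40)
y = np.array(labels)    # class name for each clip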
The model ran successfully, with a prediction accuracy of 62%. My goal is to build an ensemble network composed of Recurrent Neural Networks and Convolutional Neural Networks and reach 90%+ accuracy.
Mel Frequency Cepstral Coefficients: Dog Bark (203356-3-0-1.wav)
Mel Frequency Cepstral Coefficients: Jackhammer (33340-7-13-0.wav)
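The architecture of the initial model is not shown above. As a purely hypothetical baseline on the 40 averaged MFCCs (reusing X and y from the extraction sketch earlier), a simple dense Keras classifier might look like the following; layer sizes and training settings are assumptions, not the configuration actually used.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.utils import to_categorical

# One-hot encode the 10 class labels and split into train/test sets
y_onehot = to_categorical(LabelEncoder().fit_transform(y))
X_train, X_test, y_train, y_test = train_test_split(X, y_onehot, test_size=0.2, random_state=42)

# Hypothetical dense baseline on the 40 MFCC means
model = Sequential([
    Dense(256, activation="relu", input_shape=(40,)),
    Dropout(0.5),
    Dense(128, activation="relu"),
    Dropout(0.5),
    Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(X_train, y_train, epochs=50, batch_size=32, validation_data=(X_test, y_test))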
[6] J. Salamon et al., "UrbanSound8K," November 2014. [Online]. Available: https://urbansounddataset.weebly.com/urbansound8k.html. [Accessed Oct. 10, 2020].
[7] N. Srinivasan, "Building an Audio Classifier using Deep Neural Networks," December 2017. [Online]. Available: https://www.kdnuggets.com/2017/12/audio-classifier-deep-neural-networks.html. [Accessed Oct. 21, 2020].
[8] S. Saha, "A Comprehensive Guide to Convolutional Neural Networks -- the ELI5 way," December 15, 2018. [Online]. Available: https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53. [Accessed Oct. 25, 2020].
Phase III
The focus of Phase 3 was to finalize the models, draw conclusions, and consider what could be done differently in the future.
I chose to implement an Artificial Neural Network (ANN), a Recurrent Neural Network (RNN), and a Convolutional Neural Network (CNN). Each model required substantial fine-tuning for optimal performance.
This was my first time building, optimizing, and fine-tuning a neural network, so there were plenty of challenges and bugs. A major challenge was identifying the correct input shape for each network and reshaping the data accordingly, as sketched below: the RNN requires a three-dimensional array, while the CNN requires a four-dimensional array.
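A minimal sketch of that reshaping, assuming full (un-averaged) MFCC matrices padded to a fixed number of frames; the frame count of 174 and the dimension names are illustrative.

import numpy as np

# Suppose each clip yields a fixed-size MFCC matrix of 40 coefficients x 174 frames
n_samples, n_mfcc, n_frames = 8732, 40, 174
X = np.zeros((n_samples, n_mfcc, n_frames))  # placeholder for the extracted features

# RNN input: 3-D array of (samples, time steps, features per step)
X_rnn = X.transpose(0, 2, 1)                       # shape (8732, 174, 40)

# CNN input: 4-D array of (samples, height, width, channels), like a one-channel image
X_cnn = X.reshape(n_samples, n_mfcc, n_frames, 1)  # shape (8732, 40, 174, 1)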
I spent a lot of time tinkering with the layers of each network to improve the performance. I was pleased with the end result.
The Convolutional Neural Network (CNN) consistently performed better than the ANN and the RNN. CNNs excel at image classification because they can "reduce the number of parameters without losing on the quality of models" [9]. If I were building a sound recognition device, I would feel most confident using a Convolutional Neural Network.
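To put a number on the quoted point, compare the weight counts of a dense layer and a convolutional layer over the same 40 x 174 x 1 input; the layer sizes below are illustrative, not the ones used in the project.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, Dense, Flatten

# Dense layer on the flattened input: every input value connects to every unit
dense_model = Sequential([Flatten(input_shape=(40, 174, 1)), Dense(64)])

# Convolutional layer: one small 3x3 kernel per filter is shared across the whole input
conv_model = Sequential([Conv2D(64, kernel_size=(3, 3), input_shape=(40, 174, 1))])

dense_model.summary()  # about 40 * 174 * 64 + 64 = 445,504 parameters
conv_model.summary()   # only 3 * 3 * 1 * 64 + 64 = 640 parameters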
Increase the accuracy to 90%.
Explore additional audio features to feed into the models.
Build a Convolutional Recurrent Neural Network.
[9] P. Mishra, "Why are Convolutional Neural Networks good for image classification," May 2019. [Online]. Available: https://medium.com/datadriveninvestor/why-are-convolutional-neural-networks-good-for-image-classification-146ec6e865e8. [Accessed Dec. 3, 2020].