Acknowledgment: "This material is based upon work supported by the National Science Foundation under Grant No. 1617107 and No. 1617497."
Acknowledgment: We acknowledge NVIDIA's Titan X Pascal GPU donation for this research.
More information about this project can be accessed here.
Retrieving Sounds by Vocal Imitation Recognition
Vocal imitation is widely used in human communication. For sounds that carry a semantic meaning (e.g., a dog bark or a car horn), vocal imitation helps narrow down the intended concept. For example, there are many kinds of dog barks, and a vocal imitation can help distinguish infantile barks from "Christmas tree" barks. For sounds that do not have a definite semantic meaning (e.g., sounds from a synthesizer), vocal imitation is often the only way to convey the concept.
Unsupervised Feature Extraction
Vocal imitation conveys rich information about many acoustic aspects: pitch, loudness, timbre, their temporal evolution, and rhythmic patterns. To imitate different sounds, people tend to imitate different aspects, namely those that best characterize each sound. Identifying the characteristic aspects of an imitation is difficult; in some cases, imitators themselves may not be able to describe the aspect(s) they are imitating. Finding features to represent these ill-defined aspects is therefore very challenging. We use a two-hidden-layer stacked auto-encoder to learn features from a set of vocal imitations in an unsupervised way. Features learned in this way characterize acoustic aspects that humans often imitate.
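The following minimal sketch shows how such a two-hidden-layer stacked auto-encoder could be set up, with the 500- and 100-unit hidden layers described below. The framework (Keras), the flattened CQT patch size, the activations, and the training schedule are illustrative assumptions not taken from this page, and the network is trained end-to-end here rather than with greedy layer-wise pre-training.

```python
# Minimal sketch of a two-hidden-layer stacked auto-encoder for vocal
# imitation patches (hypothetical settings; not the exact configuration
# used in this project).
import numpy as np
from tensorflow.keras import layers, models

PATCH_DIM = 39 * 50  # hypothetical: flattened CQT patch (bins x frames)

inp = layers.Input(shape=(PATCH_DIM,))
h1 = layers.Dense(500, activation="sigmoid", name="hidden1")(inp)  # 500 features
h2 = layers.Dense(100, activation="sigmoid", name="hidden2")(h1)   # 100 features
dec1 = layers.Dense(500, activation="sigmoid")(h2)
out = layers.Dense(PATCH_DIM, activation="sigmoid")(dec1)

autoencoder = models.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")

# X: (n_patches, PATCH_DIM) matrix of flattened CQT patches scaled to [0, 1].
X = np.random.rand(1024, PATCH_DIM).astype("float32")  # placeholder data
autoencoder.fit(X, X, epochs=10, batch_size=128, verbose=0)

# After training, the encoder maps each patch to its 100-dimensional feature vector.
encoder = models.Model(inp, h2)
features = encoder.predict(X)  # shape: (n_patches, 100)
```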
Figure 1. Feature extraction visualization: (a) first-hidden-layer features; (b) second-hidden-layer features. Lighter colors represent higher energy.
To obtain satisfactory feature extraction performance, we set the number of neurons in the first and second hidden layers to 500 and 100, respectively. The weights connecting each neuron to the previous layer form one feature, so there are 500 features in the first hidden layer and 100 in the second. Figure 1 visualizes these features; due to limited space, only the first 100 of the 500 first-hidden-layer features are shown. We can see that the first hidden layer extracts features that act as building blocks of the CQT spectrogram. The feature of each second-hidden-layer neuron is a weighted linear combination of the features of the first-hidden-layer neurons to which it is strongly connected; these features are more abstract.
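Continuing the hypothetical sketch above, the learned features can be inspected in the same spirit as Figure 1: each first-hidden-layer feature is a column of the input-to-hidden weight matrix, and each second-hidden-layer feature can be projected back to the patch space as a weighted combination of first-layer features. The patch shape and plotting details are assumptions.

```python
# Visualize learned features: first-hidden-layer features are columns of W1;
# a second-hidden-layer feature projected to patch space is a column of W1 @ W2.
import matplotlib.pyplot as plt

W1 = autoencoder.get_layer("hidden1").get_weights()[0]  # (PATCH_DIM, 500)
W2 = autoencoder.get_layer("hidden2").get_weights()[0]  # (500, 100)

fig, axes = plt.subplots(10, 10, figsize=(10, 10))
for j, ax in enumerate(axes.flat):                 # first 100 of the 500 features
    ax.imshow(W1[:, j].reshape(39, 50), origin="lower", cmap="gray")
    ax.axis("off")
fig.suptitle("First-hidden-layer features (first 100 of 500)")
plt.show()

# One second-hidden-layer feature, projected back to the CQT patch space.
feat2 = (W1 @ W2)[:, 0].reshape(39, 50)
plt.imshow(feat2, origin="lower", cmap="gray")
plt.title("Second-hidden-layer feature 0")
plt.show()
```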
Multi-class Classification & Sound Retrieval
We propose a supervised approach to recognizing vocal imitations. For each sound concept, we assume that a number of vocal imitations are available for training, and a multi-class Support Vector Machine (SVM) is trained to discriminate among imitations of different sound concepts. Given a new vocal imitation whose underlying sound concept is unknown, the SVM classifies each patch of the imitation into one of the trained concepts, and a majority vote over the patches yields the recording-level classification.
Figure 2. Illustration of recording-level classification calculation.
Given the classified sound concept, sounds of that concept can be retrieved. However, the returned concept is not always correct. Therefore, in addition to the hard classification output, we also obtain a probabilistic output: the probability (confidence) that a vocal imitation patch belongs to each of the trained sound concepts. We then sort the sound concepts by their classification probabilities from high to low and return sounds of the highly ranked concepts. At the recording level, we average the probability outputs over all patches of an imitation and sort the sound concepts by the averaged classification probability from high to low. Again, sounds of the highly ranked concepts can be retrieved. Figure 2 illustrates the overall process: the n-th imitation consists of a series of patches, each with its own classification label and probability vector. The label that appears most frequently is chosen as the recording-level classification label, and the probability vectors are averaged to obtain an average probability vector, based on which the sound concepts are ranked.
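The sketch below illustrates the patch-level classification and recording-level aggregation just described, assuming scikit-learn's SVC on the 100-dimensional auto-encoder features; the kernel choice, data shapes, and number of concepts are placeholders rather than the settings actually used.

```python
# Patch-level multi-class SVM with recording-level majority vote and
# probability averaging (placeholder data; hypothetical hyper-parameters).
import numpy as np
from collections import Counter
from sklearn.svm import SVC

# Training: X_train holds patch features; y_train holds the sound-concept
# label of the imitation each patch was cut from.
X_train = np.random.rand(500, 100)             # placeholder patch features
y_train = np.random.randint(0, 20, size=500)   # placeholder concept labels
svm = SVC(kernel="rbf", probability=True)      # multi-class SVM
svm.fit(X_train, y_train)

# Testing: one new imitation, represented as a sequence of patch features.
patches = np.random.rand(12, 100)              # placeholder: 12 patches
patch_labels = svm.predict(patches)            # one label per patch
patch_probs = svm.predict_proba(patches)       # (n_patches, n_concepts)

# Recording-level classification: majority vote over patch labels.
recording_label = Counter(patch_labels).most_common(1)[0][0]

# Recording-level retrieval: average patch probabilities, rank concepts from
# most to least likely, and return sounds of the highly ranked concepts.
avg_probs = patch_probs.mean(axis=0)
ranked_concepts = svm.classes_[np.argsort(avg_probs)[::-1]]
print(recording_label, ranked_concepts[:5])
```

Platt scaling (enabled by probability=True in scikit-learn) is one common way to turn SVM decision values into the per-concept confidences used for ranking.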
Experimental Results
We use the VocalSketch Data Set v1.0.4 [1] in our experiments. This dataset contains sound concepts and their vocal imitation recordings in four categories: Acoustic instruments, Commercial synthesizers, Everyday, and Single synthesizer, containing 40, 40, 120, and 40 sound concepts, respectively. We use vocal imitations of the first half of the sound concepts (ordered alphabetically) to train the stacked auto-encoder for feature learning, and use the second half to train and test the multi-class classifier within each category. This prevents the proposed system from over-fitting to imitations that were used for feature learning. We use two measures to evaluate system performance: 1) classification accuracy, for vocal imitation classification; and 2) Mean Reciprocal Rank (MRR), for sound retrieval. We adopt 10-fold cross validation to avoid over-fitting, and use a system based on MFCC features as the baseline.
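As a worked illustration of the retrieval metric, the snippet below computes MRR: for each test imitation, the reciprocal of the rank at which its true sound concept appears in the returned ranking, averaged over all imitations. The concept names and rankings are placeholders, not results from the dataset.

```python
# Mean Reciprocal Rank (MRR) over a set of retrieval rankings.
import numpy as np

def mean_reciprocal_rank(rankings, true_concepts):
    """rankings[i] is the concept ranking for imitation i (best first);
    true_concepts[i] is the ground-truth concept of imitation i."""
    ranks = [list(r).index(t) + 1 for r, t in zip(rankings, true_concepts)]
    return float(np.mean([1.0 / r for r in ranks]))

# Placeholder example with three imitations of the same (hypothetical) concept.
rankings = [["dog_bark", "car_horn", "door_slam"],
            ["car_horn", "dog_bark", "door_slam"],
            ["door_slam", "car_horn", "dog_bark"]]
true_concepts = ["dog_bark", "dog_bark", "dog_bark"]
print(mean_reciprocal_rank(rankings, true_concepts))  # (1 + 1/2 + 1/3) / 3 ≈ 0.61
```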
Table 1. Recording-level 10-fold cross-validation results
Conclusions
Download code here.