Part of our project involves re-implementing a previously tested algorithm that has been shown to classify songs accurately.
See A Bag-of-Tones Model with MFCC Features for Musical Genre Classification, Qin et al.
This algorithm involves:
Transforming songs into sets of “tones” determined using Mel Frequency Cepstral (MFC) coefficients extracted from STFT windows.
Using each song’s tone composition as the predictor variable for machine-learning classification.
The songs are partitioned into 30-second clips.
From each short-time Fourier Transform window (“point of sound”) of a clip, a vector of 42 MFC coefficients is extracted.
The coefficient vectors from all of the training clips are clustered into 80 “tones” using k-means clustering (a code sketch of this tone pipeline follows these steps).
Each clip is converted into a histogram of the tones it contains: a “bag of tones”. This is a feature vector of length 80.
An ML classifier is trained on the bags of tones and song categories.
For test clips, MFCC vectors are extracted, and each is assigned to the tone whose centroid is nearest in Euclidean distance.
The classifier predicts the classes of test songs from the test feature vectors.
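As a concrete illustration, the following is a minimal MATLAB sketch of the tone pipeline above. The variable names are ours, and the kmeans call assumes the Statistics and Machine Learning Toolbox; this is a sketch of the idea, not the paper’s code.

    % Learn the 80 tones from the pooled training MFCC vectors.
    % allVecs: (total training windows)-by-42 matrix of MFCC vectors (assumed name).
    nTones = 80;
    [~, centroids] = kmeans(allVecs, nTones);

    % Convert one clip, given as a (windows)-by-42 matrix V, into its bag of tones.
    dist2 = sum(V.^2, 2) + sum(centroids.^2, 2).' - 2 * V * centroids.';
    [~, tone] = min(dist2, [], 2);              % nearest tone centroid per window
    bag = accumarray(tone, 1, [nTones 1]).';    % length-80 tone histogram

The same nearest-centroid assignment serves both training clips (to build the feature matrix) and test clips, where the centroids stay fixed.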
MFCC extraction proceeds as follows. First, the signal is converted into the frequency domain using a short-time Fourier Transform windowed by a Hamming window, and the magnitude (absolute value) of each spectrum is taken.
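A minimal sketch of this step, assuming a mono signal x stored as a column vector; the frame length and hop size below are placeholder values, not the paper’s settings.

    % Hamming-windowed STFT magnitudes; frameLen and hop are assumed values.
    frameLen = 512; hop = 256;
    w = 0.54 - 0.46 * cos(2*pi*(0:frameLen-1).' / (frameLen-1));  % Hamming window
    nFrames = floor((length(x) - frameLen) / hop) + 1;
    S = zeros(frameLen/2 + 1, nFrames);
    for k = 1:nFrames
        seg = x((k-1)*hop + (1:frameLen)) .* w;   % windowed frame of the signal
        X = fft(seg);
        S(:, k) = abs(X(1:frameLen/2 + 1));       % one-sided magnitude spectrum
    end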
Next, we must translate the frequency scale from Hz to Mel Frequency.
Mel frequency is equal to the frequency in Hz for frequencies below 1000 Hz.
Above 1000 Hz the following logarithmic relation applies: Mel(f) = 1127 ln(1 + f/700)
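This piecewise mapping is small enough to state directly in code; note that the two branches agree at the 1000 Hz breakpoint, since 1127 ln(1 + 1000/700) ≈ 1000. A helper saved as hz2mel.m:

    % Piecewise Hz-to-Mel conversion described above; operates elementwise.
    function m = hz2mel(f)
    m = f;                                   % identity below 1000 Hz
    idx = f >= 1000;
    m(idx) = 1127 * log(1 + f(idx) / 700);   % logarithmic above 1000 Hz
    end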
A triangular filter bank (shown in the figure below) is used to convert the spectrum to the Mel scale. The filter bank output is a vector of approximately 40 values, so the filter bank performs a large amount of dimensionality reduction.
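One way to construct such a filter bank is sketched below; the choice of 40 filters, the sample rate, the FFT size, and the uniform spacing of filter centers on the Mel scale are our assumptions.

    % Triangular Mel filter bank; nFilt, fs, and nfft are assumed values.
    nFilt = 40; fs = 22050; nfft = 512;
    melEdges = linspace(hz2mel(0), hz2mel(fs/2), nFilt + 2);  % uniform in Mel
    hzEdges = melEdges;                                       % invert the Mel map
    idx = melEdges >= 1000;
    hzEdges(idx) = 700 * (exp(melEdges(idx) / 1127) - 1);
    binHz = (0:nfft/2) * fs / nfft;                           % FFT bin frequencies
    H = zeros(nFilt, nfft/2 + 1);
    for m = 1:nFilt
        lo = hzEdges(m); c = hzEdges(m+1); hi = hzEdges(m+2);
        H(m, :) = max(0, min((binHz - lo)/(c - lo), (hi - binHz)/(hi - c)));
    end
    melSpec = H * S;        % (nFilt)-by-(frames) Mel spectrum of the magnitudes S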
The final two steps in extracting MFCCs are a logarithm and a Discrete Cosine Transform (DCT): the DCT is applied to the log of the filter bank output, producing the cepstrum. In general, a cepstrum is the “spectrum of the logarithm of the spectrum” of a signal.
The DCT output is truncated to 13 coefficients for each window, and the log of the window’s energy is appended, giving 14 static features per window. To these we add their first- and second-order derivatives (with respect to window number), for a total of 3 × 14 = 42 coefficients per point of sound.
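In MATLAB these remaining steps might look as follows; the unnormalized DCT-II basis and the simple frame-to-frame difference standing in for the derivative are our choices, not necessarily the paper’s.

    % Log, DCT, truncation to 13 coefficients, log energy, and derivatives.
    logMel = log(melSpec + eps);                   % eps avoids log(0)
    N = size(logMel, 1);
    D = cos(pi/N * (0:N-1).' * ((0:N-1) + 0.5));   % DCT-II basis (unnormalized)
    mfcc = D(1:13, :) * logMel;                    % first 13 coefficients per window
    logE = log(sum(S.^2, 1) + eps);                % log energy of each window
    static = [mfcc; logE];                         % 14 static features per window
    d1 = [zeros(14, 1), diff(static, 1, 2)];       % first derivative (delta)
    d2 = [zeros(14, 1), diff(d1, 1, 2)];           % second derivative (delta-delta)
    feat = [static; d1; d2];                       % 42 coefficients per window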
Figure: Triangular filter bank used to convert the spectrum from the Hz scale to the Mel scale
Once a tone histogram has been extracted for each song, the task reduces to a 3-class classification problem on the length-80 feature vectors. We explored several algorithms for it, all implemented in MATLAB: Support Vector Machine (SVM), Naive Bayes Classifier (NB), and K-Nearest Neighbors (KNN). All of these have been covered in class.
We found that the classifier with the best accuracy was KNN with K = 1, i.e., a nearest-neighbor classifier.
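For reference, a minimal version of the winning classifier, assuming the Statistics and Machine Learning Toolbox and our own variable names:

    % 1-nearest-neighbor classifier on the bag-of-tones features.
    % trainBags/testBags: clips-by-80 histograms; labels are categorical genres.
    mdl = fitcknn(trainBags, trainLabels, 'NumNeighbors', 1);
    predicted = predict(mdl, testBags);
    accuracy = mean(predicted == testLabels);      % fraction classified correctly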