We trained our classifier on 15 songs per genre and tested it on 10 songs per genre. All of the songs were split into 30-second clips, and an equal number of clips was used for each genre.
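For concreteness, a minimal sketch of this clip-splitting step is shown below; the file path is hypothetical and the use of librosa for audio loading is an assumption made for illustration, not a description of our actual preprocessing code.

```python
# Minimal sketch (hypothetical path): splitting one song into
# non-overlapping 30-second clips, discarding any trailing remainder.
import librosa

def split_into_clips(path, clip_seconds=30):
    y, sr = librosa.load(path, sr=None)            # load at the file's native sample rate
    clip_len = clip_seconds * sr
    return [y[i:i + clip_len]
            for i in range(0, len(y) - clip_len + 1, clip_len)]

clips = split_into_clips("data/folk/song01.wav")   # hypothetical file
```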
Because the k-means cluster-center initialization depends on random numbers, our classifier produces highly variable results. We report here the true positive rates over six repeated runs of the algorithm.
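The evaluation loop we have in mind is sketched below; the synthetic MFCC frames and the nearest-mean classifier over cluster-assignment histograms are illustrative stand-ins rather than our exact pipeline, and all names and sizes are placeholders.

```python
# Minimal sketch (synthetic data, placeholder sizes): collecting per-genre
# true positive rates over repeated runs to expose k-means initialization variance.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
N_GENRES, CLIPS_PER_GENRE, N_FRAMES, N_MFCC, K = 3, 10, 200, 13, 16

# Hypothetical MFCC frames: one (N_FRAMES, N_MFCC) array per clip.
def fake_clip(genre):
    return rng.normal(loc=genre, scale=2.0, size=(N_FRAMES, N_MFCC))

train = [(g, fake_clip(g)) for g in range(N_GENRES) for _ in range(CLIPS_PER_GENRE)]
test  = [(g, fake_clip(g)) for g in range(N_GENRES) for _ in range(CLIPS_PER_GENRE)]

def bag_of_tones(clip_frames, km):
    """Histogram of k-means cluster assignments (the 'bag of tones') for one clip."""
    counts = np.bincount(km.predict(clip_frames), minlength=K)
    return counts / counts.sum()

tpr_per_run = []
for seed in range(6):                               # 6 repeated runs, different initializations
    km = KMeans(n_clusters=K, n_init=1, random_state=seed)
    km.fit(np.vstack([frames for _, frames in train]))

    # Nearest-mean classifier over genre-averaged histograms (a stand-in classifier).
    genre_means = np.stack([
        np.mean([bag_of_tones(f, km) for g, f in train if g == genre], axis=0)
        for genre in range(N_GENRES)])

    correct, total = np.zeros(N_GENRES), np.zeros(N_GENRES)
    for g, frames in test:
        pred = np.argmin(np.linalg.norm(genre_means - bag_of_tones(frames, km), axis=1))
        correct[g] += (pred == g)
        total[g] += 1
    tpr_per_run.append(correct / total)

print(np.round(np.array(tpr_per_run), 2))           # per-genre TPR for each run
```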
We also aimed to integrate the output of our other algorithms with the MFCC feature vector. This involved mapping each clip's detected key signature to a number and appending that number, along with the calculated tempo, to the clip's feature vector.
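A sketch of this feature augmentation is shown below; the helper names, the use of librosa's MFCC and beat-tracking routines, and the key encoding (which ignores mode) are assumptions made for illustration, with the detected key taken as given from the key-detection step.

```python
# Minimal sketch (hypothetical helpers): appending a numeric key signature
# and an estimated tempo to a clip's MFCC-derived feature vector.
import numpy as np
import librosa

KEY_TO_INT = {k: i for i, k in enumerate(
    ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"])}

def extended_features(clip, sr, detected_key):
    mfcc = librosa.feature.mfcc(y=clip, sr=sr, n_mfcc=13)    # shape (13, n_frames)
    mfcc_vector = mfcc.mean(axis=1)                          # summarize over frames
    tempo, _ = librosa.beat.beat_track(y=clip, sr=sr)        # estimated tempo in BPM
    tempo = float(np.atleast_1d(tempo)[0])                   # scalar regardless of librosa version
    key_number = KEY_TO_INT[detected_key]                    # e.g. "G" -> 7
    return np.concatenate([mfcc_vector, [key_number, tempo]])
```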
Given the high run-to-run variability of our results without tempo and key-signature information, we concluded that adding these two features would have very little impact on our results: any effect they might have would likely be masked by that variance, and there is no obvious reason why genre should be correlated with tempo or key signature. Because of limited time and computational power, we could not gather enough data to demonstrate quantitatively that these additions have no impact. However, we did briefly try key signature as a feature and observed no substantial improvement or degradation in our scores.
Our classification results were barely better than random chance, and in some cases worse, especially for folk music. Although some runs gave favorable results, these were offset by poor results on others. We attribute this to a training set that was too small and too biased: we chose only 25 songs per genre, and many of them were by the same artists, making the songs within a genre unrepresentatively similar. The sample was too small to extract an accurate model of the three genres, so our results depended heavily on the randomness of the k-means cluster initialization and were largely unreliable.
The next step for this code would be to expand the training set. We would need enough data for the k-means algorithm to converge reliably to the same clusters regardless of the initialization points. This should lower the variance of our accuracy rates, so that variance itself can serve as a metric for the uniformity of convergence, as sketched below. Once the bag-of-tones method alone produces reliable results, we can more robustly explore appending the outputs of the other algorithms to the MFCC features.
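As an illustration of that metric, the sketch below computes the per-genre spread of true positive rates across runs; the numbers are placeholders, not measured results.

```python
# Minimal sketch (placeholder numbers, not measured results): using the spread
# of per-run true positive rates as a rough metric for uniformity of convergence.
import numpy as np

# Per-genre true positive rates from six hypothetical repeated runs (rows = runs).
tpr_per_run = np.array([
    [0.45, 0.60, 0.30],
    [0.55, 0.40, 0.50],
    [0.35, 0.65, 0.25],
    [0.60, 0.50, 0.45],
    [0.40, 0.55, 0.35],
    [0.50, 0.45, 0.55],
])

print("mean TPR per genre:", tpr_per_run.mean(axis=0))
print("std  TPR per genre:", tpr_per_run.std(axis=0, ddof=1))  # lower std -> more uniform convergence
```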