Music & Machine Learning

Work in progress, but I'm hoping to make scripts public as soon as I think they will be useful for others. I may also upload the feature extraction data.

My github repository for this project:


So I’ve been doing this as a side project (by day I’m an astronomer), and I just wanted to play around with it for fun and see what (if anything) came out. Since it turned out that it is actually working, and there is interest from a number of you folks out there, I want to put the initial code and some results online in the name of open science. At the very least I hope this little piece of work inspires you in some way (whether that is with music, machine learning, programming, or research in general). If you have cool ideas feel free to get in touch! It’s all about sharing knowledge. I’ve got many ideas for directions to take this in. As a side note, I'm also working with some master's students on related projects, using machine learning to predict what music should be used for web or TV advertisements. For now I will just focus on my own work, and once they have results I will definitely add their findings here and credit them.

A more involved introduction:

Isn’t it incredible how music stimulates our brains, and how it sparks intense discussions? We all have specific genres or artists we have long-term attachments to. Some people declare their love for specific genres and stick to listening to what they are familiar with, or what’s in the ‘popular domain’. Others have much broader musical tastes, declaring that there simply isn’t enough time in the day to explore all the music out there.

As humans, our interpretation of music is incredibly biased on many levels. As a simple example, we can enjoy a band more simply because we like their personality traits, appearance, or even their band name. That’s fine, inspiration is a great thing, but in this study the point is to filter that out and explore the music and the music only.

I’m doing this simply because I’m a bit obsessed with all things audio related. But I’m also fascinated by just how biased we are as humans, and I want to explore ways to present data that make us put more thought into ourselves and the world. The last two sentences could be applied to a whole host of hard-line topics, but in this rather light-hearted case, specifically for music!

My goal is to use machine learning (basically maths and statistics), to quantify different sounding songs or genres of music. I want to do this solely from processing the song, using no user listening data, no artist information and no meta-data.

So, what is a genre of music? Pop? Rock? Opera? Classical? Dance? Hiphop? Instrumental? Can we hope to quantify sub-genres like heavy metal or post-rock bla bla?

If you detach yourself from your musical biases just for a second, it doesn’t take much thought to realise that pop (e.g. Britney Spears) and rock (e.g. Foo Fighters) are actually very similar: singers, typical band instruments, 3-4 minutes long, similar song structures etc. On the other hand, whilst I can wrap up so many sub-genres into ‘classical music’, hopefully you can see that it is totally different sounding music altogether.

So how far can we go with this in terms of getting an algorithm to tell the difference between musical genres? Given that most popular music has very similar features (but vastly different audiences), it will probably be much harder to get a computer to tell pop genres apart. But, as you will see, this is much easier for distinctly different music like classical. Any step towards quantifying what we, as humans, think is ‘good music’ is fascinating regardless.


I’m going to take a handful of artists (biased by the fact that I have purchased their CDs), and extract what I call ‘features’ from their songs. I have picked 55 features to start with. Later, when I do supervised learning, I will remove (prune away) the ones which are not useful, but for now they are all used in the following clustering example. For example, what is the beats per minute (BPM)? What is the spectral bandwidth? Spectral centroid? Zero crossing rate (ZCR)? RMS energy, standard deviation, skew and kurtosis in specific frequency bands? (I split the song into 5 frequency bands from 30 Hz to 10 kHz on a log scale). I also split the song into harmonic and percussive components (check out how here), which proves useful for quantifying whether a track is drum or melody heavy. In general I hope my feature names are somewhat informative, but they mostly come from this Librosa page, so do check it out if you want to explore them (or look at my code or data files on Github). Most of these features average the song in time, but in another more recent iteration of this code, a summer student worked with me to develop loads more features, particularly analysing variations in the music on different time-scales (will try and get this code up on git soon also!).
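To make a few of these features concrete, here is a minimal sketch in plain NumPy, run on a synthetic 440 Hz tone rather than a real song. It is not the extraction code behind this project (most of those features come from Librosa); the function names, the sample rate and the FFT-masking approach to the frequency bands are all assumptions made for illustration.

```python
import numpy as np

SAMPLE_RATE = 22050  # assumed sample rate in Hz


def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs where the sign flips."""
    signs = np.signbit(signal)
    return float(np.mean(signs[1:] != signs[:-1]))


def rms_energy(signal):
    """Root-mean-square amplitude of the signal."""
    return float(np.sqrt(np.mean(signal ** 2)))


def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the whole signal, in Hz."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitudes) / np.sum(magnitudes))


def log_band_edges(low_hz=30.0, high_hz=10000.0, n_bands=5):
    """Edges of n_bands frequency bands, evenly spaced on a log scale."""
    return np.geomspace(low_hz, high_hz, n_bands + 1)


def band_rms(signal, sample_rate, edges):
    """RMS energy in each band, isolating the band by zeroing other FFT bins."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    energies = []
    for low, high in zip(edges[:-1], edges[1:]):
        in_band = (freqs >= low) & (freqs < high)
        band_signal = np.fft.irfft(np.where(in_band, spectrum, 0), n=len(signal))
        energies.append(rms_energy(band_signal))
    return np.array(energies)


# One second of a pure 440 Hz tone as a stand-in for a song.
t = np.arange(SAMPLE_RATE) / SAMPLE_RATE
tone = np.sin(2 * np.pi * 440.0 * t)

edges = log_band_edges()
band_energies = band_rms(tone, SAMPLE_RATE, edges)

print("ZCR:", zero_crossing_rate(tone))
print("RMS:", rms_energy(tone))
print("Centroid (Hz):", spectral_centroid(tone, SAMPLE_RATE))
print("Band edges (Hz):", np.round(edges, 1))
print("Band RMS:", np.round(band_energies, 3))
```

For a pure tone nearly all of the energy lands in the single band containing 440 Hz; a real song spreads energy across the bands, and that spread is the kind of thing the per-band statistics try to capture.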

Initial results

Now I have a list of numbers quantifying different things about the song, I turn to machine learning. The simplest thing to do is unsupervised clustering. This means we don’t tell the computer to look for anything in particular (unsupervised), and we just get it to cluster songs together that have similar numbers relating to their features. In an overly simplified example, it could put all songs with a BPM around 150 close together, and far away are a bunch of songs with a BPM of around 60. But with many features it can decide which are more important, weight those more, and the combination becomes extremely powerful.

Once I had the features extracted and suitably organised from a data management point of view, I tried some unsupervised clustering. Unfortunately, humans cannot visualise data sets in more than 3 dimensions, so I need to display this 55 dimensional data set in 3 dimensions, and ideally 2 (so far 3D plots don't give a much better perspective unless you spend ages rotating them to get a feel for the 3D space, so 2D is much easier to start off with). I achieve this by using Singular Value Decomposition (SVD). The Scikit-learn documentation explains this nice and concisely: "This transformer performs linear dimensionality reduction by means of truncated singular value decomposition (SVD)". Basically it collapses the 55 dimensions down into 2 dimensions, keeping as much of the variance in the data as possible. Our 55 features have been reduced to 2 features, where these 2 features are linear combinations of the initial 55 features. Plotting the resulting two dimensional data set gives this:

Clicking on the plot will show you an interactive HTML version, with a few more artists. Remember that when the SVD collapses the data down into 2 dimensions, it does not mean that songs overlapping each other are the same, because in 3 or 4 or 55 dimensions the data points may not overlap. Later on this will become clearer when I train a computer to recognise different artists, which it can do with surprising accuracy, despite the above plot looking a bit cramped in places! I will try and explain this better in the coming sections.
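The SVD step described above can be sketched with NumPy alone. This uses random numbers as a stand-in for the real song-by-feature table, and `np.linalg.svd` rather than the Scikit-learn transformer quoted above. Whether the features are scaled before the decomposition is not stated here, so the standardisation step is an assumption of this sketch, as are all the names.

```python
import numpy as np

rng = np.random.default_rng(0)
n_songs, n_features = 200, 55

# Synthetic stand-in for the feature table: one row per song, one column per
# feature, with deliberately mismatched scales (think BPM versus ZCR).
features = rng.normal(size=(n_songs, n_features)) * rng.uniform(0.5, 50.0, size=n_features)


def reduce_to_2d(matrix):
    """Project rows onto the two directions of largest variance via SVD."""
    # Assumption: standardise each feature so no single one dominates.
    scaled = (matrix - matrix.mean(axis=0)) / matrix.std(axis=0)
    u, s, vt = np.linalg.svd(scaled, full_matrices=False)
    coords = u[:, :2] * s[:2]            # 2D position of every song
    components = vt[:2]                  # weights of the 55 features in each axis
    explained = s[:2] ** 2 / np.sum(s ** 2)  # share of variance kept per axis
    return coords, components, explained


coords, components, explained = reduce_to_2d(features)
print(coords.shape, components.shape, np.round(explained, 3))
```

Each row of `components` holds the weights of the linear combination of the 55 features that makes up one plot axis, and `explained` shows how much of the total variance survives the squeeze down to 2D. When that share is small, overlapping points in the 2D plot say little about how close the songs really are.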

t-SNE: going non-linear

The above plot is a little cramped, as you will have seen if you viewed the HTML version on Github. So, what happens if we use a non-linear algorithm? It has been widely shown that t-distributed stochastic neighbor embedding (t-SNE) can reveal structure where linear methods cannot. Rather than maximising variance, it tries to keep songs that are close neighbours in the high dimensional space close together when the data set is reduced down to 2 dimensions. Repeating the steps above we get the plot below (do play with the interactive versions!), which gives a much more effective separation of the artists in 2 dimensions, as compared to SVD.

Apologies for the changing colour schemes, as I'm doing this all in an automated fashion without manually setting colours! Depending on which artists are included the cluster plot will look different. The best way to explore is to toggle artists by clicking on the legend :)

The Flashbulb: single artist study

It has become easier to test t-SNEs on an artist I am very familiar with, because, at the end of the day I need to know if the computer has clustered songs sensibly or not. So, let's take a second to delve into The Flashbulb's music. I've taken all 43 (!) of his albums (560 songs!), and tried adjusting parameters in the t-SNE to get the best results (having one artist make this much music is an awesome data set). Specifically I am changing the learning rate, controlling how quickly it groups similar songs, and another hyper-parameter called 'Perplexity', which controls how to cluster small groups with large groups (or how many nearest neighbours each point is expected to have). This is a great example of how to understand t-SNE a bit more.

As the docs state, t-SNE has a cost function that is not convex, i.e. with different initialisations we can get different results every time we run it (shown by the two plots below). I'm trying to get the t-SNE to group distinctly different types of song, so I want to avoid low learning rates, which would cluster all songs too close to each other, and high learning rates, which create too many distinct islands and eventually fail to cluster anything. Getting the learning rate (and perplexity) just right will enable it to distinguish between broad genre changes without creating isolated islands, such that we see a more gradual transition from his solo piano compositions to his glitchy drum compositions. An understanding of his music will probably help you interpret it more, but I will do this for a bunch of artists later on (some are on my Github page already).
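As a rough illustration of the two knobs discussed above, the sketch below runs Scikit-learn's `TSNE` twice on synthetic data with the same perplexity and learning rate but different random initialisations. The data, the parameter values and the `embed` helper are assumptions for the example, not the settings used for the plots here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(1)

# Three synthetic "styles" of 30 songs each, in a 55 dimensional feature space.
centres = rng.normal(scale=5.0, size=(3, 55))
songs = np.vstack([centre + rng.normal(size=(30, 55)) for centre in centres])


def embed(data, perplexity, learning_rate, seed):
    """Run t-SNE down to 2D with a random starting layout."""
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,        # roughly, expected number of neighbours per song
        learning_rate=learning_rate,  # step size of the optimisation
        init="random",                # random start, so the seed changes the layout
        random_state=seed,
    )
    return tsne.fit_transform(data)


# Same hyper-parameters, different initialisations.
run_a = embed(songs, perplexity=15, learning_rate=200.0, seed=0)
run_b = embed(songs, perplexity=15, learning_rate=200.0, seed=1)

print(run_a.shape, run_b.shape)
print("Identical layouts?", np.allclose(run_a, run_b))
```

The two runs place the points differently because the non-convex cost function settles into a different minimum each time. The groupings should still broadly agree, which is what to compare between runs rather than the absolute positions.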

In the plot below, the bottom right is his space music, piano albums, and ambient works. Moving from there along to the left and upwards, we find most of the songs on his major album releases such as "Nothing Is Real" and "Opus At The End Of Everything". From there we quickly transition back to the right, where a large clump forms containing glitchier drum albums such as "Hardscrabble", which still contain a lot of melody. Finally, moving up to the left we go towards his most glitchy and abstract releases under "Acidwolf", "Dr. Lefty" and "Human Action Network", and the bulk of tracks from albums such as "Red Extensions Of Me" and "Resent And The April Sunshine Shed". It is very interesting to note albums such as "Soundtrack To A Vacant Life", which contains songs covering much of this entire space, perhaps contributing to the large appeal that album had nearer the start of The Flashbulb's career (who knows?). Also note the spread in the style of albums like "Arboreal", and his most recent release "Piety Of Ashes". This is probably starting to make less sense if you don't know The Flashbulb, but if you do, I recommend you click on the plot and explore the interactive version. Toggling albums in the legend is the best way to explore, it's fascinating! Note that I varied the Perplexity parameter in the t-SNE to achieve the two different shaped plots - the one on the right is basically the same but without the bends! Note - I had trouble selecting 43 distinguishable colours... so they repeat... toggling them on/off in the legend is the best way to explore where individual albums sit.

You can check out similar plots for Avril Lavigne (run1, run2), where I did two runs to show the non-convex cost function giving different looking plots. Still, you get the same result (if you're familiar with Avril's songs), just displayed differently. Obviously she has less music, and generally sticks to a specific genre, so there is less diversity in the plot than for The Flashbulb.

Quantifying similar artists

So let's take our original 55 dimensional data set, not reducing the dimensionality this time, and see what we can understand from it. In the above plot you might find it easy to visualise that classical music is far away from Taylor Swift's music, because the songs are literally far apart in the plot. We could quantify this by reading off the distance on the axis. But we can do much, much better in the original 55 dimensional space. In 55 dimensions, each song has its own location, so a distance can be measured to every other song. To start off with I average each artist over all their songs, so now I'm looking at artists overall (in the above plot the blue points would be replaced by one dot in the centre of that cluster, for example). Now I can see which artists are similar by seeing how far they are away from each other in my 55 dimensional space. I get a matrix where each artist is compared against every other artist. If they are similar the pair's score is close to zero (or exactly zero if it's the same artist - like along the diagonal); if the artists are very different, the score is high. The plot is symmetrical about the diagonal black line. There is a lot of bias in this plot because I'm just using whatever artists I have at least a few albums of. Whilst it can clearly show that, on average, Pendulum is different to Pink Floyd, we are going to have to do some more clever things to get it to distinguish between Taylor Swift and Disturbed. However, note that, for example, Jamiroquai does have a better similarity score to Taylor than Disturbed has, so it's generally pretty good!
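The artist-versus-artist matrix described above can be sketched as follows. The distance measure is not named in the text, so plain Euclidean distance between the per-artist mean feature vectors is an assumption here, and the three artists are made-up synthetic data.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical artists: each array is (songs, 55 features).
songs_by_artist = {
    "artist_a": rng.normal(loc=0.0, size=(40, 55)),
    "artist_b": rng.normal(loc=0.5, size=(25, 55)),  # fairly close to artist_a
    "artist_c": rng.normal(loc=3.0, size=(60, 55)),  # far from both
}


def artist_distance_matrix(features_by_artist):
    """Average each artist over their songs, then measure all pairwise distances."""
    names = sorted(features_by_artist)
    means = np.array([features_by_artist[name].mean(axis=0) for name in names])
    # Broadcasting gives every (artist i, artist j) difference vector at once.
    differences = means[:, None, :] - means[None, :, :]
    distances = np.sqrt(np.sum(differences ** 2, axis=-1))
    return names, distances


names, distances = artist_distance_matrix(songs_by_artist)
print(names)
print(np.round(distances, 2))
```

The result is symmetric with zeros along the diagonal, matching the description above: small values mean similar artists, large values mean very different ones. With real features on very different scales, the features would probably need normalising first so that one large-valued feature does not dominate the distance.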

Supervised Learning

Here I split the data up into two chunks, one to train a computer with (60% in all the following plots), and one to test it on afterwards (40%). I'll use a random forest, and prune away some of the less useful features before using the remaining features for the classification. A lot of artists don't have much data (songs) to train on, so we'll see them having poorer scores.
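A minimal sketch of this workflow with Scikit-learn is shown below: a 60/40 split, a random forest, and a simple pruning step. The data is synthetic, and the pruning rule (keep features with above-average importance) is an assumption chosen for illustration; the exact rule used for the plots is not given here.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
n_per_artist, n_features = 100, 55

# Two hypothetical artists that differ only in the first 5 features;
# the other 50 features are pure noise.
artist_a = rng.normal(size=(n_per_artist, n_features))
artist_b = rng.normal(size=(n_per_artist, n_features))
artist_b[:, :5] += 3.0

X = np.vstack([artist_a, artist_b])
y = np.array(["artist_a"] * n_per_artist + ["artist_b"] * n_per_artist)

# 60% of songs to train on, 40% held back for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, stratify=y, random_state=0
)

# First pass: fit on all features to find out which ones matter.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)

# Prune: keep only features with above-average importance (assumed rule).
keep = forest.feature_importances_ > forest.feature_importances_.mean()

# Second pass: classify using the surviving features only.
pruned_forest = RandomForestClassifier(n_estimators=200, random_state=0)
pruned_forest.fit(X_train[:, keep], y_train)

accuracy = pruned_forest.score(X_test[:, keep], y_test)
print("Features kept:", int(keep.sum()), "of", n_features)
print("Test accuracy:", accuracy)
print(classification_report(y_test, pruned_forest.predict(X_test[:, keep])))
```

The `stratify=y` argument keeps the proportion of each artist the same in the training and test chunks, which matters when some artists have far fewer songs than others.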

A simple example is only comparing Classical music with Taylor Swift. Can a computer tell the difference? Yes, very easily. Even when only training on 60% of the data we get an almost perfect prediction. (Support is how many songs we used to test on in the prediction, so support/0.4 = total songs, since we're testing on 40% of the data).
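For anyone unfamiliar with the columns in these score tables, the sketch below computes precision, recall and support by hand for a tiny made-up test set, and recovers the approximate library size from the support. The labels and the helper function are hypothetical.

```python
TEST_FRACTION = 0.4  # 40% of songs are held back for testing


def per_class_scores(true_labels, predicted_labels, label):
    """Precision, recall and support for one class."""
    pairs = list(zip(true_labels, predicted_labels))
    true_pos = sum(t == label and p == label for t, p in pairs)
    false_pos = sum(t != label and p == label for t, p in pairs)
    false_neg = sum(t == label and p != label for t, p in pairs)
    support = true_pos + false_neg  # test songs that really belong to this class
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / support if support else 0.0
    return precision, recall, support


# Made-up test set: 6 classical songs and 4 Taylor Swift songs,
# with one mistake in each direction.
true_labels = ["classical"] * 6 + ["taylor"] * 4
predicted_labels = ["classical"] * 5 + ["taylor"] + ["taylor"] * 3 + ["classical"]

for label in ("classical", "taylor"):
    precision, recall, support = per_class_scores(true_labels, predicted_labels, label)
    estimated_total = round(support / TEST_FRACTION)
    print(label, round(precision, 3), round(recall, 3), support, estimated_total)
```

Precision answers "of the songs labelled as this artist, how many really were?", while recall answers "of this artist's songs, how many did we find?".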

Let's make it harder: can it tell the difference between Taylor Swift and Avril Lavigne? Yes, but not so perfectly. In fact our precision and recall are very good (much better than random) considering how similar their music is (female singers, pop music, etc). The feature pruning made the accuracy a little worse, but not by anything significant.

OK so now I'll throw a whole bunch of artists into the picture. Overall the precision/recall is very good considering the opportunities to confuse artists with each other.

There are many important points to make here. Classical music and The Flashbulb perform the best, as they both have very large training sets compared to the other artists. Classical music is very musically distinct from most other artists in this run (at least our features are very sensitive to this), and we have a large training set, both making for high precision and recall scores. The Flashbulb, however, writes music over numerous genres, but because we have such a large training set we get high precision and recall scores - in fact, The Flashbulb's precision score suffers because we are accidentally classifying other artists' music as The Flashbulb, which makes sense given the wide range of genres in his music. If a song really is by The Flashbulb, we have a 95% recall rate (we don't confuse his music with others).

In general, my features are not sensitive enough to achieve scores above 90% for pop music. So next up I'm exploring alternative ways to extract features, and more effectively train a computer on artists who haven't written so much music. For example, I'm looking at splitting songs up into mini songs, to bulk up the training set, and also exploring more complex feature extraction. Ultimately, I'm working on a simple vocal recognition algorithm, which should massively improve the scores for artists who have a main singer.
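The "mini songs" idea can be sketched as below: chop the raw audio into fixed-length segments and treat each one as its own training example. This is only an illustration of the idea; the 30 second segment length, the sample rate and the function name are assumptions, and random noise stands in for real audio.

```python
import numpy as np


def split_into_segments(signal, sample_rate, segment_seconds):
    """Cut a 1D audio signal into equal-length segments, dropping any remainder."""
    segment_length = int(segment_seconds * sample_rate)
    n_segments = len(signal) // segment_length
    trimmed = signal[: n_segments * segment_length]
    return trimmed.reshape(n_segments, segment_length)


sample_rate = 22050
# 95 seconds of noise as a stand-in for one song.
song = np.random.default_rng(4).normal(size=sample_rate * 95)

segments = split_into_segments(song, sample_rate, segment_seconds=30)
print(segments.shape)  # 3 full segments; the last 5 seconds are dropped
```

One thing to watch with this approach: segments cut from the same song are very similar to each other, so all segments of a song should go into either the training set or the test set, never both. Otherwise the test scores will look better than they really are.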

Now let's throw in Boards of Canada

We all love Boards of Canada. They've got a fair bit of music to train on (144 songs - more than most popular music artists, but not as much as The Flashbulb or Classical). However, their music generally does not perform so well in my set up. Booooo. If you know Boards you can probably already guess why it doesn't perform well in this comparison - the classifier is confusing their music a lot with Classical and The Flashbulb (hence their recall score is particularly bad). It has also now dragged down the precision of Classical and The Flashbulb, given that BoC songs are now sometimes misclassified as The Flashbulb.

Xmas Song Classification!

It's almost Xmas, so why not explore some Xmas music? Turns out it's fairly difficult, but it performs better than expected given the diversity of Xmas songs out there. I have amassed a library of 300 Christmas songs, but having such a good training set doesn't help massively. We are confusing Xmas music with Classical, and confusing Pink Floyd and REM with Xmas music a lot of the time.... :/ Still, it works fairly well, and much better than artists with small training sets!

The feature importances plot shows the 23 features left after pruning, used in the example above with the Xmas music. In particular, the average RMS energy and its standard deviation, skew and kurtosis in specific frequency bands have been removed (compared to the plot I showed earlier), as in this case that information is either not as useful or already captured in other features (such as those parameters over the whole frequency axis).

Feature analysis gets very complex, and needs a dedicated section, which is coming in the future. In general we want to use as many features as are useful. But we don't want to start using hundreds or thousands of features, as this can hinder a classifier (or it simply takes too long to run). I'm working on implementing a nifty feature selection method which a team in our department at Jodrell Bank use for classifying pulsar signals in surveys. But for now, this basic feature pruning is adequate, and in general only subtly affects the accuracy/precision/recall scores.

Another question we might ask is: how reliable are these numbers? When we compare different sets of artists we do get different results. I've discussed how adding in artists which are too similar can affect the scores, but take a second to look at Zero7's scores with/without BoC and Xmas music. In general Zero7's scores are really poor, as we just don't have enough of their music to train on. But note that in the example with BoC, their scores were particularly bad, despite absolutely no confusion with BoC's music. This indicates that our model for Zero7 is extremely tentative, and since we're talking in a statistical sense here, we need to look at things like cross-validation and the log-loss metric to show that Zero7's scores are not reliable, even when they may seem OK at first glance. I'll walk through cross-validation and log-loss metrics in another section. But you can basically see that for artists with small training sets the classification is not as reliable.
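Until that section exists, here is a small hedged sketch of why log-loss can expose a shaky model that accuracy hides. It uses the standard definition of log-loss (the average negative log of the probability given to the true label); the two sets of predicted probabilities are invented for the example and are not results from this project.

```python
import math


def log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true label of each song."""
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a probability of exactly 0 or 1 cannot blow up the log.
        p_true = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p_true)
    return total / len(true_labels)


true_labels = ["zero7", "zero7", "boc"]

# Invented outputs from two classifiers. Both pick the right artist every time,
# but the second one is barely more sure than a coin flip.
confident = [
    {"zero7": 0.90, "boc": 0.10},
    {"zero7": 0.90, "boc": 0.10},
    {"zero7": 0.20, "boc": 0.80},
]
hesitant = [
    {"zero7": 0.55, "boc": 0.45},
    {"zero7": 0.60, "boc": 0.40},
    {"zero7": 0.49, "boc": 0.51},
]

confident_loss = log_loss(true_labels, confident)
hesitant_loss = log_loss(true_labels, hesitant)
print(round(confident_loss, 3), round(hesitant_loss, 3))
```

Both classifiers score 100% accuracy on these three songs, yet the hesitant one has a much higher log-loss. A model trained on very few songs tends to look like the hesitant one, which is one way its scores can seem fine at first glance while still being unreliable.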

Overall, my basic features give really good predictions for artists! Hopefully by spending some time investigating the features more I can improve this further. By implementing some advanced neural networks on spectrograms of the songs themselves (TensorFlow, CNTK), for example through Keras, it's likely I will get much better results.

More cool stuff coming soon!