Strategy 1 - At each time sample, the maximum-energy note is chosen. Make a zero array and, for each frame, set a 1 at the index of the strongest note in the Q-transform array. Then sum it up in the time direction. Next, merge all the octaves together to obtain a single octave, and sort the resulting 12-element array. An octave has 12 notes, so removing the 5 least-occurring values leaves the 7 notes of the scale. By considering the notes of that scale we can obtain the key.
Applying Constant Q Transform
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("/content/shape-of-you-official-music-video.wav")
C = np.abs(librosa.cqt(y, sr=sr))  # 84 bins by default: 7 octaves x 12 semitones

fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(C, ref=np.max),
                               sr=sr, x_axis='time', y_axis='cqt_note', ax=ax)
ax.set_title('Constant-Q power spectrum')
fig.colorbar(img, ax=ax, format="%+2.0f dB")
Making the maximum occurring note matrix
maxind = np.argmax(C, 0)            # index of the strongest CQT bin in each frame
array = np.zeros(C.shape)
for i in range(len(maxind)):        # mark the strongest note of each frame with a 1
    array[maxind[i]][i] = 1         # or C[maxind[i]][i] to keep the magnitude instead
print(maxind.shape)
print(array.shape)
print(max(array[1]))                # sanity checks
print(max(maxind))
fig, ax = plt.subplots()
img = librosa.display.specshow(librosa.amplitude_to_db(array, ref=np.max),
                               sr=sr, x_axis='time', y_axis='cqt_note', ax=ax)
ax.set_title('Maximum-note matrix')
fig.colorbar(img, ax=ax, format="%+2.0f dB")
Merging all the octaves into a single octave
x = np.sum(array, 1)               # how often each of the 84 bins wins
keys = np.reshape(x, (7, -1))      # 7 octaves x 12 pitch classes
chords = np.sum(keys, 0)           # fold the 7 octaves into a single octave
print(chords.shape)
index = np.linspace(1, 12, 12)     # note numbers 1..12 (1 = C, ..., 12 = B)
D = np.stack((index, chords), 1)
D = D[D[:, 1].argsort()]           # sort notes from least played to most played
plt.plot(index, chords)
plt.show()
print(D)
Let's compare the results
By considering the above numbering pattern (1 = C, 2 = C#, ..., 12 = B), the notes detected by our algorithm are F#, A, C#, B, E, G# and G. But G is an outlier; D# should be there instead.
Notes = D[5:12, 0]  # keep the 7 most played notes (drop the 5 least used)
print(Notes,'\n')
scales = [[1,3,5,6,8,10,12],
[2,4,6,7,9,11,1],
[3,5,7,8,10,12,2],
[4,6,8,9,11,1,3],
[5,7,9,10,12,2,4],
[6,8,10,11,1,3,5],
[7,9,11,12,2,4,6],
[8,10,12,1,3,5,7],
[9,11,1,2,4,6,8],
[10,12,2,3,5,7,9],
[11,1,3,4,6,8,10],
[12,2,4,5,7,9,11]]
matches = []
Notes_as_set = set(Notes)
for i in range(12):  # compare the detected notes with each of the 12 major scales
    intersection = Notes_as_set.intersection(scales[i])
    matches.append(len(intersection))  # number of matching notes
    print(list(intersection))
keys = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
KEY_of_the_SONG = keys[np.argmax(matches)]
print(matches, '\n')
print(KEY_of_the_SONG)
It finds the most played notes and compares them with all the scales. The scale with the highest number of intersections is the winner!!
Most songs contain a lot of percussion, which doesn't support any pitch. I tried a filter to remove the lower parts, but the problem with percussion is that it spans the entire spectrum, so it cannot be removed this way.
# filter
octaves = 4                       # number of CQT octaves to keep
print(C.shape)
#C = C[12*(7-octaves):]           # highpass: keep only the top octaves
C = C[:12*octaves]                # lowpass: keep only the bottom octaves
print(C.shape)
Sometimes the correct key might be matched as the second-best approximation because of the nonlinearity mentioned above. The piece of code below finds the second best.
best = np.argmax(matches)
rest = np.delete(matches, best)                           # match counts without the winner
keySECOND = np.delete(keys, best)                         # key names without the winner
nextBestKEY = keySECOND[np.where(rest == max(matches))]   # other keys tying the winner's score
print("josh pan,Dylan Brady = ", KEY_of_the_SONG)
print("other key suggestions - ", nextBestKEY)
There's a lot of nonlinearity in music when it comes to analyzing it from an engineering POV. But I have a better idea to take this to a new level. The issue with most songs is that they contain non-harmonic instruments, for example drums, which don't produce a pitch in a scale but capture the entire frequency spectrum. Maybe once we figure out audio source separation we might be able to separate the harmonic instruments and detect the key correctly.
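As a stop-gap before full source separation, librosa's built-in harmonic-percussive separation (HPSS) can strip much of the drum energy. This is a minimal sketch on my part, not part of the original pipeline: it feeds only the harmonic component into the CQT before the key-matching steps above.

import numpy as np
import librosa

y, sr = librosa.load("/content/shape-of-you-official-music-video.wav")

# Split the signal into harmonic and percussive components (HPSS)
y_harmonic, y_percussive = librosa.effects.hpss(y)

# Run the CQT on the harmonic part only, so drums pollute the
# note histogram less before the scale matching
C = np.abs(librosa.cqt(y_harmonic, sr=sr))
print(C.shape)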
New Approach
Earlier we considered constant-Q transforms, but in this strategy we try chroma features.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt
path = "/content/drive/Shareddrives/G-33-2022/Audios/Fully mixed song for us to separate and listen/Billie Eilish - bury a friend.wav"
name ="Billie Eilish - bury a friend"
y, sr = librosa.load(path)
#librosa.feature.melspectrogram(y=y, sr=sr)
S = np.abs(librosa.stft(y, n_fft=4096))**2
chroma = librosa.feature.chroma_stft(S=S, sr=sr)
fig, ax = plt.subplots(nrows=2, sharex=True)
img = librosa.display.specshow(librosa.amplitude_to_db(S, ref=np.max),
                               y_axis='log', x_axis='time', ax=ax[0])
fig.colorbar(img, ax=[ax[0]])
ax[0].label_outer()
img = librosa.display.specshow(chroma, y_axis='chroma', x_axis='time', ax=ax[1])
fig.colorbar(img, ax=[ax[1]])
Then the code changes as follows, because we no longer have to fold the 84 notes into a single octave ourselves; the built-in chroma function already does that.
x = np.sum(chroma, 1)             # total energy per pitch class
index = np.linspace(1, 12, 12)    # note numbers 1..12 (1 = C, ..., 12 = B)
D = np.stack((index, x), 1)
D = D[D[:, 1].argsort()]          # sort notes from least played to most played
plt.plot(index, x)
plt.show()
What if many keys match one song?
Let's consider Billie Eilish's song 'bury a friend'. The code below does the usual process, matching the most played notes against the standard scales.
Notes = D[5:12, 0]  # keep the 7 most played notes (drop the 5 least used)
print(Notes,'\n')
scales = [[1,3,5,6,8,10,12],
[2,4,6,7,9,11,1],
[3,5,7,8,10,12,2],
[4,6,8,9,11,1,3],
[5,7,9,10,12,2,4],
[6,8,10,11,1,3,5],
[7,9,11,12,2,4,6],
[8,10,12,1,3,5,7],
[9,11,1,2,4,6,8],
[10,12,2,3,5,7,9],
[11,1,3,4,6,8,10],
[12,2,4,5,7,9,11]]
matches = []
Notes_as_set = set(Notes)
for i in range(12):  # compare the detected notes with each of the 12 major scales
    intersection = Notes_as_set.intersection(scales[i])
    matches.append(len(intersection))  # number of matching notes
    print(list(intersection))
keys = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
KEY_of_the_SONG = keys[np.argmax(matches)]
max_value = max(matches)
max_index = np.where(np.array(matches) == max_value)  # every scale that ties for the best score
print('\n', matches, '\n')
print(name," = ",KEY_of_the_SONG)
"Billie Eilish bury a Friend" Detected most played notes
The figure shows how the 12 scales match the song. To pick the best match, we count the number of elements in each intersection, on the assumption that the scale sharing the most notes with the computer-detected notes (the figure above) is the scale of the song.
But you can see that there are 4 scales with 5 matching notes each. To choose between these scales we implement a weighting system that calculates a score for each of the 4 candidates and picks the best one.
The detected most played notes are ordered from least played to most played. We give a higher weight to the most played notes and a lower weight to the least played ones, and calculate a score for each scale's intersection with the detected notes. We choose the scale with the highest score, as sketched below.
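The exact weights aren't given above, so the following is a minimal sketch of one plausible scheme, reusing D and scales from the code above; the linear weights 1..7 (least played to most played) are my assumption, not the original implementation.

import numpy as np

# D is sorted least played -> most played, so D[5:12, 0] holds the 7 kept notes
Notes = D[5:12, 0]

# Assumed weights: 1 for the least played kept note, up to 7 for the most played
weights = {int(note): w for w, note in enumerate(Notes, start=1)}

scores = []
for scale in scales:
    # score = sum of the weights of the detected notes that appear in this scale
    scores.append(sum(weights.get(note, 0) for note in scale))

keys = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']
print("weighted best key:", keys[int(np.argmax(scores))])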
Minor key detection - As there's always a relative major for every minor key, we can't specifically identify whether a song is minor or major. But luckily for us, most of the time pure (natural) minor scales are not used; what is used are the melodic and harmonic minor scales. Hence adding these as new scales and labeling them as minor is the best option we can go for.
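A minimal sketch of that extension, generating harmonic and melodic minor patterns in the same 1..12 numbering used by the tables above; the interval patterns are standard music theory, but wiring them in as labeled (name, notes) pairs is my own choice of structure, not the original code.

# Interval patterns in semitones above the root (root included)
MAJOR          = [0, 2, 4, 5, 7, 9, 11]
HARMONIC_MINOR = [0, 2, 3, 5, 7, 8, 11]
MELODIC_MINOR  = [0, 2, 3, 5, 7, 9, 11]  # ascending form

note_names = ['C','C#','D','D#','E','F','F#','G','G#','A','A#','B']

def make_scale(root, pattern):
    # root is 1..12 as in the tables above; wrap around the octave
    return [((root - 1 + step) % 12) + 1 for step in pattern]

labeled_scales = []
for root in range(1, 13):
    name = note_names[root - 1]
    labeled_scales.append((name + ' major', make_scale(root, MAJOR)))
    labeled_scales.append((name + ' harmonic minor', make_scale(root, HARMONIC_MINOR)))
    labeled_scales.append((name + ' melodic minor', make_scale(root, MELODIC_MINOR)))

# The matching loop then runs over all 36 labeled scales instead of 12:
# matches = [(label, len(set(Notes) & set(notes))) for label, notes in labeled_scales]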