For our second method of key signature detection, we used the cross correlation function to quantify the presence of each note in an input signal. The cross correlation of two signals is large in magnitude where the signals line up closely and small where they differ. Instead of isolating frequency components, we used MATLAB to develop an algorithm that evaluates the presence of specific notes in a song.
Note Frequencies
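We started our algorithm by identifying the notes we wanted to detect. We chose 96 notes (C1-B8), spanning eight octaves of twelve notes each, which cover the range used by most songs.
We put the frequencies of these notes into a 12 x 8 matrix (one row per note name, one column per octave) so that each note could be tracked by position later in our algorithm.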
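A matrix like this can be sketched in a few lines. The Python below is only an illustration, not the original MATLAB code. It assumes twelve-tone equal temperament with A4 = 440 Hz, which the text does not state, and the names `NOTE_NAMES`, `note_frequency` and `freq_matrix` are hypothetical.

```python
# Assumption: equal temperament tuned to A4 = 440 Hz (not stated in the text).
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def note_frequency(name_index, octave, a4_hz=440.0):
    """Frequency in Hz of a note; name_index 0 = C ... 11 = B."""
    midi = 12 * (octave + 1) + name_index   # MIDI numbering: C4 = 60, A4 = 69
    return a4_hz * 2.0 ** ((midi - 69) / 12)

# 12 rows (note names) x 8 columns (octaves 1 through 8) = 96 notes, C1 to B8.
freq_matrix = [[note_frequency(i, octave) for octave in range(1, 9)]
               for i in range(12)]

f5 = freq_matrix[NOTE_NAMES.index("F")][5 - 1]   # row F, column for octave 5
print(f"F5 = {f5:.2f} Hz")
```

Keeping note names on the rows and octaves on the columns means a parallel 12 x 8 matrix of results can be indexed the same way.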
Once we had the note frequencies in a matrix, we used them to iteratively generate short sine waves at the respective frequencies of these notes. Each sine wave emulates a "sample note" that we could use with the cross correlation function. When we cross correlated these small note samples with an input audio signal, the output would tell us how strongly that note was present in the song. We used this principle to modify the output of the cross correlation function.
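A sample note of this kind can be sketched as a short sine burst. This is a Python illustration rather than the original MATLAB code. The sampling rate `fs` and the burst length `duration_s` are assumed values, since the text does not give the ones actually used.

```python
import math

def sample_note(freq_hz, duration_s=0.05, fs=8000):
    """Short sine burst at freq_hz.

    duration_s and fs are illustrative assumptions, not values from the text.
    """
    n = int(round(duration_s * fs))
    return [math.sin(2 * math.pi * freq_hz * k / fs) for k in range(n)]

f5_sample = sample_note(698.46)   # F5 is roughly 698.46 Hz in equal temperament
print(len(f5_sample), "samples")
```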
From here, we introduce the idea of a "Significance Factor". This is a term I coined to describe a method of analytically evaluating a note's presence in a given song. Please refer to the MATLAB Code below for this section. We start by cross correlating a small sample note with the input signal. We then square the output of the cross correlation in order to make all values non-negative. I chose to divide the result of this operation by 10,000 in order to get more reasonable numbers to work with. Lastly, I average the resulting function to obtain the "significance factor" of that note. We iterate through all 96 notes and store the significance factors in a parallel matrix, where each entry corresponds to its respective note.
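The steps above (cross correlate, square, divide by 10,000, average) can be sketched as follows. This is a minimal pure-Python illustration, not the original MATLAB code. It assumes a direct cross correlation over every lag at which the two signals overlap, and the sampling rate and signal lengths are kept small and hypothetical so the example runs quickly.

```python
import math

def sine(freq_hz, duration_s, fs):
    """Sine wave sampled at fs Hz."""
    n = int(round(duration_s * fs))
    return [math.sin(2 * math.pi * freq_hz * k / fs) for k in range(n)]

def cross_correlate(x, y):
    """Direct cross correlation of x with y, one value per overlapping lag."""
    n, m = len(x), len(y)
    result = []
    for lag in range(-(m - 1), n):
        lo = max(0, -lag)       # first index of y that overlaps x at this lag
        hi = min(m, n - lag)    # one past the last overlapping index of y
        result.append(sum(x[k + lag] * y[k] for k in range(lo, hi)))
    return result

def significance_factor(signal, note, scale=10_000):
    """Mean of the squared cross correlation after dividing by `scale`."""
    r = cross_correlate(signal, note)
    adjusted = [v * v / scale for v in r]   # squaring makes every value >= 0
    return sum(adjusted) / len(adjusted)

FS = 4000                                   # assumed sampling rate (Hz)
f5_input = sine(698.46, 0.2, FS)            # stand-in for an F5 input signal
f5_factor = significance_factor(f5_input, sine(698.46, 0.05, FS))
print(f"F5 sample vs F5 input: {f5_factor:.4f}")
```

The absolute value of the factor depends on the signal lengths and on the divisor, so only comparisons between notes computed the same way are meaningful.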
Suppose we take a 6-second input signal consisting of a sine wave at the frequency of an F5 note. The figure below plots this signal.
When we take the cross correlation of this signal with a sample F5 note from the algorithm, we get the following plot. As we can see, the value of this function is very large in magnitude as the input and sample note share the exact same frequency.
From this, we can compute the adjusted cross correlation function from above, which reduces the magnitude and squares the function to ensure positive values. You may find the plot below. Please note that the magnitude of the function is still very large, and we will use this to record a large significance factor.
However, when we run this function on two notes of similar frequency, or on two notes separated by an octave, we get much lower magnitudes. This selectivity is what makes the cross correlation approach useful here: the output is large only when the sample note matches a frequency present in the input, and it drops sharply for notes at other frequencies. Please see below for examples.
Adjusted Cross Correlation between F6 and F5 input signal
Adjusted Cross Correlation between G5 and F5 input signal
Finally, we can find the significance factor for each note. Please find below a table depicting the significance factors for each note when run against the F5 input signal. As we can see, the F5 significance factor is much greater than that of every other note. This is the principle we use to identify the notes present in each input file.
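The same comparison can be reproduced on a small scale. The Python sketch below is an illustration under assumed parameters, not the original MATLAB code or its results. It scores a handful of candidate notes against a synthetic F5 input and picks the largest significance factor. The sampling rate, signal lengths and candidate list are hypothetical, and the note frequencies are rounded equal-temperament values.

```python
import math

FS = 4000  # assumed sampling rate in Hz, kept low so the example runs quickly

def sine(freq_hz, duration_s):
    n = int(round(duration_s * FS))
    return [math.sin(2 * math.pi * freq_hz * k / FS) for k in range(n)]

def significance_factor(signal, note, scale=10_000):
    """Mean over all overlapping lags of (cross correlation squared) / scale."""
    n, m = len(signal), len(note)
    total = 0.0
    for lag in range(-(m - 1), n):
        lo, hi = max(0, -lag), min(m, n - lag)
        r = sum(signal[k + lag] * note[k] for k in range(lo, hi))
        total += r * r / scale
    return total / (n + m - 1)

# Hypothetical subset of the 96 notes: neighbours of F5 plus the octave above.
CANDIDATES = {"E5": 659.26, "F5": 698.46, "F#5": 739.99,
              "G5": 783.99, "F6": 1396.91}

f5_input = sine(698.46, 0.2)
table = {name: significance_factor(f5_input, sine(freq, 0.05))
         for name, freq in CANDIDATES.items()}

strongest = max(table, key=table.get)
for name, value in table.items():
    print(f"{name:>3}: {value:.4f}")
print("strongest:", strongest)
```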
Once we have a significance factor for each note, we can start matching the results to a key signature. A key signature can include either sharps or flats, so we developed a method to identify a key signature from the significance factors of the notes using an array of 5 logical variables.
[C/C#/Db, D/D#/Eb, F/F#/Gb, G/G#/Ab, A/A#/Bb]
Because there are 5 potential sharps/flats that we can detect, we used boolean comparisons to form two arrays of 5 logical variables: one assuming flats in the key signature and one assuming sharps. For each of the 5 positions, we evaluate whether the natural note's presence is less than or greater than that of its sharp or flat. Naturals return 0 at that position, sharps return 1, and flats return 2.
For example, let's look at F Major. F Major has 1 flat (B Flat) in its key signature. Using the above array key, we can expect an output of [0, 0, 0, 0, 2], as there is one B flat and the other notes are natural. C Major would be [0, 0, 0, 0, 0], and D Major would be [1, 0, 1, 0, 0].
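One way to build the two arrays is sketched below in Python. It is an interpretation, not the original MATLAB code, and it rests on several assumptions. It assumes the significance factors have already been combined into one score per note name. It assumes the sharps array compares each black key with the natural below it and the flats array compares it with the natural above it. It also assumes ties count as natural. The names `BLACK_KEYS` and `accidental_arrays` are hypothetical.

```python
# Each entry: (black key, natural below it, natural above it), in the order of
# the array key [C#/Db, D#/Eb, F#/Gb, G#/Ab, A#/Bb].
BLACK_KEYS = [
    ("C#", "C", "D"),
    ("D#", "D", "E"),
    ("F#", "F", "G"),
    ("G#", "G", "A"),
    ("A#", "A", "B"),
]

def accidental_arrays(score):
    """Return (sharps, flats) arrays from a dict of per-note-name scores.

    Assumption: a black key counts as a sharp (1) when it outweighs the natural
    below it, and as a flat (2) when it outweighs the natural above it.
    """
    sharps = [1 if score[black] > score[below] else 0
              for black, below, above in BLACK_KEYS]
    flats = [2 if score[black] > score[above] else 0
             for black, below, above in BLACK_KEYS]
    return sharps, flats

def scale_scores(notes_in_scale):
    """Idealised scores: 1.0 for notes in the scale, 0.0 for the rest."""
    names = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
    return {name: (1.0 if name in notes_in_scale else 0.0) for name in names}

f_major = scale_scores({"F", "G", "A", "A#", "C", "D", "E"})   # A# stands for Bb
d_major = scale_scores({"D", "E", "F#", "G", "A", "B", "C#"})

print(accidental_arrays(f_major))   # flats array marks Bb in the last position
print(accidental_arrays(d_major))   # sharps array marks C# and F#
```

With these idealised scores, the flats array for F Major and the sharps array for D Major reproduce the patterns given in the examples above.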
Once the two logical arrays are calculated, the array with the most sharps/flats is considered the primary key signature, while the other is stored in a cache. The primary array is run against a switch statement covering all 12 key signatures, and a key signature is displayed when a match is found. If no match is found, the cached array is made the primary array and the switch statement is repeated. If there is still no match, the program outputs "Unknown Key Signature" and reports a failure.
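The matching step with its fallback can be sketched with a lookup table in place of the switch statement. This Python is an illustration only. The text does not list which 12 key signatures the original code checks, so the hypothetical `KEY_TABLE` below covers just C Major and the major keys with one to five sharps or flats, built from standard music theory. Ties between the two arrays are resolved in favour of sharps, which is also an assumption.

```python
# Hypothetical table: patterns follow the array key
# [C#/Db, D#/Eb, F#/Gb, G#/Ab, A#/Bb] with 0 = natural, 1 = sharp, 2 = flat.
KEY_TABLE = {
    (0, 0, 0, 0, 0): "C Major",
    (0, 0, 1, 0, 0): "G Major",
    (1, 0, 1, 0, 0): "D Major",
    (1, 0, 1, 1, 0): "A Major",
    (1, 1, 1, 1, 0): "E Major",
    (1, 1, 1, 1, 1): "B Major",
    (0, 0, 0, 0, 2): "F Major",
    (0, 2, 0, 0, 2): "Bb Major",
    (0, 2, 0, 2, 2): "Eb Major",
    (2, 2, 0, 2, 2): "Ab Major",
    (2, 2, 2, 2, 2): "Db Major",
}

def count_accidentals(array):
    return sum(1 for value in array if value != 0)

def identify_key(sharps, flats):
    """Try the array with more accidentals first, then the cached one."""
    if count_accidentals(sharps) >= count_accidentals(flats):
        primary, cached = sharps, flats     # assumption: ties favour sharps
    else:
        primary, cached = flats, sharps
    for candidate in (primary, cached):
        name = KEY_TABLE.get(tuple(candidate))
        if name is not None:
            return name
    return "Unknown Key Signature"

print(identify_key([0, 0, 0, 0, 0], [0, 0, 0, 0, 2]))
print(identify_key([1, 0, 1, 0, 0], [0, 0, 0, 0, 0]))
print(identify_key([0, 1, 0, 0, 0], [2, 0, 0, 0, 0]))
```

A dictionary keyed on the pattern keeps the fallback logic in one loop. A five-element pattern cannot tell apart keys that use all five black notes, such as B Major and F# Major, so such keys would need extra information.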
For our MFCC Algorithm, we convert these outputs to integers, each corresponding to a specific key signature. This way, we can append the output of this algorithm to the end of the feature vector and see whether it helps the program correctly classify our 3 genres of music.
Right now, there are a few drawbacks to using this method over a frequency isolating method. Although this method proves to be very accurate, we must consider the trade-offs before deploying it on other projects.
The first drawback is the large amount of computation needed to resolve these key signatures. With a 30 second song, the algorithm has to cross correlate 96 sample notes with a very long signal. This can make the algorithm slow on larger input files.
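A rough count gives a sense of the scale. The Python sketch below counts the multiplications a direct (non-FFT) cross correlation would need. The 44.1 kHz sampling rate and the 0.1 second sample-note length are assumptions, as the text gives neither, and FFT-based implementations do considerably less work than this.

```python
def direct_xcorr_multiplies(signal_len, note_len, num_notes=96):
    """Multiplications needed to cross correlate every sample note directly.

    Across all lags, each (signal sample, note sample) pair is multiplied
    exactly once, and the whole computation repeats for every note.
    """
    return num_notes * signal_len * note_len

FS = 44_100            # assumed sampling rate (Hz)
SONG_SECONDS = 30      # song length used as the example in the text
NOTE_SECONDS = 0.1     # assumed sample-note length

total = direct_xcorr_multiplies(FS * SONG_SECONDS,
                                int(round(FS * NOTE_SECONDS)))
print(f"{total:.2e} multiplications")
```

Under these assumptions the count is on the order of 10^11, and it grows linearly with both the song length and the sample-note length.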
The second drawback is that the algorithm is less accurate on input files with faster tempos. Because the generated sample notes are a bit long, the tempo of a given song may be fast enough that the song's notes are shorter than the sample notes. This makes the resulting cross correlation values, and therefore the significance factor, notably smaller. We could optimize the algorithm for faster songs, but it would involve lowering the normalization factor and shortening the sample notes, which would reduce accuracy on slower songs.