Processing Algorithm

Chords are made up of notes. Therefore, the team started by detecting individual notes.

We created a script that converts WAV files into spectrograms: plots of frequency content over time. Each analysis frame was multiplied by a Hamming window to reduce the spectral leakage caused by the frame's sharp edges.
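As an illustrative sketch of the windowing step (using a naive DFT for clarity rather than the FFT a real implementation would use), the spectrogram computation looks roughly like:

```python
import math

def hamming(n_points):
    # Hamming window: tapers the frame edges to reduce spectral leakage
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_points - 1))
            for n in range(n_points)]

def dft_magnitudes(frame):
    # Naive DFT magnitude spectrum (fine for a short illustrative frame)
    n_points = len(frame)
    mags = []
    for k in range(n_points // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * n / n_points)
                 for n, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * n / n_points)
                 for n, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def spectrogram(samples, window_len, hop):
    # Slide a Hamming-windowed frame across the signal, one spectrum per hop
    win = hamming(window_len)
    frames = []
    for start in range(0, len(samples) - window_len + 1, hop):
        frame = [samples[start + n] * win[n] for n in range(window_len)]
        frames.append(dft_magnitudes(frame))
    return frames
```

For a pure tone at bin 4 of a 64-sample window, the resulting spectrum peaks at index 4, with the Hamming window spreading a little energy into the neighboring bins.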

Spectrogram Resolution

We experienced issues with poor frequency resolution: single frequency bands encompassed multiple pitches, which then became impossible to distinguish from one another. Lower frequencies were especially susceptible to this issue because notes are spaced geometrically in the frequency spectrum (each semitone multiplies frequency by 2^(1/12)), so lower notes lie closer together in hertz.
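The geometric spacing is easy to see from the standard equal-temperament formula (the specific notes compared below are our own illustrative choices):

```python
def note_freq(semitones_from_a4):
    # Equal temperament: each semitone multiplies frequency by 2**(1/12),
    # referenced to A4 = 440 Hz
    return 440.0 * 2 ** (semitones_from_a4 / 12)

# Adjacent low notes are far closer together in Hz than adjacent high notes:
low_gap = note_freq(-32) - note_freq(-33)   # C2 -> C#2, roughly 3.9 Hz apart
high_gap = note_freq(16) - note_freq(15)    # C6 -> C#6, roughly 62 Hz apart
```

A spectrogram bin a few hertz wide therefore merges neighboring bass notes while easily separating treble ones.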

We initially considered increasing the sampling frequency of the audio files to resolve this issue. However, a higher sampling rate only raises the maximum detectable frequency; it does not narrow the frequency bins. Instead, we zero-padded each spectrogram window to increase the number of DFT coefficients, which led to improved frequency resolution. Through experimentation, we observed that where the zeros were added (start vs. end vs. middle) didn't affect the DFT in any noticeable manner. Frequency resolution was then 'traded off' for temporal resolution as needed by shortening the spectrogram window length in time.
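A small self-contained sketch of the zero-padding trick (again with a naive DFT; the padding factor of 4 is just an example):

```python
import math

def dft_magnitudes(frame):
    # Naive DFT magnitude spectrum up to the Nyquist bin
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        mags.append(math.hypot(re, im))
    return mags

def zero_pad(frame, factor):
    # Appending zeros leaves the spectrum's shape intact but samples it on a
    # finer grid: bin spacing drops from fs/N to fs/(factor*N). Prepending
    # instead of appending is just a time shift, so magnitudes are unchanged.
    return frame + [0.0] * ((factor - 1) * len(frame))
```

A tone at 5.5 cycles per 32-sample window falls between bins 5 and 6 of the plain DFT; after 4x padding it lands exactly on bin 22 of the finer grid.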

Noise Gating

To ensure quiet notes are detected as effectively as loud ones, the spectrogram was passed through a "noise gate". This filter calculates the volume of the signal and, if the volume is too low, cuts off the signal entirely. Signal volume was calculated by summing the squares of all DFT coefficients in each time window, in accordance with Parseval's Theorem (Eq. 1). Volume was normalized across the entire WAV sample to allow for consistent filtering across different recordings. DFTs which passed the noise gate were then normalized so that quiet and loud notes appeared with similar prominence in the spectrogram. Figure 1 shows a plot of volume over time for "Twinkle Twinkle Little Star".

Eq. 1: Parseval's Theorem, sum_n |x[n]|^2 = (1/N) sum_k |X[k]|^2.

Figure 1: Volume over time for "Twinkle Twinkle Little Star". Each peak corresponds to a single note for this simple recording.
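A minimal sketch of the gate, assuming per-window magnitude spectra and a hypothetical relative threshold of 10% of the loudest window's energy:

```python
def noise_gate(spectra, threshold=0.1):
    # Window "volume" = sum of squared DFT magnitudes; by Parseval's theorem
    # this equals (up to a 1/N factor) the time-domain energy of the window
    energies = [sum(m * m for m in spectrum) for spectrum in spectra]
    peak = max(energies) or 1.0          # normalize across the whole recording
    gated = []
    for spectrum, energy in zip(spectra, energies):
        if energy / peak < threshold:
            # Too quiet relative to the loudest window: cut it off entirely
            gated.append([0.0] * len(spectrum))
        else:
            # Loud enough: rescale so quiet and loud notes look similar
            top = max(spectrum) or 1.0
            gated.append([m / top for m in spectrum])
    return gated
```

The first branch implements the gate itself; the second implements the per-window normalization described above.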

Low-Frequency Amplification

Because sound waves carry energy by compressing air, the power density of a sound wave increases with the frequency of the wave. This meant that low-frequency notes appeared less intensely than high-frequency notes in the spectrogram. We considered correcting for this phenomenon by dividing each DFT element by its frequency; however, that made very-low-frequency noise (such as wind) extremely loud. The team instead boosted all frequencies below a chosen cutoff by a constant factor. This method was adequate for detecting low-frequency notes, but remains an area for future development.
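The constant-factor boost reduces to a one-liner; the cutoff of 500 Hz and gain of 3 below are illustrative placeholders, not the project's actual values:

```python
def boost_low_frequencies(spectrum, bin_hz, cutoff_hz=500.0, gain=3.0):
    # Multiply every bin below the cutoff by a constant gain. Unlike dividing
    # by frequency, a constant factor cannot blow up near-DC noise (e.g. wind).
    return [m * gain if k * bin_hz < cutoff_hz else m
            for k, m in enumerate(spectrum)]
```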

Harmonic Pitch Rejection

Harmonics appear at integer multiples of a played note's fundamental frequency; they are overtones of a real note, not separately played notes themselves. Such harmonics created issues for our note detection algorithm: in many cases, harmonics appeared more intensely in the spectrogram than actual (played) notes. Simple thresholding was therefore inadequate for filtering harmonics out of the spectrum.

To identify harmonics, integer multiples of every detected note's frequency were considered. If a note existed at an integer multiple of another note's frequency, it was flagged as a potential harmonic. Harmonics of varying order could then be attenuated by varying amounts (0-100%) to reduce the issues they introduce to note detection. This was effectively a "notch filter" whose notches are placed at detected notes' harmonics, according to the harmonic orders the user wants to attenuate.
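A sketch of the notch idea, assuming detected note frequencies in hertz and a uniform spectrogram bin width (the default of fully removing second harmonics is an example configuration):

```python
def attenuate_harmonics(note_freqs, spectrum, bin_hz, attenuation=None):
    # attenuation maps harmonic order -> fraction of the bin removed (0.0-1.0)
    if attenuation is None:
        attenuation = {2: 1.0}        # e.g. fully notch out second harmonics
    out = list(spectrum)
    for f0 in note_freqs:
        for order, amount in attenuation.items():
            k = round(order * f0 / bin_hz)   # nearest bin to the harmonic
            if 0 <= k < len(out):
                out[k] *= (1.0 - amount)
    return out
```

With a detected note at 220 Hz and 110 Hz bins, the bin at 440 Hz (index 4) is zeroed while the fundamental's bin is untouched.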

The figures below demonstrate the harmonic rejection filter. On top is "Twinkle Twinkle Little Star", and on the bottom are the identified second harmonics. Regions of the spectrogram are detected as harmonics only if they are loud enough to pass the final threshold (next section); this avoids redundant operations in the interest of computation time.

Because 'quiet' 2D regions of harmonics are ignored, detected harmonics appear 'smaller' in the lower plot, and the relatively quiet second harmonics at (t=10, f=900) and (t=11, f=900) are not detected.

Thresholding

A simple threshold was then applied to the spectrogram to identify prominent frequencies; the figure below demonstrates such a threshold applied to "Twinkle Twinkle Little Star". Prominent frequencies are set to a value of "1", and all other frequencies are set to "0". Prominent frequencies are then classified as notes by selecting the equal-temperament pitch they lie closest to.
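The nearest-pitch classification can be sketched with standard equal-temperament math (the MIDI-style note naming below is our own convention for the example, not necessarily the project's):

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def nearest_note(freq_hz):
    # Distance from A4 (440 Hz) in semitones, rounded to the nearest
    # equal-temperament pitch
    semitones = round(12 * math.log2(freq_hz / 440.0))
    # A4 is MIDI note 69; octaves are numbered so that MIDI 60 is C4
    midi = 69 + semitones
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
```

A slightly sharp A (446 Hz) still rounds to A4, which is exactly the tolerance the thresholded spectrogram needs.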

With foundational note detection working sufficiently well, the team focused its efforts on real-time processing and chord detection.

Addition of Real-Time Processing

To detect notes in real-time, audio from a microphone needed to be processed in short 'buffers'. We were able to adapt our existing note detection algorithm to these short buffers very easily.

Software Structure And Performance Considerations

To implement real-time processing, we used the open-source "pyaudio" Python library to interface with the user's microphone. Our program runs parallel processing to capture audio data and execute the processing algorithm simultaneously. To enhance performance and standardize our processing function, we implemented an internal audio-signal object: WAV files (uncompressed audio) or raw microphone buffers are cheaply converted to this internal type, after which the same processing algorithm runs in either real-time or playback mode. The real-time loop can be set to a timeout, or run until a keyboard interrupt.
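The report doesn't show the internal object itself; a minimal stdlib-only sketch of the two conversions (class and method names here are hypothetical) might look like:

```python
import struct
import wave

class AudioSignal:
    # Hypothetical sketch of the internal audio type: one constructor per
    # source, both normalizing to floats in [-1, 1] plus a sample rate
    def __init__(self, samples, sample_rate):
        self.samples = samples
        self.sample_rate = sample_rate

    @classmethod
    def from_wav(cls, path):
        # Read an uncompressed mono 16-bit WAV file
        with wave.open(path, "rb") as wav:
            raw = wav.readframes(wav.getnframes())
            rate = wav.getframerate()
        return cls.from_buffer(raw, rate)

    @classmethod
    def from_buffer(cls, raw_bytes, sample_rate):
        # Raw microphone buffers arrive as little-endian 16-bit integers
        ints = struct.unpack("<%dh" % (len(raw_bytes) // 2), raw_bytes)
        return cls([i / 32768.0 for i in ints], sample_rate)
```

Because both paths end in `from_buffer`, the downstream processing code never needs to know whether it is in real-time or playback mode.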

Real-time processing captures audio in a user-configurable memory buffer. A smaller buffer yields more frequent output from the program but higher compute overhead, due to the cost of running pre-processing, calling the note detection algorithm, and launching a parallel thread for each buffer. The minimum usable buffer size therefore depends on the compute power available. By default, the buffer is large enough to store 2 seconds of audio. If buffer overflow or underflow occurs, the program throws an exception and exits.

Noise Filter

When processing entire WAV files, the program normalizes the volume (energy) of the entire signal to between 0 and 1. Silence or white noise can then be distinguished from actual data with a threshold, effectively comparing every detected note to some fraction of the loudest note. Real-time processing made noise filtering harder, especially as the buffer size decreased, because an entire buffer may be relatively silent. If a full buffer contained only white noise, we could not detect this after volume normalization: the normalization would scale the noise up, and insignificant sounds would be considered 'loud' and detected as notes.

Our solution was to implement volume memory: the program remembers the energy of the loudest note detected so far and uses it to filter white noise. If a note's energy is three or more orders of magnitude below the loudest note's, the note is filtered out. We created a custom "NoiseFilter" Python class to implement this feature. The class has several user-configurable modes, each suited to signals with a different volume profile (i.e. how the energy of the signal changes with time).
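The core volume-memory behavior can be sketched as follows (the class name comes from the report, but the internals and the single `passes` method are our assumptions; the configurable modes are omitted):

```python
class NoiseFilter:
    # Sketch of the volume-memory idea: remember the loudest note energy
    # seen so far across all buffers and reject anything far quieter
    def __init__(self, magnitude_cutoff=3.0):
        # Reject notes this many orders of magnitude below the loudest
        self.magnitude_cutoff = magnitude_cutoff
        self.loudest = 0.0

    def passes(self, energy):
        # Update the memory, then compare against the remembered peak
        self.loudest = max(self.loudest, energy)
        return energy >= self.loudest * 10.0 ** -self.magnitude_cutoff
```

Because the memory persists across buffers, a buffer containing only faint noise is judged against the loudest note ever heard, not against its own contents.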

Chord Interpretation

With note detection and real-time processing in place, we developed chord detection capabilities. Note lists from our note detection algorithm were fed into an existing open-source chord classification library to detect chords in music.

Pychord Implementation

We used the open-source Python library PyChord. A function in the library, find_chords_from_notes, takes in a list of notes and outputs the chord those notes constitute, if any. The list of notes our detection algorithm produces was cleaned to the library's specifications before being sent to find_chords_from_notes. The figures below demonstrate the differences between the output of our note detection algorithm and the inputs to the PyChord library:

Output from note detection algorithm

Cleaned data, as sent to PyChord
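A sketch of a cleaning pass consistent with the figures above (the exact rules, stripping octave numbers and de-duplicating, are our assumption about what PyChord's input requires):

```python
def clean_notes(detected):
    # Hypothetical cleaning: strip octave numbers, drop duplicates, and
    # preserve first-seen order, e.g. ["C4", "E4", "G4", "C5"] -> ["C", "E", "G"]
    seen = []
    for note in detected:
        name = note.rstrip("0123456789")   # "C#4" -> "C#"
        if name not in seen:
            seen.append(name)
    return seen

# The cleaned list can then be handed to PyChord, e.g.:
#   from pychord import find_chords_from_notes
#   find_chords_from_notes(["C", "E", "G"])
```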

Integrating Chord Detection with Real-Time Processing

With real-time processing and chord detection capabilities completed, the two were merged together to enable real-time chord detection. Our program was given two modes: reading from an existing WAV file, and listening to the user's microphone.

With the program itself explained, it's time for the RESULTS!