The beat detection algorithm is illustrated in the figure below. Each step of the algorithm is explained in more detail in the sections that follow.
Sound Input
The Matlab code takes a 5-second sample of the song being analyzed. An example of the sample in the time domain is shown in the figure to the right.
Frequency Filterbank
First, the signal is split into six separate signals according to frequency. This is meant to isolate the different groups of instruments in the signal so that they can be analyzed separately. The split is done by taking the FFT of the signal and dividing the spectrum into the bands 0-200 Hz, 200-400 Hz, 400-800 Hz, 800-1600 Hz, 1600-3200 Hz, and 3200 Hz to the sampling frequency (these band edges are taken from the paper by Scheirer, 1998, referenced in the sources tab). The time-domain representation of each band, obtained with an inverse FFT, is then passed to the envelope extractor step.
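The band split described above can be sketched as follows. This is a minimal illustration in plain Python rather than the project's Matlab code, and it assumes a naive O(n²) DFT so that it runs without external libraries (practical only for short signals). The names `dft`, `idft`, `filterbank`, and `BAND_EDGES_HZ` are hypothetical.

```python
import cmath

# Lower edges of the six bands, in Hz, as listed in the text.
BAND_EDGES_HZ = [0, 200, 400, 800, 1600, 3200]


def dft(x):
    """Naive discrete Fourier transform (a stand-in for an FFT routine)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse DFT; returns the real part of the time-domain signal."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def filterbank(signal, fs):
    """Split `signal` (sampled at `fs` Hz) into six time-domain band signals."""
    n = len(signal)
    spectrum = dft(signal)
    # The top band runs from 3200 Hz to the sampling frequency, as in the text.
    # Bins are mirrored, so no bin has an absolute frequency above fs / 2.
    upper_edges = BAND_EDGES_HZ[1:] + [fs]
    bands = []
    for lo, hi in zip(BAND_EDGES_HZ, upper_edges):
        masked = []
        for k in range(n):
            freq = min(k, n - k) * fs / n  # absolute frequency of bin k
            masked.append(spectrum[k] if lo <= freq < hi else 0j)
        bands.append(idft(masked))
    return bands
```

Because the six masks do not overlap and together cover every bin (provided the sampling frequency is above 3200 Hz), adding the six band signals back together should recover the original signal up to rounding error.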
Envelope Extractor
Next, each of the six signals from the previous step is full-wave rectified to reduce the high-frequency content. Each rectified signal is then convolved with the right half of a Hamming window. The convolution is carried out by transforming the signals to the frequency domain, multiplying, and inverse transforming back into the time domain. This step finds the envelope of the signal, from which sudden changes in sound can be identified more easily.
Above is an example figure showing the envelopes of the six separate frequency bands.
Above is an example figure showing the envelope for just the 0-200 Hz frequency band of the signal.
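The envelope extraction step can be sketched as below. This is a minimal sketch in plain Python, not the project's Matlab code. It performs the convolution directly in the time domain instead of by multiplication in the frequency domain, which is a deliberate simplification. The function names and the `half_len` parameter (the half-window length in samples) are assumptions made for illustration; the text does not state the window length.

```python
import math


def half_hamming(length):
    """Right (decaying) half of a Hamming window, `length` samples long."""
    if length < 2:
        raise ValueError("length must be at least 2")
    full = 2 * length - 1  # full window whose peak sits at index length - 1
    return [0.54 - 0.46 * math.cos(2 * math.pi * (length - 1 + i) / (full - 1))
            for i in range(length)]


def envelope(band, half_len):
    """Full-wave rectify `band`, then smooth it with a half-Hamming window."""
    rectified = [abs(s) for s in band]  # full-wave rectification
    window = half_hamming(half_len)
    out = []
    for t in range(len(rectified)):
        acc = 0.0
        # Causal convolution, truncated to the length of the input signal.
        for k in range(min(half_len, t + 1)):
            acc += window[k] * rectified[t - k]
        out.append(acc)
    return out
```

With this window, a single spike in the band signal produces an envelope that jumps up at the spike and then decays over `half_len` samples, which is the shape the next step relies on.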
Differentiator & Half Wave Rectifier
The envelope of each band is now differentiated and half-wave rectified. Differentiating emphasizes changes in sound amplitude, and half-wave rectifying keeps only the increases in amplitude. A beat can be thought of as a periodic emphasis of sound, so large increases in amplitude should correspond to beats.
Above is the 0-200 Hz frequency band of the signal after it is differentiated and half-wave rectified.
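As a rough illustration of this step, the sketch below uses a first difference as the differentiator, which is an assumption (the text does not say how the derivative is computed), and then sets negative values to zero. The function name `onset_strength` is hypothetical.

```python
def onset_strength(envelope):
    """Differentiate an envelope and keep only the increases."""
    if not envelope:
        return []
    # First difference; a leading 0.0 keeps the output the same length.
    diff = [0.0] + [b - a for a, b in zip(envelope, envelope[1:])]
    # Half-wave rectification: negative slopes (decays) are set to zero.
    return [max(0.0, d) for d in diff]
```

For example, an envelope of `[0, 1, 3, 2, 2, 5]` gives `[0, 1, 2, 0, 0, 3]`: the two rises are kept and the fall and the flat stretch are discarded.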
Resonant Filterbank
This step convolves comb filters with the six band signals. A comb filter is a series of impulses that repeat at a specified tempo. When a signal is convolved with a comb filter, the result has higher energy the closer the tempo of the comb filter is to the tempo of the input song. We specify the tempos of the comb filters, then transform the signals and filters into the frequency domain and multiply them.
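A comb filter and its convolution with a signal can be sketched as follows. This is a simplified illustration in plain Python that convolves directly in the time domain rather than multiplying in the frequency domain as described above. The number of impulses per comb (`n_pulses=3`) is an assumption, as are the function names.

```python
def comb_filter(bpm, fs, n_pulses=3):
    """Impulse train with `n_pulses` impulses spaced one beat apart at `bpm`."""
    period = round(60.0 * fs / bpm)  # samples per beat
    comb = [0.0] * (period * (n_pulses - 1) + 1)
    for p in range(n_pulses):
        comb[p * period] = 1.0
    return comb


def apply_comb(signal, comb):
    """Convolve `signal` with `comb`, truncated to the signal's length."""
    # Only the nonzero taps contribute, so skip the zeros in between.
    taps = [i for i, c in enumerate(comb) if c != 0.0]
    return [sum(comb[i] * signal[t - i] for i in taps if t >= i)
            for t in range(len(signal))]
```

When the impulses of the comb line up with the peaks of the signal, the shifted copies add on top of each other and the output contains tall peaks. When the spacing does not match, the copies land in different places and the output stays low.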
Energy Sum
The energy is found for each of the convolutions calculated in the previous step. For each comb filter tempo, the energies across the six frequency bands are summed.
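The summation can be sketched as below, assuming that energy means the sum of squared samples and that the comb-filtered outputs are stored in a dictionary keyed by tempo. Both the data layout and the function names are hypothetical.

```python
def energy(samples):
    """Energy of a signal, taken here as the sum of its squared samples."""
    return sum(s * s for s in samples)


def energy_per_tempo(filtered):
    """Sum band energies for each tempo.

    `filtered` maps a tempo in bpm to the list of comb-filtered band
    signals (six in the text) obtained for that tempo.
    """
    return {bpm: sum(energy(band) for band in bands)
            for bpm, bands in filtered.items()}
```

The result is one number per candidate tempo, which is what the final step compares.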
Peak-Picking
The comb filter tempo with the maximum energy sum is chosen as the fundamental tempo of the song.
The figure above shows the result of the comb filter convolution for the example song. The algorithm-detected peak tempo is around 100 bpm, which matches the human-detected tempo of the song.
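Peak-picking then reduces to finding the tempo with the largest energy sum. A minimal sketch, assuming the energy sums are held in a dictionary keyed by tempo in bpm (the function name and data layout are hypothetical):

```python
def pick_tempo(energy_by_bpm):
    """Return the tempo (bpm) whose summed energy is largest."""
    # If several tempos tie, max() returns the first one encountered.
    return max(energy_by_bpm, key=energy_by_bpm.get)
```

Called with made-up energies such as `{90: 12.0, 100: 77.0, 110: 30.5}`, this returns `100`.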