The impulse response of a system is the output of the system when the input is an impulse (Pauly). We can approximate an impulse with a short, high-amplitude signal, such as a clap. For example, the recorded sound of a clap in an empty room is the impulse response between the hands of the clapper and the microphone. The room and all its contents and properties constitute the system, and the clap is the short input signal. The impulse response is important because it fully characterizes a linear time-invariant (LTI) system and can be used to calculate the output given any input (Pauly). If you know the input to the system and the system's impulse response, you can predict the output using convolution, which we will discuss in the next subsection.
In other words, if you want to know what Bohemian Rhapsody would sound like in a cave, you simply need to record the impulse response of the cave (recording the sound of a single clap in the cave would suffice) and convolve the song with that impulse response. Similarly, if you want to simulate the sound of Bohemian Rhapsody coming from a specific direction in the cave, you can record the impulse response of the system for an input coming from that specific direction. This is what we did, but without the cave.
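To make this concrete, a minimal MATLAB sketch of the cave example might look like the following; the file names are hypothetical placeholders, and we assume the song and the impulse response were recorded at the same sampling rate:

    % Sketch: simulate a song playing in a cave by convolving it with the
    % cave's impulse response. File names are hypothetical placeholders.
    [song, fs] = audioread('bohemian_rhapsody.wav');   % input signal
    [h, ~] = audioread('cave_clap.wav');               % impulse response (assumed same fs)
    song = mean(song, 2);                              % collapse to mono for this sketch
    y = conv(song, h(:,1));                            % output = input convolved with IR
    y = y / max(abs(y));                               % normalize to avoid clipping
    soundsc(y, fs);                                    % play the simulated result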
Figure 1 depicts our measurement setup. We used two microphones positioned near the ears of a volunteer to collect data, and we did so in a quiet environment (a dorm room). We marked a circle with a radius of one meter on the ground, centered at the feet of the volunteer and her microphones. We gathered 18 impulse responses from each microphone by recording the system's response (the room, the objects and ambiance around the microphones, and so on) to an input from different points around the circle, one recording every 20 degrees. Both microphones recorded each impulse response simultaneously, and our input was a clap.
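In MATLAB, one measurement might look like the following minimal sketch, assuming a stereo (two-channel) recording device; the duration, bit depth, and file naming are illustrative, not our exact settings:

    % Sketch: capture one impulse response pair with a stereo recorder.
    fs = 44100;                          % sampling rate (samples per second)
    rec = audiorecorder(fs, 16, 2);      % 16-bit, 2 channels (left and right mics)
    disp('Clap now...');
    recordblocking(rec, 2);              % record for 2 seconds
    ir = getaudiodata(rec);              % N-by-2 matrix: column 1 left, column 2 right
    audiowrite('ir_080deg.wav', ir, fs); % hypothetical name encoding the angle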
There are a few notable things about our setup. First, we used two microphones from a stereo microphone unit. Second, they recorded simultaneously. Third, we placed a volunteer's head between the microphones instead of simply spacing them six inches apart (the approximate width of a head). We did this to imitate a binaural recording, which is a method of recording sound the way human ears hear it (Pike). We want to create a realistic simulation, and since the user will be listening to our sound manipulation with their ears, we tried to make our microphone setup as similar to human ears as we could -- two microphones facing out and a little forward, approximately six inches apart, on either side of a head, recording at the same time.
Using a volunteer's head instead of empty space is particularly significant. The shape, size, and density of a head affect how sound reaches the two ears on either side of it; this effect is known as head shadowing. Head shadowing is essential to sound localization because it affects the ears differently depending on where the sound is coming from (Francis). Your brain knows how the shift in a sound's amplitude, timbre, and phase between the two ears correlates to the area around you -- if your left ear perceives the sound with a slightly greater amplitude than the right, your brain knows the sound is probably coming from your left, and so on (Francis).
We recorded the impulse responses in MATLAB. Figure 2 depicts one of the impulse responses we collected. The room was quiet, so the amplitude of the recording before and after the clap is low. The clap appears as a sharp spike that peaks high but does not last long. The high amplitude and short duration make the clap an ideal input signal for recording an impulse response, because it closely resembles an impulse. All of the impulse responses we recorded are similar to the one in Figure 2, because our clap inputs were approximately the same, although the amplitudes of their peaks vary.
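A quick sketch along these lines can plot a recorded response and locate the clap's peak (the file name is the hypothetical one from the recording sketch above):

    % Sketch: inspect one recorded impulse response and find its peak.
    [ir, fs] = audioread('ir_080deg.wav');
    t = (0:size(ir,1)-1) / fs;                 % time axis in seconds
    plot(t, ir(:,1));                          % left-channel impulse response
    xlabel('Time (s)'); ylabel('Amplitude');
    [pk, idx] = max(abs(ir(:,1)));             % peak height and its sample index
    fprintf('Peak amplitude %.4f at %.4f s\n', pk, t(idx));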
As we mentioned a couple of paragraphs earlier, your brain uses the difference between the sounds your left and right ears perceive to judge the distance and angle from which a sound is coming (Francis). An impulse response from just one microphone is not enough for the listener to localize sound -- we need to convolve a sound with the impulse responses from both the left and right microphones and play the result back to the user as a two-channel audio track (a different sound for each ear) so they can identify any disparities between the left and right channels and use them to localize the sound.
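In code, the idea might look like this sketch, where hL and hR are assumed to hold equal-length left and right impulse responses (column vectors) for the desired angle, and the input file name is hypothetical:

    % Sketch: render a mono sound binaurally with a left/right IR pair.
    [x, fs] = audioread('input_sound.wav');   % hypothetical input signal
    x = mean(x, 2);                           % make sure it is mono
    yL = conv(x, hL);                         % left-ear channel
    yR = conv(x, hR);                         % right-ear channel
    y = [yL, yR];                             % two-channel (stereo) track
    y = y / max(abs(y(:)));                   % normalize jointly to keep the L/R balance
    soundsc(y, fs);                           % each ear hears its own channel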
All of which is to say that the impulse responses for the left and right microphones are usually not the same. At 0 and 180 degrees they could be, because there should not be any difference in amplitude or phase between the left and right microphones when the sound is equidistant from them, but any other angle should yield two different impulse responses. The difference between them is key to localizing the sound. Figure 3 shows an overlaid plot of the left and right microphone impulse responses given an impulse one meter away and 80 degrees from the volunteer (on her right side). The right impulse response peaks sooner and higher than the left impulse response, which makes sense because the sound is coming from the right: the source is closer to the right ear, and the right ear has nothing between it and the sound (other than air). Conversely, the left ear is farther away and experiences head shadowing, which explains its lower and later peak. Theoretically, these small differences should be enough for a listener to determine that the sound is coming from approximately 80 degrees to the right. This is why we used two microphones.
As we briefly touched upon, the two key differences between the left and right sounds the brain perceives are the amplitude and the arrival time (Francis). As we saw in Figure 3, the peak for the left microphone happened after the peak for the right microphone. The difference between the peaks was 21 samples, which, at a sampling rate of 44100 samples per second, is roughly 0.5 milliseconds. The difference in peak amplitude was 0.3685. In other words, the right microphone perceived the impulse roughly 0.5 milliseconds sooner, and with an amplitude 0.3685 greater, than the left microphone.

The time delay and the difference in amplitude between the left and right microphones vary with the impulse angle in a predictable way. At 0 and 180 degrees the differences should be negligible, and at 90 and 270 degrees they should be most extreme. Plotting the absolute value of the difference in peak time or peak amplitude as the radial magnitude at each angle yields an interesting-looking graph that resembles two overlapping bubbles, one over where each ear would be (assuming the ears are somewhere around 90 and 270 degrees). Figure 4a depicts the difference between the peak amplitudes of the left and right impulse responses for various angles, and Figure 4b shows the difference between the indices of those peaks; in both figures, the angle between the receiver and the source is as depicted. Neither graph is perfectly centered at (0,0), which indicates a relative systematic error between the two microphones (i.e., the right microphone is slightly more sensitive and records sounds with a higher amplitude than the left microphone). However, the differences in peak time and amplitude at 0 and 180 degrees are very small relative to those at 90 and 270 degrees, which makes sense and validates the relative accuracy of our microphones. The impulse responses we collected have peak time and amplitude differences that make sense; therefore, they should lend themselves to realistically simulating sound from a given direction by accurately rendering the differences in peak time and amplitude between the left and right ears. However, we cannot feasibly take the impulse response for every angle between 0 and 360 degrees, so we will have to interpolate the impulse responses for the angles we do not have from the impulse responses we do have.
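The quantities behind Figures 4a and 4b can be computed with a sketch along these lines; the file naming follows the hypothetical scheme from above, and the sign convention (negative means the left microphone leads) is the one described in the next paragraph:

    % Sketch: peak amplitude and peak time differences for each angle.
    angles = 0:20:340;                        % the 18 measured angles
    ampDiff = zeros(size(angles));
    idxDiff = zeros(size(angles));
    for k = 1:numel(angles)
        ir = audioread(sprintf('ir_%03ddeg.wav', angles(k)));
        [pkL, iL] = max(abs(ir(:,1)));        % left-channel peak and index
        [pkR, iR] = max(abs(ir(:,2)));        % right-channel peak and index
        ampDiff(k) = pkR - pkL;               % signed peak amplitude difference
        idxDiff(k) = iL - iR;                 % signed delay in samples (negative: left leads)
    end
    figure; polarplot(deg2rad(angles), abs(ampDiff));   % bubble plot like Figure 4a
    figure; polarplot(deg2rad(angles), abs(idxDiff));   % bubble plot like Figure 4b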
Interpolating the data is relatively straightforward. As we mentioned before, the key to localizing a sound is how the two ears perceive it with regard to relative amplitude and time (Francis). The differences in amplitude and arrival time between the left and right microphones vary between impulse responses, as we showed in Figure 4, and if we know how they vary, we can interpolate those differences for the angles for which we do not have impulse responses. We decided to use a linear approximation between the values we know to estimate the values we do not. We do so by finding the relative difference in peak amplitude and peak time for each pair of impulse responses we have and organizing them in a data matrix with three columns: angle, relative amplitude difference, and relative delay. The sign of each value indicates whether it is biased toward the left or right microphone (e.g., -5 samples indicates the left microphone's peak is 5 samples ahead of the right microphone's). We assume the reader knows how linear interpolation works, so we will not discuss its mechanics further here, but we will walk through a quick example.
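Continuing the sketch above, the data matrix is just these values stacked column-wise, and MATLAB's interp1 can linearly interpolate the differences at an unmeasured angle directly:

    % Sketch: the three-column data matrix described above.
    D = [angles(:), ampDiff(:), idxDiff(:)];   % columns: angle, amplitude diff, delay diff
    a = 25;                                    % an angle we did not measure
    ampAt   = interp1(D(:,1), D(:,2), a);      % interpolated amplitude difference at 25 deg
    delayAt = interp1(D(:,1), D(:,3), a);      % interpolated delay (in samples) at 25 deg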
We have 18 pairs of impulse responses, one pair every 20 degrees. If the given angle is 25 degrees, we take the values from the two closest measured angles, 20 and 40 degrees, and find the slope between them. The difference in peak amplitude is 0.0638 for the impulse responses at 20 degrees and 0.303 for the impulse responses at 40 degrees, so the slope between them is (0.303 - 0.0638)/20 = 0.012 per degree. The slope for the difference in peak times is (21 - 14)/20 = 0.35 samples per degree. The difference between 25 and 20 degrees is 5, so the change in the amplitude difference is 0.012 * 5 = 0.06. Now we take the impulse responses for 20 degrees, multiply them by 1 + 0.06 = 1.06, and shift the left impulse response later by 0.35 * 5 = 1.75 samples (rounded to 2), and we have linearly approximated the impulse responses for an angle at which we took no data.
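The same example in code, under the assumptions above (the numbers are the measured values quoted in the text, and the file name is hypothetical):

    % Sketch: approximate the impulse responses at 25 degrees from the
    % measured pair at 20 degrees, following the worked example above.
    ir20 = audioread('ir_020deg.wav');      % measured pair at 20 degrees
    ampSlope   = (0.303 - 0.0638) / 20;     % ~0.012 amplitude difference per degree
    delaySlope = (21 - 14) / 20;            % 0.35 samples per degree
    dA    = 25 - 20;                        % 5 degrees past the measured angle
    scale = 1 + ampSlope * dA;              % ~1.06 amplitude adjustment
    shift = round(delaySlope * dA);         % 1.75, rounded to 2 samples
    ir25 = ir20 * scale;                    % scale the 20-degree pair, as described
    hR25 = ir25(:,2);                       % right-ear impulse response at 25 degrees
    hL25 = [zeros(shift,1); ir25(1:end-shift,1)];   % left-ear response, delayed 2 samples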
Now, we just need to convolve an input signal with our impulse responses to produce the simulation.