MUMT 618

Computational Modeling of Musical Acoustical Systems

Final Project Report on

An Auralization Framework for Virtual Acoustics:

Structural Model for Binaural Sound Synthesis,

Binauralization of Room Impulse Responses


Submitted by: Ajin Tom

Course supervisor: Prof. Dr. Gary Scavone


Music Technology Area, Department of Music Research

Schulich School of Music, McGill University, Montreal, Canada

Fall 2018

Introduction

With the advent of VR/AR technologies, binaural sound synthesis has become a powerful tool for headphone-based presentation of virtual acoustic environments (VAEs). It can be applied for auralization purposes in various areas such as audio engineering, telecommunication, and architectural acoustics. For many of these applications a plausible presentation is sufficient; an authentic reproduction of the sound field is often not pursued, given the limits of human auditory perception. With binaural audio, consumers can experience multi-channel surround / 3D sound over headphones and still perceive the intended space. Hence they need no dedicated loudspeaker setup, nor will they disturb those present in the same physical environment.

Binauralization can greatly enhance auditory interfaces to computers, improve the sense of presence in virtual reality simulations, and add excitement to computer games. Another niche market is the use of binaural rendering for mastering and monitoring with headphones. Professional audio engineers are trained to create mixes and to carry out the mixing/mastering process on loudspeakers as well as headphones. Switching between reproduction environments requires a multitude of loudspeaker setups, which can be costly and space-consuming. Additionally, amateur and semi-professional audio engineers are increasingly benefiting from virtual audio reproduction techniques.

In the context of headphone listening, 3D localization can be achieved using head-related transfer functions (HRTFs) and head-related impulse responses (HRIRs), which can either be measured or modelled. Physical effects of the diffraction of sound waves by the human torso, shoulders, head, and outer ears (pinnae) modify the spectrum of the sound that reaches the ear drums. These changes are captured by the HRTF, which not only varies in a complex way with azimuth, elevation, distance, and frequency, but also varies significantly from person to person. Though impulse-response-based convolution using these measurements is perceptually more plausible, it is not computationally efficient and cannot easily be adapted for HRTF individualization.

This project aims to build a structural model for binaural sound synthesis as proposed by Brown and Duda, followed by an investigation of auralization and stereo-to-binaural upmixing techniques using time-domain (acoustic modelling) as well as frequency-domain (signal processing) approaches.


Objectives

  1. Analyse measured impulse responses, build an analytical binaural model, and compare results to measurements
  2. Implement the structural and IR-based models using filters, delay lines, bilinear transformations, convolutions, etc. in MATLAB
  3. Investigate the scope for HRTF individualization by fine-tuning parameters of the analytical model
  4. Develop and implement binauralization of monaural impulse responses for room auralization using time-domain (acoustic modelling) techniques
  5. Develop and implement stereo-to-binaural upmixing using frequency-domain (signal processing) techniques
  6. Compile an STK class / JUCE (VST) audio plugin for real-time binauralization that accounts for listener position shifts and head movements

Background Theory


Apart from the Interaural Time Difference (ITD) and Interaural Level Difference (ILD), there are three other cues that come into play when localizing sound objects in space. Localization of a sound source in azimuth and elevation is accomplished by accounting for reflections off the shoulders and upper torso, the acoustic shadowing effect of the head, and the reflections due to the small ridges of the outer ear (the pinna). These three effects can be modelled on the digital representation of the source signal.

Modelling the structural properties of the pinna-head-torso system gives us the possibility of applying continuous variation to the positions of sound sources and to the morphology of the listener. Many of the physical/geometric properties can be understood by careful analysis of the HRIRs, plotted as surfaces over the variables time and azimuth, or time and elevation, as seen in Fig. 1.

The next sections describe the filters and delay lines used to model the following sub-models, shown in Fig. 2:


  1. Head-shadow filtering and ITD - direction and distance
  2. Shoulder echo and Torso effects - direction and elevation
  3. Pinna Reflections - direction and elevation

Starting from the approximation of the head as a rigid sphere that diffracts a plane wave, the shadowing effect can be effectively approximated by a first-order continuous-time system, i.e., a pole-zero couple in the Laplace complex plane:
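Following Brown and Duda (1998), with head radius a and speed of sound c, this pole-zero pair can be written as

\[
H(\omega, \theta) = \frac{1 + j\,\alpha(\theta)\,\omega / (2\omega_0)}{1 + j\,\omega / (2\omega_0)},
\qquad \omega_0 = \frac{c}{a},
\qquad \alpha(\theta) = 1.05 + 0.95 \cos\!\left(\frac{180^\circ}{150^\circ}\,\theta\right),
\]

where θ is the angle of incidence measured from the axis of the ear.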

The pole-zero couple can be directly translated into a stable IIR digital filter by bilinear transformation:
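Substituting the bilinear map s = 2 f_s (1 - z^{-1}) / (1 + z^{-1}) into the pole-zero pair above gives

\[
H(z) = \frac{(\omega_0 + \alpha f_s) + (\omega_0 - \alpha f_s)\,z^{-1}}{(\omega_0 + f_s) + (\omega_0 - f_s)\,z^{-1}}.
\]

A minimal MATLAB sketch of this filter follows; the head radius and the test signal are assumed placeholders.

% Head-shadow filter: bilinear transform of
% H(s) = (1 + alpha*s/(2*w0)) / (1 + s/(2*w0)), w0 = c/a.
fs    = 48000;                   % sample rate (Hz)
c     = 343;                     % speed of sound (m/s)
a     = 0.0875;                  % head radius (m), assumed value
theta = 60;                      % angle from the ear axis (degrees)
w0    = c / a;                   % characteristic frequency (rad/s)
alpha = 1.05 + 0.95 * cosd(theta * 180 / 150);   % shadow coefficient

b  = [w0 + alpha*fs, w0 - alpha*fs] / (w0 + fs); % numerator coefficients
av = [1, (w0 - fs) / (w0 + fs)];                 % denominator coefficients

x = randn(fs, 1);                % placeholder input signal
y = filter(b, av, x);            % head-shadowed output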

The ITD is obtained through a first-order all-pass filter whose group delay follows the function below:
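In the Woodworth-style ray-tracing form used by Brown and Duda, with θ again measured from the axis of the ear,

\[
\tau_h(\theta) =
\begin{cases}
-\dfrac{a}{c}\cos\theta, & 0 \le |\theta| < \dfrac{\pi}{2}, \\[6pt]
\dfrac{a}{c}\left(|\theta| - \dfrac{\pi}{2}\right), & \dfrac{\pi}{2} \le |\theta| < \pi.
\end{cases}
\]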

Frequency Response of an ideal rigid sphere

Frequency Response of simple one-pole-zero spherical-head model

The shoulder and torso effects are synthesized as a single echo. This is implemented using a fractional delay line whose delay is proportional to the azimuth and elevation angles. An approximate expression for the time delay, as a function of azimuth and elevation, is deduced from the measurements.
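A minimal sketch of such an echo, realized with a linearly interpolated fractional delay line (the delay and gain values below are placeholders, not the measured ones):

% Shoulder echo via a linearly interpolated fractional delay line.
fs     = 48000;
tau_ms = 1.1;                           % echo delay in ms (placeholder)
g      = 0.5;                           % echo gain (placeholder)
D      = tau_ms * 1e-3 * fs;            % delay in fractional samples
Di     = floor(D);                      % integer part
mu     = D - Di;                        % fractional part

x    = randn(fs, 1);                    % placeholder input
x1   = [zeros(Di,   1); x(1:end-Di)];   % delayed by Di samples
x2   = [zeros(Di+1, 1); x(1:end-Di-1)]; % delayed by Di+1 samples
echo = (1 - mu) * x1 + mu * x2;         % linear interpolation between taps
y    = x + g * echo;                    % direct sound plus shoulder echo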

Elevation dependence of HRIR from measurements

Time delay modelled as single echo for different azimuths

Single echo implementation using fractional delay-lines

The pinna provides multiple reflections that can be obtained by means of a tapped delay line. In the frequency domain, these short echoes translate into notches whose positions are elevation-dependent. Pinna echoes are frequently considered the main cue for the perception of elevation. The pinna activity is characterized by six echoes, each specified by a time delay and a reflection coefficient, as in the following equation:
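Following Brown and Duda, the delay of the n-th pinna event varies with azimuth θ and elevation φ as

\[
\tau_{pn}(\theta, \phi) = A_n \cos\!\left(\frac{\theta}{2}\right)
\sin\!\left[\,D_n\,(90^\circ - \phi)\,\right] + B_n,
\qquad -90^\circ \le \theta \le 90^\circ,\; -90^\circ \le \phi \le 90^\circ,
\]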

where, for the n-th pinna event, A_n is the amplitude, B_n the offset, and D_n the scaling factor.
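As a sketch, the six pinna events can then be realized as a tapped delay line whose tap positions follow the equation above. All numeric values below are illustrative placeholders, not the fitted coefficients.

% Pinna reflections as a tapped delay line with elevation-dependent taps.
fs    = 48000;
theta = 30;  phi = 20;                      % azimuth, elevation (degrees)
rho   = [ 0.5 -0.4  0.3 -0.2  0.1  0.05 ];  % reflection coefficients
A     = [ 1    5    5    5    5    5    ];  % delay amplitudes (samples)
B     = [ 2    4    7   11   13   17   ];   % delay offsets (samples)
Dn    = [ 1   0.5  0.5  0.5  0.5  0.5  ];   % scaling factors

x = randn(fs, 1);                           % placeholder input
y = x;                                      % start from the direct path
for n = 1:6
    tau = A(n) * cosd(theta/2) * sind(Dn(n) * (90 - phi)) + B(n);
    Di  = floor(tau);  mu = tau - Di;       % integer + fractional parts
    tap = (1 - mu) * [zeros(Di,   1); x(1:end-Di)] ...
        + mu       * [zeros(Di+1, 1); x(1:end-Di-1)];
    y = y + rho(n) * tap;                   % accumulate the n-th pinna echo
end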

Out of these parameters, the scaling factor, D, proved to be the key factor for HRTF individualization. During informal listening tests, particularly in a spin test (where the sound source is made to be perceived as moving along azimuth and elevation independently), varying just the scaling factor proved sufficient to adapt the HRTF to the individual listener.


Pinna reflections from HRIR measurements

Pinna events modelled as echoes for different elevation angles

Pinna reflections implemented using fractional delay-lines

Results: Binaural Model



Head-related Impulse Response (HRIR) plots comparison

Measured HRIR vs modelled HRIR

Measured HRIR for azimuth = 20°, elevation = 45°

3D plot of measured HRIR across elevation angles

Head-related Transfer Function (HRTF) plots comparison

Measured HRTF vs modelled HRTF

Measured HRTF for left & right ear

Monaural Impulse Response

Measured HRTF across azimuth angles

3D plot of measured HRTF across azimuth angles

Binauralization of Room Impulse Responses

Auralization is the process of rendering audible the (possibly imaginary) sound field of a source in a space by physical/mathematical modelling, in such a way as to simulate the binaural listening experience at a given position in the modelled space.

Two approaches are investigated in this project:

1. Stereo-to-binaural up-mixing using frequency-domain (signal processing) techniques:

Ambience extraction and source un-mixing, based on panning coefficients and inter-channel coherence, are carried out on a stereo mix to obtain surround channels, which are then convolved with the appropriate HRTFs, followed by delay compensation for binaural playback. (A sketch of the coherence computation appears after this list.)

2. Binauralization of RIRs using time-domain (acoustic modelling) techniques:

This is a lower-level approach in which the direct sound, early/geometric reflections, and diffuse reverberation components are extracted from monaural impulse responses of a space and binauralized for plausible binaural playback.
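As a rough sketch of the coherence measure behind the ambience extraction in approach 1 (the STFT parameters and the moving-average smoothing are assumptions; Avendano and Jot use recursive smoothing):

% Inter-channel coherence of a stereo mix in the STFT domain.
fs       = 48000;
x        = randn(4*fs, 2);          % placeholder stereo mix
win      = hann(1024);
noverlap = 512;
X1 = spectrogram(x(:,1), win, noverlap, 1024, fs);
X2 = spectrogram(x(:,2), win, noverlap, 1024, fs);

% Smoothed auto- and cross-spectra (moving average over 8 frames):
P11 = movmean(abs(X1).^2,     8, 2);
P22 = movmean(abs(X2).^2,     8, 2);
P12 = movmean(X1 .* conj(X2), 8, 2);

coh  = abs(P12) ./ sqrt(P11 .* P22 + eps); % ~1 = coherent source, ~0 = ambience
mask = 1 - coh;                            % crude ambience-extraction mask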


These figures illustrate concept flow diagrams of various auralization processes:
(left) Evolution of RMS over the front left, centre, and right speakers; (right) ambience extraction

Binauralization of monaural IRs is carried out as follows:

  1. Given a monaural room impulse response, the direct sound / early reflection components are extracted based on the onset time of the first peak. This segment is convolved with the front-left and front-right HRTFs so that the direct sound appears to come from the source position.
  2. Geometric reflections are extracted using raised-cosine weighting filters and Hann windows by tracking the local minima. Each geometric reflection is convolved with an HRTF of a pre-defined azimuth and elevation angle based on the shape of the room.
  3. Diffuse reverberation components are convolved with binaural noise using overlap-and-add techniques with the help of Hann windows.

Finally, the three segments are stitched together to produce a binaural room impulse response, which can be convolved with an anechoic recording of the sound source. Following is the block diagram of the algorithm:

Implementation (MATLAB)

The algorithm requires the following input data: an omni-directional monaural impulse response, basic geometric information about the room volume, and the direct sound's direction (in this project's context we do not account for listener position shifts). The algorithm uses a set of head-related impulse responses and a pre-processed sequence of binaural noise. Both were acquired from measurements with a Neumann KU 100 artificial head, for the near field as well as the far field.

The algorithm only applies to frequency components above 200 Hz. For lower frequencies, the interaural coherence of a typical binaural impulse response is close to one, so the omni-directional impulse response is maintained. The band splitting is carried out using Chebyshev Type II filters.
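A minimal sketch of this band split; the filter order and the 60 dB stopband attenuation are assumed values.

% Split the omni-directional RIR at 200 Hz with Chebyshev Type II filters.
fs     = 48000;
fc     = 200;
h_omni = randn(fs/2, 1) .* exp(-(1:fs/2)'/4000);  % placeholder RIR
[b_hp, a_hp] = cheby2(8, 60, fc/(fs/2), 'high');
[b_lp, a_lp] = cheby2(8, 60, fc/(fs/2), 'low');
h_hi = filter(b_hp, a_hp, h_omni);  % processed binaurally further on
h_lo = filter(b_lp, a_lp, h_omni);  % kept omni-directional below 200 Hz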

The first relevant maximum in the impulse response is identified as the direct sound using onset detection. The time section from 5 ms up to at most 150 ms is assigned to the early reflections and the transition towards the diffuse reverberation. In order to determine the early reflections, the energy in a sliding window of 8 ms length is calculated and high-energy areas are marked. Peaks above a certain threshold are identified and assigned to geometric reflections. Most of the magic numbers mentioned above were obtained by averaging over several runs of the algorithm in various settings and informal listening tests.
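A sketch of this segmentation, assuming a simple maximum-based onset estimate and an illustrative energy threshold (the project's exact constants are not reproduced here):

% Sliding-window energy and peak picking for reflection detection.
fs   = 48000;
h    = [zeros(100,1); 1; 0.3*randn(fs/2,1) .* exp(-(1:fs/2)'/3000)];  % placeholder RIR
wlen = round(0.008 * fs);                 % 8 ms window
e    = movsum(h.^2, wlen);                % sliding energy

[~, i_direct] = max(abs(h));              % direct-sound onset (simplified)
t0 = i_direct + round(0.005 * fs);        % early reflections from +5 ms ...
t1 = min(i_direct + round(0.150 * fs), length(h));  % ... up to +150 ms

thr = 4 * median(e(t0:t1));               % energy threshold (assumed factor)
[~, locs] = findpeaks(e(t0:t1), 'MinPeakHeight', thr);
refl_idx = locs + t0 - 1;                 % sample indices of geometric reflections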

A parametric model of the early reflections is created using convolutions, HRTFs, and a weighting function. The marked high-energy sections are implemented as raised-cosine filters whose window lengths are raised to the next highest power of two for ease of implementation and faster convolutions. The weighting functions are convolved with Hann windows to avoid abrupt peaking, which led to pops and clicks during listening tests. A windowed section around each of the peaks is treated as one reflection.
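One way to build such a weighting function, as a sketch (section lengths and the Hann smoother length are assumptions); the smoothed sections would subsequently be zero-padded to the next power of two for fast convolution:

% Weighting function: mark sections around detected reflections, then
% smooth the edges with a short Hann window to avoid pops and clicks.
fs       = 48000;
L        = fs/2;                          % RIR length (placeholder)
refl_idx = [2400 3100 4000];              % detected peaks (placeholders)
half     = round(0.004 * fs);             % +/- 4 ms around each peak
wfun     = zeros(L, 1);
for k = refl_idx
    wfun(max(1, k-half):min(L, k+half)) = 1;   % mark one reflection section
end
hw   = hann(round(0.002 * fs));           % 2 ms Hann smoother (assumed)
wfun = conv(wfun, hw / sum(hw), 'same');  % raised-cosine-like edges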

Each of the windowed sections is convolved with HRIRs of different directions. The directions are determined through a lookup table which holds the incidence directions of the synthesized reflections, based on a spatial reflection pattern of a shoebox room with non-symmetrically positioned source and receiver, adapted from the literature. The convolutions are implemented as frequency-domain multiplications followed by inverse Fourier transforms, as this proved much faster in MATLAB than time-domain convolution. Interpolation in the spherical domain is performed to synthesize interim directions.
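The fast-convolution step can be sketched as follows (the segment and HRIR below are placeholders):

% Fast convolution of one windowed section with an HRIR via the FFT.
seg  = randn(1024, 1);                 % windowed reflection segment
hrir = randn(256, 1);                  % HRIR from the direction lookup table
N    = length(seg) + length(hrir) - 1; % full convolution length
Nfft = 2^nextpow2(N);
y    = ifft(fft(seg, Nfft) .* fft(hrir, Nfft));
y    = real(y(1:N));                   % trim zero-padding; inputs are real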

The diffuse part of the synthesized binaural room impulse response is calculated as follows: the omni-directional RIR, excluding the sections of the direct sound and geometric reflections, is convolved with small sections of 2.7 ms (128 samples at a 48 kHz sample rate) of the binaural noise with 75% overlap. Thus, the binaural parameters, such as the interaural coherence of the noise and the frequency-dependent envelope of the omni-directional room impulse response, are maintained. The lengths of the time sections were determined by informal listening tests during development and implementation.
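A sketch of this overlap-and-add step, with decorrelated Gaussian noise standing in for the measured binaural noise (normalization is omitted):

% Diffuse part: 128-sample Hann-windowed omni sections, each convolved
% with a block of binaural noise and overlap-added at 75% overlap.
fs     = 48000;
B      = 128;  hop = B/4;                          % 2.7 ms blocks, 75% overlap
w      = hann(B, 'periodic');
h_diff = randn(fs/2, 1) .* exp(-(1:fs/2)'/4000);   % placeholder diffuse tail
noise  = randn(B, 2);                              % stand-in binaural noise block
brir_diff = zeros(length(h_diff) + 2*B, 2);
for n = 1:hop:length(h_diff) - B
    seg = h_diff(n:n+B-1) .* w;                    % windowed omni section
    for ch = 1:2
        cc = conv(seg, noise(:,ch));               % convolve with noise block
        brir_diff(n:n+length(cc)-1, ch) = brir_diff(n:n+length(cc)-1, ch) + cc;
    end
end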

Finally, all three components of the synthesized impulse response are appropriately weighted and summed to give the final BRIR.

Audio demos are available at https://github.com/ajintom/hrtf/tree/master/binRIR as demo_out_*.wav

Original Audio -> Final audio convolved with BRIR

Following are the results of the BRIR synthesis algorithm:

Evolution of Monaural Impulse Response -> Binaural Room Impulse Response

Conclusion and Future Work

  • An analytical model for binaural sound synthesis was successfully implemented and produced plausible sonic results.
  • The model was developed using stable digital IIR filters, fractional delay lines, and bilinear transformations, with very little computational load and memory usage. The convolutional approach to synthesizing binaural sound proved costly in terms of memory, since it required storing HRIRs for many azimuths and elevations. Moreover, the convolutional approach would require spherical interpolation to achieve continuous variation of direction.
  • HRTF individualization was explored using a spin test with white noise and sine tones; the scaling factor D proved to have a strong impact on individualization.
  • Room auralization was implemented using time-domain as well as frequency-domain techniques. On comparing the sonic results of the two, the binauralization of room impulse responses method gave a more plausible perception of the room with cleaner sonic output.
  • The BRIR synthesis was carried out by segmenting the direct sound, geometric/early reflections, and diffuse reverberation, processing them individually using weighting functions, raised-cosine filters, and convolutions, and then stitching them back together to create BRIRs.
  • Ongoing and future work includes developing a real-time framework for binauralization and room auralization that accounts for listener position shifts and head tracking. Most of the binauralization model has already been implemented as a VST plugin in JUCE.

References

  • Brown, C.P. and Duda, R.O., 1998. A structural model for binaural sound synthesis. IEEE Transactions on Speech and Audio Processing, 6(5), pp. 476-488.
  • Avendano, C. and Jot, J.M., 2002, June. Frequency domain techniques for stereo to multichannel upmix. In Audio Engineering Society Conference: 22nd International Conference: Virtual, Synthetic, and Entertainment Audio. Audio Engineering Society.
  • Rindel, J.H. and Christensen, C.L., 2003, April. Room acoustic simulation and auralization: how close can we get to the real room? In Proc. 8th Western Pacific Acoustics Conference, Melbourne.
  • Pörschmann, C., Stade, P. and Arend, J.M., 2017, June. Binaural auralization of proposed room modifications based on measured omnidirectional room impulse responses. In Proceedings of Meetings on Acoustics 173EAA (Vol. 30, No. 1, p. 015012). ASA.


Full code, implementation, and extended results are available at ajintom.com/618/