Wavelet-based methods are known to be better suited to analyzing transient signal structures than their more ubiquitous Fourier-based counterparts. While a Fourier analysis decomposes signals into sines and cosines, the wavelet method uses short, localized basis functions at multiple scales. This yields a better joint time-frequency trade-off, making it more effective at detecting transient patterns such as spikes and bursts in 1D signals, and edges in 2D images.
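To make the transient-localization claim concrete, here is a minimal sketch of a single-level Haar DWT (the simplest wavelet) in numpy; the function name and test signal are illustrative, not from the work described here:

```python
import numpy as np

def haar_dwt(x):
    # One level of the Haar DWT: approximation (low-pass) and
    # detail (high-pass) coefficients at half the sampling rate.
    x = np.asarray(x, dtype=float)
    approx = (x[0::2] + x[1::2]) / np.sqrt(2)
    detail = (x[0::2] - x[1::2]) / np.sqrt(2)
    return approx, detail

# A flat signal with one transient spike at sample 8.
x = np.zeros(16)
x[8] = 1.0
approx, detail = haar_dwt(x)
# The spike appears as a single large detail coefficient,
# localized in time -- unlike Fourier coefficients, which
# would spread the spike's energy across all frequencies.
```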
The Discrete Wavelet Transform (DWT) is a time-frequency representation that has been used not only for signal analysis and modification tasks such as compression and denoising, but also in machine learning tasks, where it offers a representation better suited for extracting features.
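As a sketch of the denoising use case mentioned above, the classic recipe is to transform, soft-threshold the detail coefficients, and invert. A minimal Haar version (illustrative function names, not the project's code) looks like this:

```python
import numpy as np

def haar_dwt(x):
    # Single-level Haar analysis: low-pass / high-pass pairs.
    x = np.asarray(x, dtype=float)
    return (x[0::2] + x[1::2]) / np.sqrt(2), (x[0::2] - x[1::2]) / np.sqrt(2)

def haar_idwt(approx, detail):
    # Exact inverse of haar_dwt (perfect reconstruction).
    x = np.empty(2 * len(approx))
    x[0::2] = (approx + detail) / np.sqrt(2)
    x[1::2] = (approx - detail) / np.sqrt(2)
    return x

def denoise(x, thresh):
    # Soft-threshold the detail coefficients: small (noise-like)
    # coefficients are zeroed, large (signal-like) ones are shrunk.
    a, d = haar_dwt(x)
    d = np.sign(d) * np.maximum(np.abs(d) - thresh, 0.0)
    return haar_idwt(a, d)
```

With `thresh = 0` the signal is reconstructed exactly; raising the threshold progressively smooths out small, transient-free fluctuations.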
In recent deep learning methods, instead of feeding a fixed representation like the DWT or STFT to the network, some models place learnable layers at the input that are trained to extract a task-optimized representation directly from the raw audio. Motivated by recent work on learning an STFT-like front-end with an autoencoder (Venkataramani et al., 2017), we set out to extend the principles of a multi-resolution perfect reconstruction filterbank (figure below) to a convolutional autoencoder architecture.
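The correspondence between a two-channel perfect reconstruction filterbank and a convolutional encoder/decoder pair can be sketched in numpy: analysis (filter + downsample) is a strided convolution, and synthesis (upsample + filter) is a transposed convolution. Here the kernels are fixed Haar filters for illustration; in the autoencoder described above they would be learnable parameters:

```python
import numpy as np

# Haar analysis filters; in the autoencoder these become learnable
# convolution kernels (the fixed values here are illustrative).
h0 = np.array([1.0, 1.0]) / np.sqrt(2)   # low-pass
h1 = np.array([1.0, -1.0]) / np.sqrt(2)  # high-pass

def analysis(x):
    # Filter then downsample by 2 == strided convolution (encoder).
    lo = np.convolve(x, h0[::-1])[1::2]
    hi = np.convolve(x, h1[::-1])[1::2]
    return lo, hi

def synthesis(lo, hi):
    # Upsample by 2 then filter == transposed convolution (decoder).
    up_lo = np.zeros(2 * len(lo)); up_lo[::2] = lo
    up_hi = np.zeros(2 * len(hi)); up_hi[::2] = hi
    # Sum the two branches; drop the trailing convolution tail.
    return (np.convolve(up_lo, h0) + np.convolve(up_hi, h1))[:-1]
```

With these filters the cascade `synthesis(*analysis(x))` reconstructs `x` exactly, which is the perfect reconstruction property the learned filterbank is meant to approximate.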
The goal was to learn the filters best suited for the analysis and synthesis of percussive drum sounds. The filterbank is "adaptive" in the sense that its filters are obtained through optimization rather than analytically. We also tried to carry over the orthogonality commonly associated with multi-resolution filterbanks to the autoencoder, by adding suitable regularization terms that encourage an orthogonal relationship between the analysis and synthesis filters.
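One common way to encourage such orthogonality is a Frobenius-norm penalty on the Gram matrix of the filter weights, added to the reconstruction loss. This is a generic sketch of that idea, not necessarily the exact regularizer used in the report:

```python
import numpy as np

def orthogonality_penalty(W):
    # Penalize || W W^T - I ||_F^2, which is zero exactly when the
    # rows of the filter matrix W form an orthonormal set. Adding
    # this term to the training loss nudges the learned analysis
    # filters toward an orthogonal filterbank.
    G = W @ W.T
    return np.sum((G - np.eye(W.shape[0])) ** 2)
```

For example, the 2x2 Haar analysis matrix is orthonormal, so its penalty is zero, while a rank-deficient filter matrix is penalized.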
Rohit M. A., Shreeharsha B. S., Kawal Jeet Singh, "Adaptive Reconstruction Filterbanks using Autoencoders", Technical Report, June 2020 (pdf)
Shrikant Venkataramani, Jonah Casebeer, Paris Smaragdis, "Adaptive Front-ends for End-to-end Source Separation", NIPS, 2017