Here is a brief overview of my PhD work:
Have you ever wondered how your brain focuses on the speech of the person you are conversing with in a noisy environment?
This phenomenon is called the cocktail party effect in neuroscience and blind source separation in signal processing. Developing and implementing algorithms that separate two or more audio sources from just a single mixture was the main project of my PhD. The potential applications include hearing aid systems, signal pre-processing for speech recognition, medical signal processing, and many sound-related applications in artificial intelligence.
One classical approach to the problem is Independent Component Analysis (ICA), but it fails when only a single mixture of multiple speakers is available (as in our case). With a single mixture of multiple audio sources, the statistics of the sources are the only cue one can exploit to perform source separation.
I developed and implemented both linear and non-linear methods for monaural audio source separation. The linear approaches were also implemented in real time, achieving roughly 46 ms of audio latency. Click here for the paper published in Neural Computation describing the linear approaches. In brief, three linear methods are presented:
1) Eigenmode analysis of covariance difference (EACD) to identify spectro-temporal features associated with large variance for one source and small variance for the other source.
2) Maximum likelihood demixing (MLD) in which the mixture is modeled as the sum of two Gaussian signals and maximum likelihood is used to identify the most likely sources.
3) Suppression-regression (SR) in which auto-regressive models are trained to reproduce one source but suppress the other.
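The three linear ideas above can be illustrated on toy Gaussian "spectrogram frames". This is a hedged sketch with invented data, not the published implementation; in particular, the ridge regression in step 3 stands in for the trained auto-regressive models:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 8, 2000  # feature dimension, number of frames

# Toy "spectrogram frames" for two sources with different covariances.
A = rng.normal(size=(d, d))
B = rng.normal(size=(d, d))
S1 = A @ rng.normal(size=(d, n))  # source 1 frames, covariance ~ A A^T
S2 = B @ rng.normal(size=(d, n))  # source 2 frames, covariance ~ B B^T
M = S1 + S2                        # observed single-channel mixture

# 1) EACD: eigenmodes of the covariance difference. Eigenvectors with
# large positive eigenvalues mark feature directions with high variance
# for source 1 and low variance for source 2 (and vice versa).
C1, C2 = np.cov(S1), np.cov(S2)
evals, evecs = np.linalg.eigh(C1 - C2)
mode_for_s1 = evecs[:, -1]  # direction favouring source 1

# 2) MLD: modelling both sources as zero-mean Gaussians, the most likely
# source 1 given the mixture has the Wiener-like closed form
# s1_hat = C1 (C1 + C2)^{-1} m.
W_mld = C1 @ np.linalg.inv(C1 + C2)
S1_mld = W_mld @ M

# 3) SR (stand-in): fit a regularised linear map that reproduces
# source 1 from the mixture while implicitly suppressing source 2.
lam = 1e-3
W_sr = S1 @ M.T @ np.linalg.inv(M @ M.T + lam * np.eye(d))
S1_sr = W_sr @ M

# Demixing should beat using the raw mixture as the estimate.
err_mld = np.linalg.norm(S1 - S1_mld) / np.linalg.norm(S1)
err_mix = np.linalg.norm(S1 - M) / np.linalg.norm(S1)
```

On this toy data the MLD estimate has a much smaller residual than the unprocessed mixture, which is the essence of what the methods exploit: the two sources have different second-order statistics.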
We compared our linear methods with a non-linear method for source separation, Non-Negative Sparse Coding (NNSC), and showed that overall our methods perform significantly better (p < 0.01).
I also developed a non-linear approach called Multi-Layered Random Forest (MLRF). It achieved state-of-the-art results, performing significantly better than deep-learning approaches for monaural source separation. The method uses random forests to estimate an Ideal Binary Mask (IBM) for each source present in the mixture.
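The core idea of mask estimation can be sketched as follows: a random forest classifies each time-frequency bin of the mixture spectrogram as dominated by one source or the other (the IBM). This is a toy illustration with invented spectrograms and features, not the MLRF pipeline itself:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
freq_bins, frames = 16, 500

# Toy magnitude spectrograms: source 1 dominates low frequencies,
# source 2 dominates high frequencies.
S1 = rng.gamma(2.0, size=(freq_bins, frames)) * np.linspace(2.0, 0.2, freq_bins)[:, None]
S2 = rng.gamma(2.0, size=(freq_bins, frames)) * np.linspace(0.2, 2.0, freq_bins)[:, None]
MIX = S1 + S2

# Ideal Binary Mask: 1 where source 1 dominates a time-frequency bin.
IBM = (S1 > S2).astype(int)

# Simple per-bin features: log-magnitude of the mixture + frequency index.
f_idx = np.repeat(np.arange(freq_bins), frames)
X = np.column_stack([np.log(MIX).ravel(), f_idx])
y = IBM.ravel()

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(Xtr, ytr)
acc = clf.score(Xte, yte)  # should comfortably beat chance on this toy task

# Apply the estimated mask to recover source 1's spectrogram.
mask = clf.predict(X).reshape(freq_bins, frames)
S1_est = MIX * mask
```

In practice the features would be richer (local spectro-temporal context rather than a single bin), but the classification-of-bins framing is the same.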
We quantify the performance of our algorithms in terms of the residual error between the estimated and original spectrograms (lower is better), audio waveform signal-to-noise ratio (SNR; higher is better), Perceptual Evaluation of Speech Quality (PESQ) scores, and Short-Time Objective Intelligibility (STOI) scores.
The strengths of my linear methods are their simplicity, their ease of real-time implementation, and their better performance than NNSC. The main strength of my MLRF method is its state-of-the-art performance in the field.
I have also worked with spike data, mainly encoding analog audio into spikes and reconstructing the audio from them. This work has been published in Frontiers in Neuroscience (click here to access the paper).
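To give a flavour of spike coding, here is one simple scheme (send-on-delta), not necessarily the one used in the paper: a positive or negative spike is emitted whenever the signal moves by more than a threshold, and the decoder reconstructs the signal by accumulating spikes:

```python
import numpy as np

def encode(signal, threshold):
    """Emit (sample_index, +1/-1) spikes whenever the signal drifts
    more than `threshold` away from the tracked reference level."""
    spikes, ref = [], signal[0]
    for i, x in enumerate(signal):
        while x - ref >= threshold:
            spikes.append((i, +1))
            ref += threshold
        while ref - x >= threshold:
            spikes.append((i, -1))
            ref -= threshold
    return spikes

def decode(spikes, n, start, threshold):
    """Reconstruct the signal by accumulating spikes onto the start level."""
    recon = np.empty(n)
    val, j = start, 0
    for i in range(n):
        while j < len(spikes) and spikes[j][0] == i:
            val += spikes[j][1] * threshold
            j += 1
        recon[i] = val
    return recon

t = np.linspace(0, 1, 1000)
audio = np.sin(2 * np.pi * 5 * t)  # toy low-frequency "audio"
th = 0.05
sp = encode(audio, th)
rec = decode(sp, audio.size, audio[0], th)
err = np.max(np.abs(audio - rec))  # bounded by the spike threshold
```

The reconstruction error of this scheme is bounded by the threshold, which controls the trade-off between spike rate and fidelity.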