Deep Learning based Monaural Separation of Harmonic Sounds for Audio Signal Enhancement and Pattern Recognition
INTRODUCTION
Music source separation (MSS) is the task of separating a music piece into its individual sources, such as vocals and accompaniment. MSS is typically more challenging than many other audio separation tasks: a two-channel recording may contain many musical instruments and voices, and the sources have usually been processed with filters and reverberation (sometimes nonlinearly) during recording and mixing. In some cases the sources move or the production parameters change over time, so the mixture itself is time varying. Nevertheless, musical sound sources have characteristic properties and structures that can be exploited. For example, musical source signals often have a regular harmonic structure, with partials at integer multiples of a fundamental frequency, and frequency contours characteristic of each instrument. They may also repeat particular temporal patterns that follow the musical structure.
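For illustration only (not part of the proposed method), the following sketch synthesizes a tone with partials at integer multiples of an assumed fundamental frequency and inspects its magnitude spectrogram, where the energy concentrates in narrow bands around the harmonic frequencies:

```python
# Illustrative example: synthesize a harmonic tone and inspect its spectrogram.
import numpy as np
from scipy.signal import stft

sr = 16000                        # sample rate in Hz
t = np.arange(0, 2.0, 1.0 / sr)   # two seconds of audio
f0 = 220.0                        # assumed fundamental frequency

# Sum the first five harmonics with decaying amplitudes, mimicking the
# regular harmonic structure of a pitched musical instrument.
signal = sum((0.5 ** k) * np.sin(2 * np.pi * k * f0 * t) for k in range(1, 6))

# The magnitude spectrogram concentrates energy in narrow bands at k * f0.
freqs, frames, Z = stft(signal, fs=sr, nperseg=1024, noverlap=768)
magnitude = np.abs(Z)
peak_hz = freqs[magnitude.mean(axis=1).argmax()]
print(peak_hz)                    # close to the 220 Hz fundamental
```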
In the recent past, classical approaches such as independent component analysis (ICA) and non-negative matrix factorization (NMF) have been used to separate audio. The main limitation of ICA is that it requires at least as many mixture channels as sources; when fewer channels are available, as in the monaural case, the problem becomes under-determined. Among the classical methods, NMF performed comparatively well. With the rise of neural networks, single-channel audio source separation has advanced dramatically in recent years, with both time-domain and frequency-domain architectures being proposed.
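As a point of reference for the classical NMF baseline mentioned above, the sketch below factorizes a magnitude spectrogram with scikit-learn's NMF and reconstructs one source via a soft mask. The spectrogram V, the number of components, and the grouping of components into a source are placeholder assumptions.

```python
# Minimal NMF separation sketch; V and the component grouping are placeholders.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
V = np.abs(rng.standard_normal((513, 200)))   # stand-in magnitude spectrogram (bins x frames)

# Factorize V ~ W @ H: columns of W are spectral templates, rows of H their activations.
model = NMF(n_components=8, init="random", max_iter=400, random_state=0)
W = model.fit_transform(V)                    # (513, 8)
H = model.components_                         # (8, 200)

# Group some components into one source, build a soft mask, and apply it
# to the mixture spectrogram (Wiener-style filtering).
source_idx = [0, 1, 2, 3]                     # components assumed to belong to one source
V_source = W[:, source_idx] @ H[source_idx, :]
mask = V_source / np.maximum(W @ H, 1e-8)
V_separated = mask * V
```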
In this project, we aim to separate audio sources by applying and modifying different methods, comparing their results, and identifying the best solution.
This project proposes a novel deep learning-based architecture, named Y-Net, for music source separation. Y-Net performs end-to-end hybrid source separation by extracting features from both the spectrogram and waveform domains, and it predicts a spectrogram mask to separate vocal sources from a mixture signal. The results show that the proposed architecture is effective for music source separation while using fewer parameters than existing state-of-the-art methods. The key contribution of this work is the integration of the raw audio and spectrogram representations to estimate a time-frequency mask for separating the singing voice. This approach has the advantage of learning filters that are not captured by the Short-Time Fourier Transform (STFT) spectrogram while still incorporating phase information. The proposed Y-Net architecture offers a promising route to improving the accuracy and efficiency of music source separation, with practical applications in creating karaoke tracks, transcribing music, and music production.
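To make the hybrid idea concrete, the following PyTorch sketch shows it in miniature: one branch encodes the raw waveform, another encodes the STFT magnitude, and their features are fused to predict a sigmoid mask that is applied to the mixture magnitude together with the mixture phase. The layer sizes, fusion scheme, and module names here are illustrative assumptions, not the actual Y-Net architecture.

```python
# Illustrative hybrid mask-estimation sketch; NOT the paper's Y-Net layers.
import torch
import torch.nn as nn

class HybridMaskEstimator(nn.Module):
    def __init__(self, n_fft=1024, hop=256, hidden=64):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        n_bins = n_fft // 2 + 1
        # Waveform branch: learned 1-D filters not fixed by the STFT basis.
        self.wave_branch = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=n_fft, stride=hop, padding=n_fft // 2),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Spectrogram branch: convolutions over the STFT magnitude frames.
        self.spec_branch = nn.Sequential(
            nn.Conv1d(n_bins, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Fusion head predicts a sigmoid time-frequency mask.
        self.mask_head = nn.Sequential(
            nn.Conv1d(2 * hidden, n_bins, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, mixture):                       # mixture: (batch, samples)
        window = torch.hann_window(self.n_fft, device=mixture.device)
        spec = torch.stft(mixture, self.n_fft, self.hop,
                          window=window, return_complex=True)
        mag = spec.abs()                              # (batch, n_bins, frames)
        wave_feat = self.wave_branch(mixture.unsqueeze(1))
        spec_feat = self.spec_branch(mag)
        frames = min(wave_feat.shape[-1], spec_feat.shape[-1])
        fused = torch.cat([wave_feat[..., :frames], spec_feat[..., :frames]], dim=1)
        mask = self.mask_head(fused)                  # values in [0, 1]
        # Apply the mask to the mixture magnitude and reuse the mixture phase.
        est_spec = torch.polar(mask * mag[..., :frames], spec.angle()[..., :frames])
        return torch.istft(est_spec, self.n_fft, self.hop, window=window)

model = HybridMaskEstimator()
vocals = model(torch.randn(1, 16384))                 # mixture waveform -> separated estimate
```

The design choice illustrated here is the one highlighted above: the waveform branch can learn filters beyond the fixed STFT basis, while masking the mixture spectrogram and reusing its phase keeps the reconstruction simple.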