Y-net
This work presents a neural network architecture that takes the raw audio waveform and its spectrogram as inputs and estimates a mask with the same shape as the input spectrogram, which can be used to separate and extract the singing voice from a mixture. Specifically, the estimated mask is multiplied with the original mixture spectrogram to obtain the magnitude spectrogram of the singing voice. It is worth noting that our network does not alter the phase of the original mixture signal. This approach is similar to the convolutional recurrent neural network with the attention framework proposed by C. Sun et al. for speech separation in monaural recordings [16]. To obtain the final predicted source signals, we combine the estimated magnitude spectrogram with the phase spectrogram of the original mixture signal to form the complex spectrogram, and apply the Inverse Short Time Fourier Transform (ISTFT) to obtain the vocal waveform, as shown in Figure 1. By preserving the phase information, the predicted source signals retain the original temporal structure and sound coherence.
Our network consists of two encoder branches, namely the waveform branch and the spectral branch, which take the raw audio and the spectrogram as inputs, respectively. The front end of the waveform branch includes a learnable spectrogram module, which converts the raw audio into a spectrogram-like representation. The two encoders extract features from their respective inputs, which are merged at the network core and passed to the decoder. The decoder consists of a stack of up-sampling layers that restores the features to the original size of the spectrogram, as shown above. Note that the two encoders and the decoder are structurally symmetric.
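As a concrete illustration, the masking and reconstruction step can be sketched as follows. This is a minimal sketch assuming a PyTorch implementation; the model interface, FFT size, and hop length are illustrative assumptions rather than the exact configuration used in our experiments.

```python
import torch

# Minimal sketch of the masking and ISTFT reconstruction step, assuming a
# PyTorch implementation. `model`, n_fft, and hop_length are hypothetical
# stand-ins, not the exact Y-net configuration.
def separate_vocals(model, mixture_wave, n_fft=2046, hop_length=512):
    window = torch.hann_window(n_fft)
    # Complex mixture spectrogram: the magnitude feeds the spectral branch,
    # the phase is kept untouched for reconstruction.
    mix_stft = torch.stft(mixture_wave, n_fft, hop_length,
                          window=window, return_complex=True)
    magnitude, phase = mix_stft.abs(), torch.angle(mix_stft)

    # The network estimates a mask with the same shape as the spectrogram.
    mask = model(mixture_wave, magnitude)

    # Element-wise masking; the mixture phase is reused unchanged.
    vocal_mag = mask * magnitude
    vocal_stft = vocal_mag * torch.exp(1j * phase)

    # ISTFT brings the estimated vocal back to the time domain.
    return torch.istft(vocal_stft, n_fft, hop_length, window=window)
```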
The representation from the learnable spectrogram is then sent through five encoder layers that double the number of channels with an increasing dilation rate while halving the size of the spectral image as the network goes deeper. Each encoder layer consists of a batch normalization layer followed by two convolution layers with a kernel size of 3x3, a max pooling layer, and a dropout layer. LeakyReLU is used as the activation function after each convolution. Before feeding the spectral branch, we transform the raw audio mixture waveform to a spectrogram using the Short Time Fourier Transform (STFT). In our experiments, the input spectrogram consists of 1024 frequency bins and 128 time bins. The spectrogram is sent through a five-encoder structure similar to that of the waveform branch, except that the activation function after each convolution is ReLU. Following the fifth encoder block, the network concatenates the features extracted from the spectral and waveform branches, which have equal size at this point. This concatenation takes place at the network's core and is followed by two convolutional layers and a batch normalization layer. The concatenated features are then activated with ReLU before being forwarded to the decoder part of the network.
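One encoder block of the kind described above could look roughly as follows. The channel counts, dilation values, and dropout rate in this sketch are illustrative assumptions, not the exact values used in our network.

```python
import torch.nn as nn

# A sketch of one encoder block as described above: batch normalization,
# two 3x3 convolutions, max pooling, and dropout. Channel counts, dilation,
# and dropout rate are illustrative assumptions.
class EncoderBlock(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=1, spectral=False, p_drop=0.1):
        super().__init__()
        # The spectral branch uses ReLU, the waveform branch uses LeakyReLU.
        act = nn.ReLU() if spectral else nn.LeakyReLU()
        self.convs = nn.Sequential(
            nn.BatchNorm2d(in_ch),
            nn.Conv2d(in_ch, out_ch, 3, padding=dilation, dilation=dilation),
            act,
            nn.Conv2d(out_ch, out_ch, 3, padding=dilation, dilation=dilation),
            act,
        )
        self.pool = nn.MaxPool2d(2)        # halves the spatial size
        self.drop = nn.Dropout2d(p_drop)

    def forward(self, x):
        skip = self.convs(x)               # kept for the skip connection
        return self.drop(self.pool(skip)), skip
```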
Skip connections are an important aspect of our architecture because they propagate information from both encoder branches to the shared decoder branch, bypassing the intermediate processing stages of the network. They help to mitigate the effects of the down-sampling and up-sampling operations, which can otherwise cause important information in the audio signal to be lost. Additionally, skip connections provide a direct path between the encoder and decoder layers, allowing the model to better preserve high-frequency details and fine-grained structures in the output audio signal.
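The routing of these skip connections from the two encoder branches into the shared decoder can be sketched as below; the function and variable names are hypothetical, and the per-level features are assumed to be saved by the encoders during the forward pass.

```python
# Hypothetical sketch of how the skip connections from both encoder branches
# could be routed to the shared decoder. `wave_skips` and `spec_skips` are
# the per-level features saved by the waveform and spectral encoders.
def decode_with_skips(core_features, wave_skips, spec_skips, decoder_blocks):
    x = core_features
    for block, w_skip, s_skip in zip(decoder_blocks,
                                     reversed(wave_skips),
                                     reversed(spec_skips)):
        # Each decoder level receives the previous decoder output plus the
        # matching-resolution features from both encoder branches.
        x = block(x, w_skip, s_skip)
    return x
```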
The encoder output is sent through five decoder layers, where each decoder layer consists of an up-sampling layer followed by two convolution layers with a kernel size of 3x3, each activated with the ReLU function, and a batch normalization layer. Each decoder layer receives three inputs: the output of the previous decoder layer and two skip connections coming from the corresponding encoder layers of the spectral and waveform branches. The output of the final decoder layer is sent through a convolution layer with a kernel size of 1x1 and activated by Hardtanh in order to enhance the vocal features and suppress other features in the frequency domain. Finally, the decoder output gives the estimated mask. Element-wise multiplication of the decoder output with the input mixture spectrogram gives the spectrogram of the singing voice. In summary, the encoder-decoder architecture provides a way to extract meaningful features from both inputs and then use these features to generate a mask that separates the singing voice.
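A minimal sketch of one decoder block and the final masking head is given below. It assumes up-sampling by a factor of two, channel-wise concatenation of the two encoder skips, and a Hardtanh range of [0, 1] for the mask; the channel sizes and these choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Sketch of one decoder block: up-sampling, concatenation with the two
# encoder skips, two 3x3 convolutions with ReLU, and batch normalization.
class DecoderBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.convs = nn.Sequential(
            nn.Conv2d(in_ch + 2 * skip_ch, out_ch, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x, wave_skip, spec_skip):
        x = self.up(x)
        # Three inputs per level: previous decoder output plus the two skips.
        x = torch.cat([x, wave_skip, spec_skip], dim=1)
        return self.convs(x)

# The final 1x1 convolution with Hardtanh yields the mask; multiplying it
# element-wise with the mixture magnitude gives the vocal spectrogram.
mask_head = nn.Sequential(nn.Conv2d(16, 1, kernel_size=1),
                          nn.Hardtanh(min_val=0.0, max_val=1.0))
# vocal_magnitude = mask_head(decoder_output) * mixture_magnitude
```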