Trainable Windows in SincNet

This website accompanies the paper "Trainable Windows in SincNet Architecture for Speaker Recognition".

Authors: Prashanth H C, Madhav Rao, Dhanya Eledath, Ramasubramanian V.

The SincNet architecture has shown significant benefits over traditional convolutional neural networks, especially on raw speech signals, for the speaker recognition task. SincNet comprises parameterized sinc functions as filters in the first layer, followed by standard convolutional layers. Although SincNet is compact and offers an interpretable view of the features it extracts, the effect of the window function used in SincNet has not been thoroughly addressed. Hamming and Hann windows are popularly used as the default time-localized windows to reduce spectral leakage. Hence, this work performs a comprehensive investigation of 28 different windowing functions in the SincNet architecture on the speaker recognition task using the TIMIT dataset. Additionally, trainable window functions were configured with tunable parameters to characterize their performance. The paper benchmarks the effect of the time-localized windowing function in terms of the bandwidth, side-lobe suppression, and spectral leakage of the filter banks employed in the first layer of the SincNet architecture. In addition, the parameterized sinc filters preserved the mel-scale representation, a characteristic property of neural networks designed for speaker recognition. Trainable Gaussian and Cosine-Sum window functions exhibited relative improvements of 41.46% and 82.11%, respectively, in sentence-level classification error rate over the Hamming window when employed in the SincNet architecture.
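To make the trainable-window idea concrete, below is a minimal PyTorch sketch of a SincNet-style first layer in which both the sinc band edges and the width of a Gaussian window are learned. The class name, initialization values, and the choice of a single sigma shared by all filters are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConvTrainableGaussian(nn.Module):
    """Sketch of a SincNet-style first layer: band-pass sinc filters with
    learnable cutoffs, multiplied by a Gaussian window whose width is a
    trainable parameter. Illustrative only, not the paper's exact code."""

    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        self.sample_rate = sample_rate
        # Learnable low cutoff and bandwidth in Hz (linear init assumed).
        self.low_hz = nn.Parameter(
            torch.linspace(30.0, sample_rate / 2 - 300.0, n_filters).unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((n_filters, 1), 100.0))
        # Trainable window parameter: relative Gaussian width (assumption:
        # one sigma shared by all 80 filters).
        self.sigma = nn.Parameter(torch.tensor(0.4))
        # Symmetric sample indices centred on zero.
        self.register_buffer(
            "n",
            torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2)

    def forward(self, x):  # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz) / self.sample_rate
        f2 = f1 + torch.abs(self.band_hz) / self.sample_rate
        # Band-pass = difference of two low-pass sinc filters:
        # g[n] = 2*f2*sinc(2*f2*n) - 2*f1*sinc(2*f1*n).
        band_pass = (2 * f2 * torch.sinc(2 * f2 * self.n)
                     - 2 * f1 * torch.sinc(2 * f1 * self.n))
        # Gaussian window with trainable sigma; gradients flow into sigma,
        # so the window shape is learned jointly with the cutoffs.
        half = (self.kernel_size - 1) / 2
        window = torch.exp(-0.5 * (self.n / (self.sigma * half)) ** 2)
        filters = (band_pass * window).unsqueeze(1)  # (n_filters, 1, K)
        return F.conv1d(x, filters, padding=self.kernel_size // 2)
```

Calling `SincConvTrainableGaussian()(torch.randn(4, 1, 16000))` yields a `(4, 80, 16000)` feature map. Swapping the Gaussian for a trainable cosine-sum window would only change the `window` expression, with the cosine coefficients taking the place of `sigma` as the learned parameters.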

Sentence-level CER for all the windows:

Individual window results are available at:

Reading the information on this site:

Each window page has the following plots (a sketch showing how the filter-bank plots can be reproduced follows this list):

(a): Time-domain plot of all 80 sinc filters

(b): Time-domain plot of the window function

(c): Time-domain plot of the 80 sinc filters after windowing

(d): Cumulative frequency response of the windowed filters

(e): The 80 filters in the frequency domain

(f): Frame-level training error rate

(g): Training loss vs. epoch

(h): Frame-level test error rate

(i): Test loss vs. epoch

(j): Sentence-level test error rate
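For concreteness, here is a minimal NumPy/Matplotlib sketch of how panels (c), (d), and (e) could be reproduced: it builds the windowed sinc filter bank with mel-spaced band edges (a plain Hamming window stands in for any of the 28 windows studied) and plots the cumulative frequency response. The kernel length and band-edge initialization are illustrative assumptions; only the filter count of 80 comes from the paper.

```python
import numpy as np
import matplotlib.pyplot as plt

def sinc_bandpass(f1, f2, kernel_size, fs):
    """Ideal band-pass impulse response: difference of two low-pass
    sinc filters with cutoffs f1 < f2 (both in Hz)."""
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    def lowpass(f):
        return 2 * (f / fs) * np.sinc(2 * (f / fs) * n)
    return lowpass(f2) - lowpass(f1)

fs, kernel_size, n_filters = 16000, 251, 80
# Mel-spaced band edges, mirroring SincNet's mel-scale initialization.
mel = np.linspace(2595 * np.log10(1 + 30 / 700),
                  2595 * np.log10(1 + (fs / 2) / 700), n_filters + 1)
edges = 700 * (10 ** (mel / 2595) - 1)

window = np.hamming(kernel_size)  # stand-in for any of the 28 windows
filters = np.stack([sinc_bandpass(f1, f2, kernel_size, fs) * window
                    for f1, f2 in zip(edges[:-1], edges[1:])])  # panel (c)

freqs = np.fft.rfftfreq(kernel_size, d=1 / fs)
responses = np.abs(np.fft.rfft(filters, axis=1))  # panel (e)
cumulative = responses.sum(axis=0)                # panel (d)

plt.plot(freqs, 20 * np.log10(cumulative + 1e-9))
plt.xlabel("Frequency (Hz)")
plt.ylabel("Magnitude (dB)")
plt.title("Cumulative frequency response of the windowed filters")
plt.show()
```

The same `responses` array, plotted per filter instead of summed, gives panel (e), and the rows of `filters` give the windowed time-domain filters of panel (c).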

All files and results on this site are distributed under the MIT License.

Copyright 2022 Prashanth H C (prashanth.c@iiitb.ac.in)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Resources:

SincNet paper: https://arxiv.org/abs/1808.00158