Videos

The SpeechBrain Project


For more videos on SpeechBrain, please visit the SpeechBrain YouTube Channel.

SpeechBrain is an open-source, all-in-one speech toolkit based on PyTorch.

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised and unsupervised learning, speech contamination/augmentation, and many others.
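
As a quick illustration of the intended user experience, a pretrained model can be loaded and used in a few lines. The sketch below follows SpeechBrain's published quick-start examples; the exact model name and import path are assumptions that may differ between toolkit versions.

```python
# Minimal sketch: transcribing a file with a pretrained SpeechBrain model.
# The model identifier is an assumption based on the recipes published on
# HuggingFace; names and APIs may vary across SpeechBrain versions.
from speechbrain.pretrained import EncoderDecoderASR

asr_model = EncoderDecoderASR.from_hparams(
    source="speechbrain/asr-crdnn-rnnlm-librispeech",
    savedir="pretrained_models/asr-crdnn-rnnlm-librispeech",
)
print(asr_model.transcribe_file("example.wav"))
```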

Toward Unsupervised Learning of Speech Representations

Learning good representations without supervision is still an open issue in machine learning, and is particularly challenging for speech signals, which are often characterized by long sequences with a complex hierarchical structure. Some recent works, however, have shown that it is possible to derive useful speech representations by employing a self-supervised encoder-discriminator approach. This video summarizes some recent work on self-supervised learning for audio and speech representations.
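
To make the encoder-discriminator idea concrete, the sketch below shows a minimal and deliberately simplified PyTorch setup: an encoder embeds raw-audio chunks, and a discriminator learns to tell whether two embeddings come from the same utterance. All architectures, shapes, and hyper-parameters are illustrative assumptions, not the exact models discussed in the video.

```python
# Illustrative self-supervised encoder-discriminator objective for speech:
# the discriminator scores pairs of embeddings as coming from the same
# utterance (positive) or different utterances (negative).
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.proj = nn.Linear(128, emb_dim)

    def forward(self, wav):                 # wav: (batch, 1, samples)
        h = self.net(wav).squeeze(-1)       # (batch, 128)
        return self.proj(h)                 # (batch, emb_dim)

class Discriminator(nn.Module):
    def __init__(self, emb_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * emb_dim, 256), nn.ReLU(),
                                 nn.Linear(256, 1))

    def forward(self, z1, z2):
        return self.net(torch.cat([z1, z2], dim=-1))   # real-valued score

encoder, disc = Encoder(), Discriminator()
bce = nn.BCEWithLogitsLoss()

# anchor/positive: two chunks of the same utterance; negative: another utterance
anchor, positive, negative = (torch.randn(8, 1, 16000) for _ in range(3))
z_a, z_p, z_n = encoder(anchor), encoder(positive), encoder(negative)
loss = bce(disc(z_a, z_p), torch.ones(8, 1)) + \
       bce(disc(z_a, z_n), torch.zeros(8, 1))
loss.backward()
```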

A Brief Introduction to SincNet

Promising results have recently been obtained with Convolutional Neural Networks (CNNs) fed directly with raw speech samples. Rather than employing standard hand-crafted features, these CNNs learn low-level speech representations from waveforms, potentially allowing the network to better capture important narrow-band speaker characteristics such as pitch and formants. Proper design of the neural network is crucial to achieve this goal. This video describes a novel CNN architecture, called SincNet, that encourages the first convolutional layer to discover more meaningful filters. SincNet is based on parametrized sinc functions, which implement band-pass filters. In contrast to standard CNNs, which learn all the elements of each filter, SincNet learns only the low and high cutoff frequencies of each filter directly from data.
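
The core trick can be sketched in a few lines of PyTorch: the first convolutional layer stores only per-filter cutoff frequencies, and the band-pass filter taps are computed from sinc functions on the fly. Filter length, number of filters, and initialization below are illustrative assumptions, not the official SincNet implementation.

```python
# Sketch of a sinc-based convolutional layer: the only learnable parameters
# are the low cutoff and bandwidth of each band-pass filter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SincConv(nn.Module):
    def __init__(self, n_filters=80, kernel_size=251, sample_rate=16000):
        super().__init__()
        # learnable low cutoffs and bandwidths (in Hz) -- assumed init values
        self.low_hz = nn.Parameter(torch.linspace(30, 4000, n_filters))
        self.band_hz = nn.Parameter(torch.full((n_filters,), 400.0))
        # time axis of the filter (seconds), centered at zero
        n = torch.arange(kernel_size) - (kernel_size - 1) / 2
        self.register_buffer("n", n / sample_rate)

    def forward(self, x):                       # x: (batch, 1, samples)
        f1 = torch.abs(self.low_hz)             # low cutoff  (Hz)
        f2 = f1 + torch.abs(self.band_hz)       # high cutoff (Hz)
        # band-pass filter = difference of two low-pass (sinc) filters
        t = self.n.unsqueeze(0)                 # (1, kernel_size)
        low = 2 * f1.unsqueeze(1) * torch.sinc(2 * f1.unsqueeze(1) * t)
        high = 2 * f2.unsqueeze(1) * torch.sinc(2 * f2.unsqueeze(1) * t)
        filters = (high - low).unsqueeze(1)     # (n_filters, 1, kernel_size)
        return F.conv1d(x, filters)

layer = SincConv()
out = layer(torch.randn(4, 1, 16000))           # (4, 80, 15750)
```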

The PyTorch-Kaldi Toolkit

The availability of open-source software is playing a remarkable role in the popularization of speech recognition and deep learning. Kaldi, for instance, is nowadays an established framework for developing state-of-the-art speech recognizers. PyTorch is used to build neural networks with the Python language and has recently attracted tremendous interest within the machine learning community thanks to its simplicity and flexibility.

The PyTorch-Kaldi project aims to bridge the gap between these popular toolkits, trying to inherit the efficiency of Kaldi and the flexibility of PyTorch. PyTorch-Kaldi is not only a simple interface between these toolkits, but it also embeds several useful features for developing modern speech recognizers.
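
As a rough idea of the kind of glue code PyTorch-Kaldi automates, the sketch below reads Kaldi-computed features and feeds them to a PyTorch acoustic model. The `kaldi_io` helper package, file names, and dimensions are assumptions for illustration only.

```python
# Illustrative bridge between Kaldi features and a PyTorch acoustic model.
import kaldi_io          # pip install kaldi_io (third-party reader, assumed)
import torch
import torch.nn as nn

feat_dim, n_states = 40, 3480            # assumed feature and senone dimensions
model = nn.Sequential(
    nn.Linear(feat_dim, 1024), nn.ReLU(),
    nn.Linear(1024, n_states),           # per-frame senone scores
)

for utt_id, feats in kaldi_io.read_mat_scp("data/train/feats.scp"):
    x = torch.from_numpy(feats).float()  # (frames, feat_dim)
    log_post = torch.log_softmax(model(x), dim=-1)
    # log posteriors would then be written back in Kaldi format for decoding
```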

Cooperative Networks of DNNs

A prominent limitation of current distant speech recognition systems lies in the lack of matching and communication between the various technologies involved in the process. The speech enhancement and speech recognition modules are, for instance, often trained independently. Moreover, speech enhancement normally helps the speech recognizer, but the output of the recognizer is not commonly used, in turn, to improve the speech enhancement. To address both concerns, we propose a novel architecture based on a network of deep neural networks, where all the components are jointly trained and better cooperate with each other thanks to a full communication scheme between them.
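
A minimal sketch of this joint-training idea is shown below: an enhancement network and a recognizer are trained together, and a second enhancement pass also receives the recognizer's hidden states, so information flows in both directions. Architectures, dimensions, and loss weights are illustrative assumptions, not the exact networks used in our work.

```python
# Toy example of jointly training a speech enhancement network and a
# recognizer, with recognizer hidden states fed back to the enhancer.
import torch
import torch.nn as nn

feat_dim, hid, n_states = 40, 512, 2000

enhance = nn.GRU(feat_dim + hid, hid, batch_first=True)
enh_out = nn.Linear(hid, feat_dim)        # predicts clean features
asr = nn.GRU(feat_dim, hid, batch_first=True)
asr_out = nn.Linear(hid, n_states)        # per-frame state scores

mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

noisy = torch.randn(4, 100, feat_dim)                 # noisy features
clean = torch.randn(4, 100, feat_dim)                 # clean targets
labels = torch.randint(0, n_states, (4, 100))         # frame-level labels

# pass 1: enhancement without recognizer feedback (context set to zero)
ctx = torch.zeros(4, 100, hid)
h_enh, _ = enhance(torch.cat([noisy, ctx], dim=-1))
est_clean = enh_out(h_enh)

# the recognizer consumes the enhanced features
h_asr, _ = asr(est_clean)
logits = asr_out(h_asr)

# pass 2: enhancement re-run with the recognizer's hidden states as feedback
h_enh2, _ = enhance(torch.cat([noisy, h_asr], dim=-1))
est_clean2 = enh_out(h_enh2)

loss = mse(est_clean, clean) + mse(est_clean2, clean) \
     + ce(logits.reshape(-1, n_states), labels.reshape(-1))
loss.backward()                                        # joint training step
```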

Twin Regularization for Online Speech Recognition

Online speech recognition is crucial for developing natural human-machine interfaces. This modality, however, is significantly more challenging than offline ASR, since real-time/low-latency constraints inevitably hinder the use of future information, which is known to be very helpful for robust predictions. A popular solution to mitigate this issue consists of feeding neural acoustic models with context windows that gather some future frames. This introduces a latency that depends on the number of look-ahead features employed. This work explores a different approach, based on estimating the future rather than waiting for it. Our technique encourages the hidden representations of a unidirectional recurrent network to embed some useful information about the future.
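
A simplified sketch of this twin-regularization idea: a causal forward RNN (the one used online) is trained jointly with a backward "twin" that sees the future, and an L2 term pulls the forward hidden states toward the backward ones; at test time only the forward network is kept. Dimensions, the task loss, and the regularization weight below are assumptions.

```python
# Toy twin-regularization step: MSE between forward and backward hidden states
# is added to the frame-level classification loss.
import torch
import torch.nn as nn

feat_dim, hid, n_classes, lam = 40, 512, 2000, 0.1

fwd_rnn = nn.GRU(feat_dim, hid, batch_first=True)          # causal, used online
bwd_rnn = nn.GRU(feat_dim, hid, batch_first=True)          # twin, training only
classifier = nn.Linear(hid, n_classes)
ce, mse = nn.CrossEntropyLoss(), nn.MSELoss()

feats = torch.randn(4, 100, feat_dim)
labels = torch.randint(0, n_classes, (4, 100))

h_fwd, _ = fwd_rnn(feats)
h_bwd, _ = bwd_rnn(torch.flip(feats, dims=[1]))             # run over reversed time
h_bwd = torch.flip(h_bwd, dims=[1])                         # re-align with forward states

task_loss = ce(classifier(h_fwd).reshape(-1, n_classes), labels.reshape(-1))
twin_loss = mse(h_fwd, h_bwd)                               # encourage future-aware states
(task_loss + lam * twin_loss).backward()
```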

Interview at RTTR

My interview on speech recognition technologies and their future (in Italian).

Prototype of the DIRHA system

The DIRHA project addressed the development of voice-enabled automated home environments based on distant-speech interaction in different languages. A distributed microphone network was installed in the rooms of a house in order to selectively monitor the acoustic and speech activities observable inside any space, and to eventually run a spoken dialogue session with a given user in order to carry out a service or to access appliances and other devices. In the video, you can find some of these functionalities implemented in a demo system.