Current projects

The goal is to create a single, flexible, and user-friendly toolkit that can be used to easily develop state-of-the-art speech technologies, including systems for speech recognition (both end-to-end and HMM-DNN), speaker recognition, speech separation, multi-microphone signal processing (e.g., beamforming), self-supervised and unsupervised learning, speech contamination/augmentation, and many others. The toolkit will be designed as a stand-alone framework, but simple interfaces with well-known toolkits, such as Kaldi, will also be implemented.

SpeechBrain is currently under development and was announced in September 2019. A first alpha version will be available in the coming months.

PASE (Problem-Agnostic Speech Encoder)

PASE is a project that aims to improve self-supervised methods for audio and speech. Within the PASE project, we are exploring the use of a neural encoder followed by multiple workers that jointly solve different self-supervised tasks. The consensus required across the different tasks naturally imposes meaningful constraints on the encoder, helping it discover general representations and minimizing the risk of learning superficial ones.

https://github.com/santi-pdp/pase
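
To make the encoder/worker setup more concrete, here is a minimal PyTorch sketch of the idea: a shared encoder feeds several small workers, and the sum of their losses is backpropagated jointly into the encoder. All module names, layer sizes, and the choice of regression targets below are illustrative assumptions, not the actual PASE implementation (see the repository above for that).

```python
# Minimal sketch of the encoder + multi-worker self-supervised setup.
# Architectures, dimensions, and targets are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Shared convolutional encoder: raw waveform -> frame-level embeddings."""
    def __init__(self, emb_dim=100):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(64, emb_dim, kernel_size=8, stride=4), nn.ReLU(),
        )
    def forward(self, wav):            # wav: (batch, 1, samples)
        return self.net(wav)           # (batch, emb_dim, frames)

class Worker(nn.Module):
    """Small head that regresses one self-supervised target from the embedding."""
    def __init__(self, emb_dim, target_dim):
        super().__init__()
        self.head = nn.Conv1d(emb_dim, target_dim, kernel_size=1)
    def forward(self, emb):
        return self.head(emb)          # (batch, target_dim, frames)

encoder = Encoder()
# Hypothetical frame-aligned targets, e.g. waveform, log-spectrum, MFCCs.
workers = nn.ModuleList([Worker(100, d) for d in (1, 257, 20)])
opt = torch.optim.Adam(list(encoder.parameters()) + list(workers.parameters()))

def train_step(wav, targets):
    """One joint update: every worker's loss flows back into the shared encoder."""
    emb = encoder(wav)
    loss = sum(F.mse_loss(w(emb), t) for w, t in zip(workers, targets))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The key design point is that the encoder receives gradients from all workers at once, so it cannot overfit to any single task; the workers themselves are deliberately small and are discarded after pre-training.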

ResearchMatch Project

I’m currently co-leading a cross-disciplinary, peer-reviewed ResearchMatch project funded by the McGill Initiative in Computational Medicine. The project focuses on the assessment of speech and language disorders via automatic speech recognition. The team includes both speech recognition experts and researchers in speech disorder assessment.

https://www.mcgill.ca/micm/programs/micm-researchmatch-0/researchmatch-v2-results

Past projects

PyTorch-Kaldi

PyTorch-Kaldi is a project for developing state-of-the-art DNN/RNN hybrid speech recognition systems. The DNN part is managed by PyTorch, while feature extraction, label computation, and decoding are performed with the Kaldi toolkit.

https://github.com/mravanelli/pytorch-kaldi/
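
As a rough illustration of this division of labor, here is a minimal PyTorch sketch of the hybrid recipe: a feed-forward acoustic model is trained on Kaldi-extracted features against Kaldi forced-alignment state labels, and its prior-normalized log-posteriors are then handed back to Kaldi for decoding. The dimensions, hyperparameters, and function names below are illustrative assumptions, not the toolkit's actual code.

```python
# Sketch of the DNN side of a hybrid DNN/HMM system. Feature extraction,
# alignments, and decoding are assumed to come from Kaldi; all sizes and
# names here are illustrative.
import torch
import torch.nn as nn

N_FEATS, N_STATES = 440, 3480  # e.g., 40-dim fbanks with context; tied triphone states

class AcousticModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(N_FEATS, 1024), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(1024, 1024), nn.ReLU(), nn.Dropout(0.15),
            nn.Linear(1024, N_STATES),
        )
    def forward(self, feats):
        return self.mlp(feats)  # unnormalized scores over HMM states

model = AcousticModel()
loss_fn = nn.CrossEntropyLoss()  # targets: Kaldi forced-alignment state indices
opt = torch.optim.SGD(model.parameters(), lr=0.08)

def train_step(feats, align):
    """feats: (frames, N_FEATS) from Kaldi; align: (frames,) state labels."""
    opt.zero_grad()
    loss = loss_fn(model(feats), align)
    loss.backward()
    opt.step()
    return loss.item()

@torch.no_grad()
def pseudo_loglik(feats, log_priors):
    """Posterior / prior (subtracted in the log domain): these scaled
    likelihoods are what a Kaldi decoder such as latgen-faster-mapped
    consumes in place of GMM likelihoods."""
    return torch.log_softmax(model(feats), dim=-1) - log_priors
```

Dividing the posteriors by the state priors is the standard hybrid trick: the decoder expects likelihoods p(x|s), while the network estimates posteriors p(s|x), and Bayes' rule relates the two up to a constant.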


DIRHA

The DIRHA project addresses the development of voice-enabled automated home environments based on distant-speech interaction in different languages.

A distributed microphone network is installed in the rooms of a house to selectively monitor the acoustic and speech activities observable in each space.

The targeted system analyzes the resulting multi-room acoustic scene in a coherent way, processing in parallel simultaneous activities that occur in different rooms and, where needed, supporting interaction with users who may speak in different areas of the house.

https://dirha.fbk.eu/

DOMHOS

The purpose of the DomHos project was to bring distant-talking speech recognition technologies into the operating room, a challenging but fascinating scenario. In particular, the first goal was to enable the surgeon to dictate notes during the operation itself. This could be very helpful since, at the end of the surgery, the doctor already has a draft report of the operation, avoiding the tedious task of writing it from scratch.

AURORA (ALADDIN program)

One of the major challenges in multimedia research is quickly and accurately finding events of interest in very large video collections. This requires efficient automated analysis of massive amounts of video that may vary dramatically in quality and composition: similar activities of interest may appear very different, while very different events may share many common elements. ICSI worked on AURORA in collaboration with multiple institutions, led by SRI-Sarnoff. AURORA was funded by IARPA’s ALADDIN program (Automated Low-level Analysis and Description of Diverse INtelligence video), which aimed to combine expertise in video extraction, audio extraction, knowledge representation, and search technologies in a revolutionary way, creating fast, accurate, robust, and extensible technology to support the multimedia analytic needs of the future.

SWORDFISH (BABEL program)

Researchers developed ways to find spoken phrases in audio from multiple languages. The working group, called SWORDFISH, included scientists from ICSI, the University of Washington, Northwestern University, Ohio State University, and Columbia University. The acronym expands to a rough description of the effort: Spoken WOrdsearch with Rapid Development and Frugal Invariant Subword Hierarchies.

The “Rapid Development” aspect made this project unusual for research in speech recognition and related tasks, as the goal was not so much to reduce the error rate as to reduce the time required to build a system for a new language (given a limit on allowable errors). The word “Frugal” is apt for work on a new language, particularly one that has not been significantly studied before (in the machine learning context): the massive resources available for languages such as English are often missing. Consequently, one must be frugal in building detailed models of speech in such languages, using only those representations that are particularly appropriate. One such parsimonious representation would be a hierarchy of phrases, words, and subwords that is maximally invariant across a number of similar languages, so that the available resources can be put to the best use.

Funding was provided by IARPA.