Tutorials
What are the components of a modern speech recognition system?
Building an automatic speech recognition (ASR) system using deep neural networks usually involves training a supervised acoustic model on speech paired with its corresponding text. Performance can be improved with a language model, trained only on relevant text (i.e. text from the target language, domain, dialect, etc.). It can be further boosted by obtaining acoustic features from pre-trained self-supervised models such as Wav2Vec2.
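To make the acoustic-model part concrete, here is a minimal sketch of greedy CTC decoding, a common way to turn the frame-level outputs of a neural acoustic model into text. The vocabulary, the fake frame scores, and the function name are all invented for this illustration; real systems use far larger vocabularies and beam search, often combined with a language model.

```python
import itertools
import numpy as np

VOCAB = ["<blank>", "h", "e", "l", "o"]  # toy vocabulary; index 0 is the CTC blank

def ctc_greedy_decode(log_probs):
    """Greedy CTC decoding: pick the best symbol per frame,
    collapse consecutive repeats, then drop blank symbols."""
    best = log_probs.argmax(axis=-1)                      # best index per frame
    collapsed = [k for k, _ in itertools.groupby(best.tolist())]
    return "".join(VOCAB[i] for i in collapsed if i != 0)

# Fake frame-level scores (8 frames x 5 symbols) standing in for the
# real acoustic model's output.
logits = np.full((8, len(VOCAB)), -10.0)
for t, idx in enumerate([1, 2, 0, 3, 3, 0, 3, 4]):        # h e _ l l _ l o
    logits[t, idx] = 0.0

print(ctc_greedy_decode(logits))  # -> hello
```

Note how the blank symbol lets CTC represent repeated letters: the two "l" regions separated by a blank decode to "ll", while the repeated frames within one region collapse to a single "l".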
How do I build a speech recogniser?
There are many high-quality toolkits for training and decoding ASR systems. Some of them are listed below -
ESPnet - https://github.com/espnet/espnet
SpeechBrain - https://github.com/speechbrain/speechbrain
Hugging Face - https://huggingface.co/
Note that each of them has a specific data preparation format and some toolkit-specific features. You will find installation instructions and training tutorials in the links above.
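As an example of toolkit-specific data preparation, ESPnet follows the Kaldi-style data directory convention: plain-text files such as `wav.scp`, `text`, and `utt2spk` map each utterance ID to its audio path, transcript, and speaker. The sketch below writes such a directory; the utterance IDs, audio paths, and transcripts are made up for illustration.

```python
import os

# Toy corpus: utterance ID -> (audio path, transcript, speaker).
# All IDs and paths here are invented for illustration.
utts = {
    "spk1_utt1": ("audio/spk1_utt1.wav", "hello world", "spk1"),
    "spk2_utt1": ("audio/spk2_utt1.wav", "good morning", "spk2"),
}

data_dir = "data/train"
os.makedirs(data_dir, exist_ok=True)

with open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
     open(os.path.join(data_dir, "text"), "w") as text, \
     open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
    # Kaldi-style files are conventionally sorted by utterance ID.
    for utt_id, (path, transcript, spk) in sorted(utts.items()):
        wav_scp.write(f"{utt_id} {path}\n")
        text.write(f"{utt_id} {transcript}\n")
        utt2spk.write(f"{utt_id} {spk}\n")
```

Other toolkits use different formats (for example, JSON or CSV manifests in SpeechBrain and Hugging Face datasets), so check each toolkit's data preparation tutorial before converting your corpus.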
Where can I learn more about speech recognition systems?
There are many resources on ASR available online. Some of them are listed below -
https://youtu.be/q67z7PTGRi8 - Talk by Dr Preeti Jyothi (IITB)
https://youtu.be/cmy2zf6CuH4 - Talk by AI4Bharat (IITM)
https://youtu.be/DsYDmg72K1k - Course by Dr Shinji Watanabe (CMU)
https://youtu.be/XB7EWu0awSM - Talk by Prof S Umesh (IITM)