Tutorials
What are the components of a modern speech recognition system?
Building an automatic speech recognition (ASR) system using deep neural networks usually involves training a supervised acoustic model on speech paired with its corresponding text. Performance can be improved with a language model, trained only on relevant text (i.e. text from the target language, domain, dialect, etc.). It can be further boosted by obtaining acoustic features from pre-trained self-supervised models such as Wav2Vec2.
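To make the acoustic-model part concrete, here is a minimal sketch of greedy CTC decoding, a common way to turn the frame-level outputs of a neural acoustic model into text. The vocabulary, the fake frame scores, and the function name are all invented for this illustration; real systems use far larger vocabularies and beam search, often combined with a language model.

```python
import itertools
import numpy as np

VOCAB = ["<blank>", "h", "e", "l", "o"]  # toy vocabulary; index 0 is the CTC blank

def ctc_greedy_decode(log_probs):
    """Greedy CTC decoding: pick the best symbol per frame,
    collapse consecutive repeats, then drop blank symbols."""
    best = log_probs.argmax(axis=-1)                      # best index per frame
    collapsed = [k for k, _ in itertools.groupby(best.tolist())]
    return "".join(VOCAB[i] for i in collapsed if i != 0)

# Fake frame-level scores (8 frames x 5 symbols) standing in for the
# real acoustic model's output.
logits = np.full((8, len(VOCAB)), -10.0)
for t, idx in enumerate([1, 2, 0, 3, 3, 0, 3, 4]):        # h e _ l l _ l o
    logits[t, idx] = 0.0

print(ctc_greedy_decode(logits))  # -> hello
```

Note how the blank symbol lets CTC represent repeated letters: the two "l" regions separated by a blank decode to "ll", while the repeated frames within one region collapse to a single "l".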
How do I build a speech recogniser?
There are many high-quality toolkits for training and decoding ASR systems. Some of them are listed below -
ESPnet - https://github.com/espnet/espnet
SpeechBrain - https://github.com/speechbrain/speechbrain
Hugging Face - https://huggingface.co/
Note that each of them has a specific data preparation format and some toolkit-specific features. You will find installation instructions and training tutorials in the links above.
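As an example of toolkit-specific data preparation, ESPnet follows the Kaldi-style data directory convention: plain-text files such as `wav.scp`, `text`, and `utt2spk` map each utterance ID to its audio path, transcript, and speaker. The sketch below writes such a directory; the utterance IDs, audio paths, and transcripts are made up for illustration.

```python
import os

# Toy corpus: utterance ID -> (audio path, transcript, speaker).
# All IDs and paths here are invented for illustration.
utts = {
    "spk1_utt1": ("audio/spk1_utt1.wav", "hello world", "spk1"),
    "spk2_utt1": ("audio/spk2_utt1.wav", "good morning", "spk2"),
}

data_dir = "data/train"
os.makedirs(data_dir, exist_ok=True)

with open(os.path.join(data_dir, "wav.scp"), "w") as wav_scp, \
     open(os.path.join(data_dir, "text"), "w") as text, \
     open(os.path.join(data_dir, "utt2spk"), "w") as utt2spk:
    # Kaldi-style files are conventionally sorted by utterance ID.
    for utt_id, (path, transcript, spk) in sorted(utts.items()):
        wav_scp.write(f"{utt_id} {path}\n")
        text.write(f"{utt_id} {transcript}\n")
        utt2spk.write(f"{utt_id} {spk}\n")
```

Other toolkits use different formats (for example, JSON or CSV manifests in SpeechBrain and Hugging Face datasets), so check each toolkit's data preparation tutorial before converting your corpus.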
Where can I learn more about speech recognition systems?
There are many resources on ASR available online. Some of them are listed below -
https://youtu.be/q67z7PTGRi8 - Talk by Dr Preeti Jyothi (IITB)
https://youtu.be/cmy2zf6CuH4 - Talk by AI4Bharat (IITM)
https://youtu.be/DsYDmg72K1k - Course by Dr Shinji Watanabe (CMU)
https://youtu.be/XB7EWu0awSM - Talk by Prof S Umesh (IITM)