Please follow the latest work here: https://banglanlp.org

Work related to Bangla Language:

Bangla Word-embedding model

Joint work: https://github.com/cogniinsight/Word-embedding-model-for-Bangla

Led the team of the Bangla Text to Speech project. This project aims to develop a Bangla TTS system using the open source Festival engine developed by CSTR and the Festvox tools of the CMU speech group. The first version was publicly released on 19 February 2009 and is available for download. The project was named the most innovative project at BASIS SoftExpo 2010 (http://www.softexpo.com.bd/about.php) and received a special award at the National E-Content and ICT4D Award 2010 (http://www.eaward.org.bd/). Developed the first audio version of the Bangla newspaper “Prothom-alo”.

See the news coverage at http://www.thedailystar.net/newDesign/news-details.php?nid=153325

This project involves the development of the following components:

  • Phoneme inventory: Acoustic analysis was performed to identify the total number of phonemes in the Bangla language. A small speech database was developed for acoustic analysis of the Bangla phoneme inventory.

  • Text normalization: Two rule-based text normalization tools were developed, one in Java and one in Scheme.

  • Letter to Sound: A rule-based LTS system was also developed to handle unknown words and to build the pronunciation lexicon.

  • Pronunciation lexicon: Developed a pronunciation lexicon both manually and with the automatic LTS system. The lexicon contains 92K entries.

  • Intonation Modeling: Some work was done on labeling the speech corpus in order to develop an intonation model.

  • Diphone database for TTS: Developed a diphone database consisting of 4355 diphones. This included designing nonsense sentences from the diphone list, recording by a professional speaker, splitting, and labeling.

  • Speech Corpus: Developed a speech corpus for TTS, which may also be applicable to ASR. Text was collected from various domains, such as news, stories, law, and history. The corpus has around 100K words, 18K unique words, and 10K sentences. A professional studio and a professional speaker were hired for recording. Labeling was done at the sentence level. http://sourceforge.net/projects/blp/files/Speech_Corpora/
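To illustrate the rule-based Letter to Sound idea from the list above, here is a minimal Python sketch of greedy longest-match rewriting from letters to phoneme symbols. The rule table, the phoneme symbols, and the function names are assumptions made for illustration only; they are not the project's rules, which were written for the Festival/Festvox toolchain and must handle far more than this toy does (for example, inherent vowels and conjuncts).

```python
# Toy letter-to-sound rules. These entries are illustrative assumptions,
# not the rule set used in the project.
TOY_RULES = {
    "ব": "b",
    "ল": "l",
    "ক": "k",
    "ম": "m",
    "া": "a",
    "ি": "i",
    "ং": "ng",
}


def letters_to_sounds(word, rules=TOY_RULES):
    """Rewrite a word into phoneme symbols using greedy longest-match rules."""
    max_len = max(len(key) for key in rules)
    phonemes = []
    i = 0
    while i < len(word):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(word) - i), 0, -1):
            chunk = word[i:i + size]
            if chunk in rules:
                phonemes.append(rules[chunk])
                i += size
                break
        else:
            # No rule matched: keep the character so the gap stays visible.
            phonemes.append(word[i])
            i += 1
    return phonemes


def build_lexicon(words, rules=TOY_RULES):
    """Map each word to its rule-derived pronunciation."""
    return {word: letters_to_sounds(word, rules) for word in words}


if __name__ == "__main__":
    print(build_lexicon(["বাংলা"]))
```

Entries produced this way would normally be checked by hand before being added to a lexicon, which matches the mix of manual and automatic work described in the list.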

Related publications:

See here


ASR

Speech Recognition System for Bangla: Was involved in the development of a domain-dependent ASR prototype for an agro-based information system using the Sphinx framework.

Others

  • CRBLP Converter: CRBLPConverter is a software package that converts various TTF-encoded Bangla documents to Unicode encoding. Several ASCII fonts were reengineered to design the main engine. Involved in the font reengineering process.

  • Corpus Analysis & Corpus Collection: Developed a tool for extensive corpus analysis of word frequency distribution. This tool was used to analyze the corpus for regularities and anomalies in Bangla words. Contributed to the corpus collection and tool development processes.

  • Localized URL: Designed and developed localized URL data for the Bangla language, including the domain name, character set, gTLD, and ccTLD.

  • Terminology system: Involved in the development of a tool to search and modify existing terminology and to add new terminology.

  • CLDR: Contributed to this project from the start of 2008 to 2010. Responsibilities included submitting and moderating CLDR data.

  • Microsoft Windows Vista & Office-2007 Localization Project: Involved in the Microsoft localization project, which included the translation of Windows Vista and Microsoft Office 2007.
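As a rough illustration of the word frequency analysis mentioned under Corpus Analysis & Corpus Collection, the following Python sketch counts word frequencies and summarizes how many distinct words occur once, twice, and so on. The tokenization (whitespace splitting plus stripping a small punctuation set that includes the Bangla danda) and all names are assumptions for illustration, not the design of the original tool.

```python
from collections import Counter

# Punctuation stripped from token edges. This set is an assumption,
# chosen to include the Bangla danda used as a sentence terminator.
PUNCT = "।,;:!?\"'()-."


def word_frequencies(text):
    """Count how often each whitespace-separated word occurs."""
    tokens = (raw.strip(PUNCT) for raw in text.split())
    return Counter(token for token in tokens if token)


def frequency_spectrum(freqs):
    """Map each frequency value to the number of distinct words that have it."""
    return Counter(freqs.values())


if __name__ == "__main__":
    sample = "আমি বাংলা বলি। আমি বাংলা লিখি।"
    freqs = word_frequencies(sample)
    for word, count in freqs.most_common():
        print(word, count)
    print(dict(frequency_spectrum(freqs)))
```

Words that occur only once are a common starting point when looking for anomalies such as misspellings, although many of them are simply rare but valid words.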