Prosodically Guided Phonetic Engine Development

  • Phonetic Engine:

The definition of a phonetic engine is still evolving. Roughly, it can be thought of as a machine that represents all the information present in speech so that the speech (or at least its message) can be reproduced. Developing a phonetic engine therefore requires an understanding of how information is extracted from speech.

  • Design of Phonetic Engine:

In the first model, the phonetic engine is organized into tiers, viz., phonetic transcription, syllabification, pitch marking (pitch index) and break marking (break index).

The phonetic transcription tier consists of International Phonetic Alphabet (IPA) symbols along with a few diacritic marks. IPA symbols are essentially defined by articulatory features, so the phonetic transcription reflects speech production information.

The syllabification tier consists of syllables formed from the phonetic transcription, time-aligned with the waveform.

The pitch marking and break marking tiers consist mostly of suprasegmental information. Prosodic breaks and pitch marks carry important information, but they are very difficult to mark at absolute positions, so they are marked relative to the adjacent context.

Each tier can be designed either independently or jointly.

Phonetic transcription: Context-independent monophones are used, with each IPA symbol modeled by a 3-state hidden Markov model (HMM). A flat-start approach is employed since no phone boundaries are available.
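The flat-start idea can be illustrated with a minimal sketch. It assumes Gaussian state distributions with diagonal covariance and a left-to-right topology; the function names, the self-loop probability of 0.6, the 13-dimensional random features standing in for real acoustic frames, and the three example phone symbols are all hypothetical choices for illustration, not the project's actual setup.

```python
import numpy as np

def left_to_right_transitions(n_states=3, self_loop=0.6):
    """Transition matrix of a left-to-right HMM.

    The extra last column is the probability of leaving the model.
    The self-loop value is an assumed starting point, to be re-estimated.
    """
    A = np.zeros((n_states, n_states + 1))
    for s in range(n_states):
        A[s, s] = self_loop            # stay in the same state
        A[s, s + 1] = 1.0 - self_loop  # move to the next state (or exit)
    return A

def flat_start(features, phone_symbols, n_states=3):
    """Initialize every state of every phone model identically.

    Because no phone boundaries are known, each state starts from the
    global mean and variance computed over all training frames.
    """
    global_mean = features.mean(axis=0)
    global_var = features.var(axis=0)
    models = {}
    for phone in phone_symbols:
        models[phone] = {
            "means": np.tile(global_mean, (n_states, 1)),
            "variances": np.tile(global_var, (n_states, 1)),
            "transitions": left_to_right_transitions(n_states),
        }
    return models

# Random frames stand in for real acoustic features (assumption).
rng = np.random.default_rng(0)
features = rng.normal(size=(500, 13))
models = flat_start(features, ["a", "k", "m"])
print(models["a"]["means"].shape)  # one mean vector per state
```

After such an initialization, the models are normally refined by repeated re-estimation (for example, embedded Baum-Welch training), during which the states of different phones drift apart and implicit phone boundaries emerge.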

Syllabification: Syllable segmentation can be performed using spectral transition measures or by locating the minima of the short-term energy profile.
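The energy-minima option can be sketched as follows. This is only an illustration under stated assumptions: the 20 ms frame, 10 ms hop, 5-frame moving-average smoothing, and the synthetic two-burst test signal are hypothetical choices, and the function names are not taken from the project.

```python
import numpy as np

def short_term_energy(signal, frame_len, hop):
    """Sum of squared samples in each (overlapping) frame."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.array([
        np.sum(signal[i * hop : i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])

def smooth(x, width=5):
    """Moving average that suppresses small ripples in the contour."""
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")

def energy_minima(energy):
    """Indices of frames that are local minima of the energy contour."""
    return [
        i for i in range(1, len(energy) - 1)
        if energy[i] < energy[i - 1] and energy[i] <= energy[i + 1]
    ]

# Synthetic stand-in for two syllables: a 200 Hz tone whose amplitude
# rises and falls twice within one second, with a dip near 0.5 s.
fs = 8000
t = np.arange(fs) / fs
signal = np.abs(np.sin(2 * np.pi * t)) * np.sin(2 * np.pi * 200 * t)

frame_len, hop = 160, 80  # 20 ms frames, 10 ms hop (assumed values)
energy = smooth(short_term_energy(signal, frame_len, hop))
boundaries = energy_minima(energy)

# Convert frame indices to the time of each frame centre, in seconds.
boundary_times = [(i * hop + frame_len / 2) / fs for i in boundaries]
print(boundary_times)
```

On real speech the contour is far noisier than in this toy signal, so a practical system would typically add a minimum-depth or minimum-duration constraint before accepting a minimum as a syllable boundary.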

Pitch and break marking: These tiers are based on source characteristics, so F0 (fundamental frequency) estimation is used to design them.
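The text does not say which F0 estimator is used. As one common possibility, the sketch below estimates F0 for a single voiced frame by picking the strongest autocorrelation peak within a plausible pitch range. The function name, the 75 to 400 Hz search range, and the synthetic test frame are assumptions made for illustration.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=75.0, f0_max=400.0):
    """Estimate F0 (Hz) of one voiced frame from its autocorrelation."""
    frame = frame - np.mean(frame)
    # Keep only non-negative lags of the full autocorrelation.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(fs / f0_max)  # shortest pitch period considered
    lag_max = int(fs / f0_min)  # longest pitch period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max + 1])
    return fs / lag

# Synthetic voiced frame: 200 Hz fundamental plus a weaker second harmonic.
fs = 16000
n = np.arange(int(0.04 * fs))  # 40 ms frame
frame = np.sin(2 * np.pi * 200 * n / fs) + 0.5 * np.sin(2 * np.pi * 400 * n / fs)

f0 = estimate_f0(frame, fs)
print(round(f0, 1))  # expected to be close to 200 Hz
```

Because pitch and break marks are assigned relative to the adjacent context, a frame-level F0 contour like this would usually be compared across neighbouring syllables or phrases rather than interpreted as absolute values.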

  • Search Engine:

The search engine (audio search) performs the task of identifying spoken information within a spoken database. The spoken database contains many audio files, and the search engine looks for a file that is likely to contain the query.

  • Data Collection:

The DA-IICT team has collected speech data and other relevant metadata in two Indian languages, viz., Gujarati and Marathi. These two languages are spoken mostly in two states of India, i.e., Gujarat and Maharashtra, respectively. The data is recorded in three different modes, viz., read, spontaneous and lecture modes. Different dialectal zones of Gujarat and Maharashtra are considered during the data collection phase.

  • Corpora Development:

In both languages, a total of 10 hours of data has been transcribed at the phonetic level and syllabified. In addition, 1.5 hours of data have been marked with prosodic labels.

  • Team:

Instructor - Prof. Hemant A. Patil

Staff - Maulik C. Madhavi, Ankur Undhad, Shubham Sharma, Vibha Prajapati

Consultants - Rinni Pandya, Bhaveshri Parmar, Krupa Barot, Gayatri Prajapati, Maulik Patel, Roma Zala

Past staff - Kewal Malde, Bhavik Vachhani