Term Projects

Term projects are small, independent research projects that result in a short "publication" and a "presentation". Teachers will propose topics, but students can also suggest suitable topics of their own. This page lists old term projects for inspiration; current topics will be made available in class. This list will be updated for the 2016 semester!

2010 Term Projects

Computer vision to improve ASR


Speech enabled educational cellphone games


Synthesis of speech in various personalities

Speech conveys a wide range of side information, such as age, gender, and emotion of a speaker; parts of a speaker's personality are also conveyed by speech, or at least this is what a listener believes. We have a database of speech which is annotated with personality impressions and will investigate which parts of it can be combined to produce speech with various "personalities".

Specifics of dialectal speech in Arabic

Arabic consists of many dialects. On clean dialectal speech, it is possible to investigate the specifics of various pronunciations statistically and automatically, to see which pronunciation differences are detrimental to speech recognition and which are merely observable. The project would both analyze these differences and build dialect-specific recognizers.

Speaker diarization [Open-set speaker identification] (closed)

Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It consists of two steps: speaker segmentation and speaker clustering. Most speaker diarization systems operate without specific a priori knowledge of the speakers or their number in the audio. However, they generally need specific tuning and parameter training for different audio types, such as broadcast shows, meetings, and conversations. This project is about exploring new approaches to make a diarization system more robust and portable.

Hash-based language model (closed)

Modern speech recognition and machine translation approaches often require very large language models, which cannot be held in RAM and must be kept on disk. For these, accessing the right memory location, and accessing it only once, is critical because of the long access time. A hash-based approach should improve performance and speed up speech recognition significantly.
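
As a rough illustration of the "access it only once" idea, the sketch below hashes each n-gram to a fixed-size bucket in a flat table, so that a lookup costs one seek and one read. The record layout (an 8-byte fingerprint plus a 4-byte log-probability), the bucket size, and all function names are assumptions made for this example, not part of the project description.

```python
import hashlib
import io
import struct

SLOT = struct.Struct("<Qf")  # assumed layout: 8-byte fingerprint, 4-byte log-prob
SLOTS_PER_BUCKET = 4         # assumed bucket size
EMPTY = 0                    # fingerprint value reserved for unused slots
BUCKET_SIZE = SLOT.size * SLOTS_PER_BUCKET


def fingerprint(ngram):
    """Stable 64-bit hash of an n-gram (tuple of words); never returns EMPTY."""
    data = " ".join(ngram).encode("utf-8")
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "little") or 1


def build_table(entries, num_buckets, out):
    """Write a fixed-size bucketed hash table to the binary file object `out`."""
    table = bytearray(BUCKET_SIZE * num_buckets)
    for ngram, logprob in entries.items():
        fp = fingerprint(ngram)
        base = (fp % num_buckets) * BUCKET_SIZE
        for i in range(SLOTS_PER_BUCKET):
            offset = base + i * SLOT.size
            if SLOT.unpack_from(table, offset)[0] == EMPTY:
                SLOT.pack_into(table, offset, fp, logprob)
                break
        else:
            raise ValueError("bucket overflow: use more buckets")
    out.write(table)


def lookup(f, num_buckets, ngram):
    """Return the stored log-probability or None, using one seek and one read."""
    fp = fingerprint(ngram)
    f.seek((fp % num_buckets) * BUCKET_SIZE)
    bucket = f.read(BUCKET_SIZE)
    for i in range(SLOTS_PER_BUCKET):
        stored_fp, logprob = SLOT.unpack_from(bucket, i * SLOT.size)
        if stored_fp == fp:
            return logprob
    return None


if __name__ == "__main__":
    # An in-memory buffer stands in for the on-disk file in this toy run.
    buf = io.BytesIO()
    build_table({("the", "cat"): -1.5, ("cat", "sat"): -2.25}, 8, buf)
    print(lookup(buf, 8, ("the", "cat")))
```

Storing only a fingerprint keeps records small, at the cost of a very small chance that an unseen n-gram matches a stored one; whether that trade-off is acceptable would be part of the project.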

Improving a language model for oral reading

In project "LISTEN", children read a written story out loud, which is then recognized by a speech recognizer to assess their pronunciation. The accuracy of the recognizer critically depends on the quality of the language model provided. As the children read a known text, we want to assess possibilities to "narrow down" the language model, for example by generating a "lattice" of possible variations or mistakes children are likely to make when reading, which is then used as a language model.
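
One possible reading of such a "lattice" is a small graph over the known text with extra arcs for common reading mistakes. The sketch below is only an illustration under that assumption: the two mistake types (skipping a word, repeating a word), the arc format, and the names EPS and build_reading_lattice are invented here and are not taken from project LISTEN.

```python
EPS = "<eps>"  # hypothetical label for an arc that consumes no spoken word


def build_reading_lattice(words):
    """Return arcs (source_state, target_state, label) for a known text.

    State i means "the first i words of the text have been dealt with".
    """
    arcs = []
    for i, word in enumerate(words):
        arcs.append((i, i + 1, word))      # word read as written
        arcs.append((i, i + 1, EPS))       # word skipped
        arcs.append((i + 1, i + 1, word))  # word said again (repetition)
    return arcs


if __name__ == "__main__":
    for arc in build_reading_lattice(["the", "cat", "sat"]):
        print(arc)
```

A real system would also weight the arcs, so that the correct reading stays more likely than any of the mistakes.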

Acoustic scene analysis of YouTube videos (closed)

YouTube videos cover essentially every topic and situation known to man. Speech recognition has proven to be very hard on this data, as it contains multiple speakers, bad microphones, background music or noises, or sometimes no sound at all, or a "sound-track" different from the original audio. Therefore, "auditory scene analysis" does not try to perform full speech recognition, but rather tries to identify typical acoustic scenes, such as "outdoors", "in a car", "music playing", etc. We have a database of these videos and would like to compare several approaches.

Meeting Summarization (closed)


Accurate Speaker Identification through Audio Delay


Dialog Systems (closed)


Further Project Ideas

Low latency ASR

Standalone automatic speech recognition is usually performed in "batch" mode, i.e. a whole utterance is processed in one step. For many applications, incremental processing is desirable and technically possible, e.g. for a speech-to-speech translation system, or the sub-titling of speeches. Here, suitable chunks (partial hypotheses) can be processed, e.g. translated or displayed. Mostly because of long-term language models, the best (partial) hypothesis at time T might change as soon as all data available at time T+t (t>0) has been processed. At this point, any processing of the best hypothesis available at time T has become invalid. The project would consist of an analysis of how many hypotheses become invalidated as t increases, and the development of predictors which would allow t to be extended beyond the current value.
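
The analysis part could start from a measurement like the one sketched below: given the best partial hypothesis at each time step, count how often the hypothesis at time T is no longer a prefix of the hypothesis at time T+t. The prefix criterion, the function names, and the toy word sequences are assumptions for illustration only.

```python
def is_prefix(earlier, later):
    """True if the hypothesis `earlier` survives unchanged at the start of `later`."""
    return later[:len(earlier)] == earlier


def invalidation_rate(partials, lag):
    """Fraction of times T whose best hypothesis is no longer a prefix at T + lag."""
    if lag <= 0 or lag >= len(partials):
        raise ValueError("lag must satisfy 0 < lag < len(partials)")
    pairs = [(partials[T], partials[T + lag]) for T in range(len(partials) - lag)]
    invalid = sum(1 for earlier, later in pairs if not is_prefix(earlier, later))
    return invalid / len(pairs)


# Toy sequence of best partial hypotheses, one per time step (invented data).
PARTIALS = [
    ["I"],
    ["I", "scream"],
    ["ice", "cream"],
    ["ice", "cream", "is"],
    ["ice", "cream", "is", "cold"],
]

if __name__ == "__main__":
    for lag in (1, 2, 3):
        print(lag, invalidation_rate(PARTIALS, lag))
```

On this toy data the rate grows with the lag (0.25 for a lag of 1, 1.0 for a lag of 3), which is the kind of curve the project would measure on real recognizer output.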

Durational entropy as a predictor of word accuracy

Duration modeling so far has not helped to improve the accuracy of HMM-based automatic speech recognition. Preliminary experiments on a number task seem to indicate that the concept of "durational entropy" can help to identify words that have been recognized incorrectly, or to predict errors. This study would evaluate this concept on a large vocabulary task and see which other factors can contribute to identifying the parts of an utterance which a recognizer did not recognize correctly. This "self-awareness" is an increasingly important part of the overall speech recognition process.

Multi-core decoding of speech

A decoder is the "core" component of a speech recognition system, which brings together the acoustic model and the language model. InterACT has an implementation where the acoustic model computation is already distributed across a multi-core processor, which yields good speedups. In the next step, the language model access should also be distributed across multiple cores, so that the decoding can fully utilize modern processors.

Dialect identification for Arabic speech recognition

Arabic speech recognition systems need to handle different dialects. In order to run a dialect-dependent speech recognition system, we first need to identify the dialect of the input speech. This project is about exploring various techniques to recognize five Arabic dialects.

Open-set speaker identification

Speaker identification is the process of determining the correct speaker of a given utterance from a group of registered speakers. If this process includes the option of declaring that the utterance does not belong to any of the registered speakers, then it is referred to as open-set speaker identification. This project is about exploring different modeling approaches to improve the accuracy and robustness of open-set speaker identification.
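
The open-set decision can be pictured as a closed-set search followed by a rejection test. The sketch below assumes that each registered speaker and each utterance is represented by a fixed-length vector, and that cosine similarity with a fixed threshold is used for scoring; these choices, the threshold value, and the function names are illustrative assumptions, not a prescribed method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(utterance, registered, threshold=0.8):
    """Return the best-matching registered speaker, or None if no score is high enough."""
    best_name, best_score = None, -1.0
    for name, model in registered.items():
        score = cosine(utterance, model)
        if score > best_score:
            best_name, best_score = name, score
    # Open-set step: reject the utterance if even the best match is weak.
    return best_name if best_score >= threshold else None


if __name__ == "__main__":
    speakers = {"alice": [1.0, 0.0], "bob": [0.0, 1.0]}  # toy speaker models
    print(identify([0.9, 0.1], speakers))  # close to alice's model
    print(identify([1.0, 1.0], speakers))  # equally far from both models
```

Choosing the threshold trades false acceptances of unknown speakers against false rejections of registered ones.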

Automatic generation of in-domain vocabulary

In the early stages of development, domain-specific speech recognition suffers from a number of problems, because the vocabulary of the domain is not known. If one recognizes sentences from the domain with a domain-independent recognizer, some words can be found without manual intervention, but not all. This project would develop a tool that uses the domain-independent recognizer (or a domain-dependent recognizer) to transcribe the new data and mark the parts which need to be transcribed manually, in order to uncover new, missing words.

Pronunciation scoring using articulatory features

"Articulatory features" describe speech sounds not in terms of phones (e.g. [b]), but as, for example, a "voiced labial plosive". Using this description, typical pronunciation mistakes can be expressed not as "you confuse /b/ with /p/", but by saying "you're getting the voicing of your plosives wrong". This project would investigate whether such a system could be built, and what advantages it would have.
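
The difference between the two kinds of feedback can be made concrete with a small feature table. The sketch below is a toy example: the four-phone inventory, the three feature dimensions, and the function name feature_errors are assumptions chosen for illustration.

```python
# Toy feature table for four plosives (illustrative, not a complete inventory).
FEATURES = {
    "b": {"voicing": "voiced", "place": "labial", "manner": "plosive"},
    "p": {"voicing": "voiceless", "place": "labial", "manner": "plosive"},
    "d": {"voicing": "voiced", "place": "alveolar", "manner": "plosive"},
    "t": {"voicing": "voiceless", "place": "alveolar", "manner": "plosive"},
}


def feature_errors(target, produced):
    """Map each mismatched feature dimension to (expected, actual) values."""
    expected, actual = FEATURES[target], FEATURES[produced]
    return {
        dim: (expected[dim], actual[dim])
        for dim in expected
        if expected[dim] != actual[dim]
    }


if __name__ == "__main__":
    # A learner says [p] where [b] was expected: only the voicing is wrong.
    print(feature_errors("b", "p"))
```

Aggregating such mismatches over many words would show whether a learner has a systematic problem with one feature, such as voicing, rather than with individual phones.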