Research topics (in English)

Almost all humans use spoken dialog, the most natural method of communication. If computers can recognize, manage, and synthesize speech, speech becomes not only the best method of communication but also a usable data storage medium. I am engaged in spoken language technologies.

A demonstration video of our recent system has been published!

Demonstration video (Japanese)

Technology introduction (English)

Large vocabulary continuous speech recognition

Transcribing monologues such as lectures is a very promising research area. We improve acoustic modeling of the human voice using models such as the Hidden Markov Model (HMM) and Deep Neural Networks (DNNs), as well as statistical language modeling (N-grams). We also improve the decoding algorithm.
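To illustrate the statistical language modeling mentioned above, here is a minimal sketch of a bigram (2-gram) model with add-alpha smoothing. This is a toy example for illustration only, not our actual recognizer; all function and variable names are hypothetical.

```python
from collections import defaultdict

def train_bigram(sentences):
    """Count bigram and unigram occurrences from tokenized sentences.

    Each sentence is padded with <s> / </s> boundary markers so that
    word probabilities at sentence edges are also modeled.
    """
    bigram = defaultdict(int)
    unigram = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram[(w1, w2)] += 1
            unigram[w1] += 1
    return bigram, unigram

def bigram_prob(bigram, unigram, w1, w2, vocab_size, alpha=1.0):
    """Add-alpha smoothed conditional probability P(w2 | w1).

    Smoothing keeps unseen word pairs from getting zero probability,
    which would otherwise rule out valid hypotheses during decoding.
    """
    return (bigram[(w1, w2)] + alpha) / (unigram[w1] + alpha * vocab_size)
```

In a real recognizer the decoder combines such language-model scores with acoustic-model scores to rank candidate transcriptions.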

Machine learning technologies have progressed dramatically in recent years, so we adopt such techniques in speech recognition to develop DNN-based end-to-end speech recognizers.

Elderly speech recognition

Elderly people tend not to be good at operating information devices, so speech recognition and spoken-dialog interfaces can benefit them. Research on elderly speech recognition, however, has not progressed far. We are developing a speech database of elderly people for constructing speech recognizers for the elderly. We also study how to use the database effectively and efficiently to model the acoustics of elderly speech.

Noisy speech recognition

Degradation of speech recognition performance in noise is problematic in practical speech systems. Standard evaluation frameworks for noisy speech recognition are very useful for comparing noise reduction methods. I am the leader of the developers' group for the standardized evaluation framework series (CENSREC), which contains data, recognition tools, and evaluation tools; the frameworks are distributed freely to the public.

ETSI AURORA

SLP-WG(CENSREC/AURORA-J)

Spoken dialog interface (1) - for a friendly interaction -

Novice users' first impression of a spoken dialog system is that it is unnatural, because the time lag between a human utterance and the system reply is too long, so the user cannot tell whether the system is working. This is one reason users do not feel that spoken dialog systems can be used in a comfortable, friendly manner. Thus, we focus on prosodic features such as timing and pitch change in a dialog. Our dialog system speaks with appropriate prosodic features based on previous user utterances: when the dialog gets lively, the pitch of the system utterances follows the user's pitch. We also study semantic dialog strategy, and are now developing a robust and natural response generation method in which the system considers its own misunderstandings.

YouTube (sorry, in Japanese)
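Following the user's pitch, as described above, first requires estimating the fundamental frequency (F0) of the user's voice. As a minimal sketch (not our actual system), here is a toy autocorrelation-based F0 estimator; all names and parameter values are hypothetical.

```python
import math

def estimate_f0(samples, sample_rate, fmin=80.0, fmax=400.0):
    """Estimate fundamental frequency of a short voiced frame.

    Picks the lag with the largest autocorrelation within the
    plausible voice-pitch range [fmin, fmax] and converts it to Hz.
    """
    n = len(samples)
    lag_min = int(sample_rate / fmax)   # shortest period considered
    lag_max = int(sample_rate / fmin)   # longest period considered
    best_lag, best_corr = lag_min, float("-inf")
    for lag in range(lag_min, min(lag_max, n - 1) + 1):
        corr = sum(samples[i] * samples[i + lag] for i in range(n - lag))
        if corr > best_corr:
            best_corr, best_lag = corr, lag
    return sample_rate / best_lag
```

A dialog system could then nudge the pitch of its synthesized reply toward the estimated user F0 when the dialog gets lively.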

Spoken dialog interface (2) - automatically responding... -

A system that works only when the user wants it to... Such a system sits silently near the user, but when the user wants to talk to it, it responds naturally and gets to work. To realize such a system, various cues are employed, such as user orientation, changes of speaking style, and the content of user utterances.

YouTube (sorry, in Japanese)

Multimodal interface

For mobile information terminals, we are developing multimodal interfaces using speech and spoken-dialog input. Combinations of speech, touch pen, touch panel, and finger pointing, for example, are promising.

When solving a geometry problem, one uses both hands and voice:

YouTube (sorry, in Japanese)

Multimodal interfaces can be applied to autonomous vehicles!

YouTube (Japanese)

YouTube (English)

Cross-media information retrieval

Suppose we can retrieve multimedia information using speech, language, music, and so on. For example, you can use text to search for music. This is cross-media information retrieval. Here is an example of information retrieval going from speech to language to music:

YouTube (sorry, in Japanese)
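One simple way to bridge media, as in the text-to-music search above, is to map both queries and items into a shared feature space and rank by similarity. The sketch below uses a toy tag-count representation and cosine similarity; it is an illustration under that assumption, not our actual retrieval system, and all names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse tag-count dicts."""
    dot = sum(a[t] * b.get(t, 0) for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def search(query_tags, catalog):
    """Rank catalog items by similarity to the query in the shared
    tag space, most similar first."""
    return sorted(catalog,
                  key=lambda item: cosine(query_tags, item["tags"]),
                  reverse=True)
```

In a full cross-media pipeline, a speech recognizer would first convert the spoken query to text, and the text would then be mapped into the shared space to retrieve music.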