Our Focus:
(1) Spoken Language Understanding (e.g., Speech Recognition/Synthesis and Natural Language Processing)
(2) Multimedia Signal Processing (e.g., Speech, Audio, Text, Image and Video Signal Processing)
(3) Machine Learning (e.g., Deep Learning)

Spoken Language Understanding
  • Machine Translation

Every language in Taiwan, with the exception of dominant Mandarin, is endangered. A language is the bearer of a culture: an incredible freight of human knowledge, experience and understanding, of epics, myths, nursery rhymes, proverbs, parables, ritual formulae, jokes, love-songs and dirges. Something must therefore be done before it is too late. As a first step, we are working hard to implement a character-level sequence-to-sequence Chinese-to-Taiwanese text interpreter for Taiwanese speech synthesis.
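As a rough sketch of the character-level front end such an interpreter needs (all names here are hypothetical, not our actual code), the snippet below builds a character vocabulary from a parallel corpus and maps sentences to the integer id sequences a sequence-to-sequence model consumes:

```python
# Minimal character-level preprocessing for a sequence-to-sequence translator:
# build a vocabulary, then map strings to id sequences with <sos>/<eos> markers.
SPECIALS = ["<pad>", "<sos>", "<eos>", "<unk>"]

def build_vocab(sentences):
    """Collect every character seen in the corpus into an id table."""
    chars = sorted({ch for s in sentences for ch in s})
    itos = SPECIALS + chars
    stoi = {ch: i for i, ch in enumerate(itos)}
    return stoi, itos

def encode(sentence, stoi):
    """String -> id sequence, framed by <sos>/<eos> for the decoder."""
    unk = stoi["<unk>"]
    return [stoi["<sos>"]] + [stoi.get(ch, unk) for ch in sentence] + [stoi["<eos>"]]

def decode(ids, itos):
    """Id sequence -> string, dropping the special markers."""
    return "".join(itos[i] for i in ids if itos[i] not in SPECIALS)
```

The encoder/decoder networks themselves would then be trained on these id sequences.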

  • Chatbot

*Demo: English Chatbot, Chinese Chatbot (18+)
Deep neural conversational modeling, especially the sequence-to-sequence method, is a new approach in spoken dialogue system research. To test the idea, we implemented a character-level Chinese gossip chatbot.

  • Spoken Dialog System

*Demo: Voice Assisted Car Navigation System
Speech is the most natural, powerful and universal medium for human-machine communication. Our focus is to develop all the modules a spoken dialog system needs, including robust speech, speaker and language recognition, and natural speech synthesis.

  • Realtime Multilingual ASR

*Demo: Youtube
An Android app with a real-time multilingual (mixed Chinese/English) speech recognition server, built on:
1) deep neural networks, 2) WebSocket/HTTP and 3) GStreamer.

  • Automatic Speech Transcriber

*Demo: Youtube or Google Drive
Automatic Broadcast Radio Subtitle Generation (audio to *.srt) using Deep Neural Networks. This is a real-time multilingual (mixed Chinese/English) speech recognizer. Our aim is to study semi-supervised/unsupervised training.
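The final subtitle-writing step can be sketched as below; the segment format and function names are illustrative assumptions, not our actual pipeline:

```python
# Turn (start_sec, end_sec, text) segments from a recognizer into SubRip (*.srt).
def srt_time(t):
    """Seconds -> 'HH:MM:SS,mmm' as the SubRip format requires."""
    ms = int(round(t * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: iterable of (start_sec, end_sec, text) -> one *.srt string."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)
```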

  • Large-Scale Radio Speech Corpus

The Radio Speech Corpus will consist of more than 1,000 hours of audio recordings provided by Taiwan's National Education Radio (NER) archive. The corpus provides a rich resource for research in speech processing and automatic speech recognition (ASR). It will be publicly released in the near future.

  • Chinese-English MixTTS

*Demo: Google Drive
To approach the goal of establishing an End-to-End speech synthesis system, we propose to use character-level recurrent neural networks (RNNs) to directly convert input character sequences into latent linguistic feature vectors.

  • Automatic Sentiment Information Extraction
Automatic sentiment information extraction from social network articles has many essential applications. Following the valence-arousal space framework, we built a neural network (NN) model that predicts the valence-arousal ratings of Chinese words.
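A hypothetical sketch of the regression idea: a one-hidden-layer network maps a word-embedding vector to two outputs, valence and arousal. The weights below are placeholders; a real model would learn them from a rated affective lexicon.

```python
import math

def forward(x, W1, b1, W2, b2):
    """x: word-embedding vector -> [valence, arousal] predictions."""
    # Hidden layer with tanh nonlinearity.
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    # Linear output layer: one unit for valence, one for arousal.
    return [sum(w * hi for w, hi in zip(row, h)) + b
            for row, b in zip(W2, b2)]
```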

  • Smart Home User Interface
*Demo: SmartTV - Voice and Gesture Control
Smart Home has been a hot research and development (R&D) topic recently. Our focus is to build a spoken dialog-based smart housekeeper and to make speech, speaker and language recognition more robust and useful in daily life.

  • DAISY Digital Talking Book Player

We have developed a speech synthesis-enabled DAISY player. It is small, portable and designed for use by people with "print disabilities", including blindness, impaired vision, and dyslexia.

  • Speech Summarization, Analysis and Organization

Multimedia data (on YouTube, live broadcasts, ...) are increasing dramatically. We therefore need an efficient way to browse, use and archive such speech data through speech/speaker recognition, story summarization and organization.

  • Computer-Assisted Language Learning

Second-language learning is more and more important in today's global village. We would therefore like to build a computer-assisted language learning system that can simultaneously detect pronunciation errors, speech prosody deviations and dialogue act mistakes.

Multimedia Signal Processing

  • Audio Event Tokenizer

To automatically extract speech segments from audio, a deep convolutional neural network is applied. An audio tokenizer then cuts input audio files into speech, music and other segments and outputs *.srt files.
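The segmentation step can be sketched as follows, under the assumption that the classifier emits one label per fixed-hop frame; function and label names are illustrative:

```python
from itertools import groupby

def frames_to_segments(labels, hop_sec):
    """Collapse per-frame class labels into (label, start_sec, end_sec) segments.

    labels: e.g. ["speech", "speech", "music", ...] from the frame classifier.
    hop_sec: frame hop in seconds. Very short runs could further be merged.
    """
    segments, t = [], 0.0
    for label, run in groupby(labels):
        n = len(list(run))
        segments.append((label, t, t + n * hop_sec))
        t += n * hop_sec
    return segments
```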

  • Distant Speech Recognition

The performance of conventional automatic speech recognition (ASR) systems degrades dramatically when the microphone is far from the speakers. To alleviate this issue, we built an eight-element omni-directional electret condenser microphone array and a recurrent neural network-based algorithm to resist the distortions of background noise, overlapping speech and reverberation.
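A minimal delay-and-sum beamforming sketch shows the classic array idea our front end builds on: delay each microphone's signal so the wavefront from the target direction lines up, then average. The per-channel sample delays would come from array geometry (mic spacing, steering angle, sample rate); here they are simply given.

```python
def delay_and_sum(channels, delays):
    """channels: equal-length per-mic sample lists; delays: samples to shift each by."""
    n = len(channels[0])
    out = []
    for i in range(n):
        acc = 0.0
        for ch, d in zip(channels, delays):
            j = i - d
            acc += ch[j] if 0 <= j < n else 0.0  # zero-pad outside the buffer
        out.append(acc / len(channels))          # coherent average of aligned mics
    return out
```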

  • Acoustic Event Detection

Sound event detection is essential for advanced smart-home applications. We would like to build a system that integrates several Kinect One sensors for elderly care, baby monitoring and, especially, home security.

  • Microphone Array

A microphone array is key to the success of mobile phone, Smart-TV and Smart-Home applications. In particular, a good microphone array should not only remove background noise but also allow a speaker to move freely to any position.

  • Speech Enhancement

Speech enhancement and echo cancellation are important for good mobile communication. In particular, they should neither distort speech nor interrupt communication.

          Super Human-Machine Interface

  • Haptic Communication

A haptic man-machine interface aims to augment the physical environment around us through haptic feedback, extending our senses (a sixth sense or superhuman sense).


  • The Oriental Language Recognition 2017 Challenge

Our team ranked [3rd] out of 19 on the overall performance list (average Cavg across all conditions), and [4th], [2nd] and [2nd] out of 19 on the 1-second, 3-second and full-length utterance lists, respectively. Note: submissions with a star after the team name are extended submissions and should not be treated equally with the regular submissions (without a star).


  • IJCNLP-2017 Task 2: Dimensional Sentiment Analysis for Chinese Phrases
Our team NCTU-NTUT ranked [4th] out of 19, with a mean rank of 6.5 over the 23 submitted runs.

  • The 52nd Annual Broadcast Golden Bell Awards
Our team received the "創新研發應用獎" (Technology Innovation Award) at the 2017 Broadcast Golden Bell Awards ceremony for automatic subtitling of broadcast radio programs and interactive learning systems.


  • OC16 Chinese-English MixASR Challenge
Our deep LSTM-based ASR ranked 2nd in overall performance and 1st in English performance among the extended submissions of the OC16 Chinese-English MixASR Challenge.

Machine Learning

  • Reference GPU Server for Machine Learning Research
Our workhorse: a Tyan B7119 4U server with 10× GigaByte GTX 1080 Ti GPUs

  • Language Recognition

Our gated DNN system for the NIST 2015 language recognition i-vector machine learning challenge was designed to solve the language clustering and out-of-set detection issues simultaneously. It achieves a relative performance gain of up to 51% over the baseline cosine distance scoring (CDS) system provided by the organizer.
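For context, the baseline cosine distance scoring idea can be sketched as follows: score a test i-vector against each language's mean i-vector and pick the closest. Plain Python lists stand in for real i-vectors (typically a few hundred dimensions); names are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def classify(test_ivec, language_means):
    """language_means: {language: mean i-vector} -> (best_language, score)."""
    return max(((lang, cosine(test_ivec, m)) for lang, m in language_means.items()),
               key=lambda p: p[1])
```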

  • Speaker Recognition

Example of FA-DNN outputs for speaker recognition: (a) original speaker i-vectors, (b) purified speaker i-vectors.

  • Factor Analysis Neural Networks

Although deep neural networks (DNNs) are very powerful, they can still be easily affected by noise. We have developed a new factor analysis DNN (FA-DNN) structure and training algorithm that can successfully separate wanted signals from noise.

  • Handwriting Recognition

Example of FA-DNN outputs for handwritten digit recognition: (a) original digits, (b) purified digits.

                 Wireless Communication

  • Acoustic Communication/Networking

Sound is a natural, powerful and universal medium for wireless communication. We would therefore like to build an acoustic communication/networking system that transmits messages directly through the air.
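The simplest form of such an acoustic link is binary frequency-shift keying (FSK): each bit becomes a short sine burst at one of two tone frequencies, and a receiver detects which tone is present per symbol window (e.g. with the Goertzel algorithm). The frequencies and symbol length below are assumed values for illustration, not our system's parameters.

```python
import math

RATE = 44100          # samples per second
SYMBOL_SEC = 0.05     # duration of one bit (assumed)
F0, F1 = 4000, 6000   # tone frequencies for bit 0 / bit 1 (assumed)

def modulate(bits):
    """Bit string like '1011' -> list of float samples in [-1, 1]."""
    n = int(RATE * SYMBOL_SEC)
    samples = []
    for bit in bits:
        freq = F1 if bit == "1" else F0
        samples.extend(math.sin(2 * math.pi * freq * i / RATE) for i in range(n))
    return samples
```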

  • Indoor Navigation

Indoor navigation and its applications have become a hot research topic. We want to combine Air-Beacon and internet information retrieval to build an indoor navigation system for Location-Based Services (LBS).

  • Cross-Platform Acoustic Communication

Sound is a universal medium for broadcasting or exchanging messages between different platforms, for example between iOS/Android smart TVs, tablets and mobile phones.