Our Focus:  
(1) Spoken Language Understanding (e.g., Speech Recognition, Speech Synthesis, and Natural Language Processing)
(2) Multimedia Signal Processing (e.g., Speech, Audio, Text, Image, and Video Signal Processing)
(3) Machine Learning (e.g., Deep Learning)

Spoken Language Understanding

  • Spoken Dialog System

Speech is the most natural, powerful, and universal medium for human-machine communication. Our focus is to develop all the modules a Spoken Dialog System needs, including robust speech, speaker, and language recognition and natural speech synthesis.

  • Smart Home User Interface
Smart Home has recently become a hot research and development (R&D) topic. Our focus is to build a spoken-dialog-based Smart Housekeeper and to make speech, speaker, and language recognition more robust and useful in daily life.

  • Chinese-English MixTTS

To approach the goal of an end-to-end speech synthesis system, we propose using character-level recurrent neural networks (RNNs) to directly convert input character sequences into latent linguistic feature vectors.
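The idea above can be sketched with a minimal character-level Elman RNN. This is an illustrative toy, not the lab's actual model: the vocabulary, layer sizes, and random weights are all assumptions, and it shows only how a character sequence maps to one latent vector per character.

```python
import numpy as np

# Hypothetical sketch of a character-level RNN encoder. All names,
# sizes, and the (untrained) random weights are illustrative only.

VOCAB = {ch: i for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}
EMB, HID = 8, 16

rng = np.random.default_rng(0)
W_emb = rng.normal(0, 0.1, (len(VOCAB), EMB))   # character embeddings
W_xh = rng.normal(0, 0.1, (EMB, HID))           # input-to-hidden weights
W_hh = rng.normal(0, 0.1, (HID, HID))           # hidden-to-hidden weights
b_h = np.zeros(HID)

def encode(text):
    """Return one latent feature vector per input character."""
    h = np.zeros(HID)
    latents = []
    for ch in text.lower():
        x = W_emb[VOCAB[ch]]
        h = np.tanh(x @ W_xh + h @ W_hh + b_h)  # Elman recurrence
        latents.append(h)
    return np.stack(latents)

feats = encode("hello world")
print(feats.shape)  # (11, 16): one 16-dim latent vector per character
```

In a full system, a decoder would then map these latent linguistic vectors to acoustic features; here the point is only the character-to-latent conversion.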

  • DAISY Digital Talking Book Player

We have developed a speech synthesis-enabled DAISY player. It is small, portable and designed for use by people with "print disabilities", including blindness, impaired vision, and dyslexia.

  • Speech Summarization, Analysis and Organization

Multimedia data (on YouTube, live broadcasts, ...) are growing dramatically. We therefore need an efficient way to browse, use, and archive such speech data through speech/speaker recognition, story summarization, and organization.

  • Computer-Assisted Language Learning

Second-language learning is increasingly important in today's global village. We would therefore like to build a computer-assisted language learning system that can simultaneously detect pronunciation errors, speech prosody deviations, and dialogue act mistakes.

Multimedia Signal Processing

  • Audio Event Detection for Smart Home

Sound event detection is essential for advanced smart-home applications. We would like to build a system that integrates several Kinect One sensors for elderly care, baby monitoring, and especially home security.

  • Microphone Array

Microphone arrays are key to the success of mobile phone, Smart-TV, and Smart-Home applications. In particular, a good microphone array should not only remove background noise but also allow a speaker to move freely to any position.
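The simplest way a microphone array focuses on a speaker's position is delay-and-sum beamforming: delay each microphone's signal so that sound from the steering direction lines up, then average. The sketch below assumes a uniform linear array with made-up geometry and sample rate; it is not a description of any specific system.

```python
import numpy as np

# Illustrative delay-and-sum beamformer for a uniform linear array.
# FS, SPACING, and N_MICS are assumed example values.

FS = 16000          # sample rate (Hz)
C = 343.0           # speed of sound (m/s)
SPACING = 0.05      # microphone spacing (m)
N_MICS = 4

def delay_and_sum(mic_signals, angle_deg):
    """Steer a uniform linear array toward angle_deg (0 = broadside)."""
    delays = np.arange(N_MICS) * SPACING * np.sin(np.radians(angle_deg)) / C
    out = np.zeros_like(mic_signals[0], dtype=float)
    for sig, d in zip(mic_signals, delays):
        shift = int(round(d * FS))          # integer-sample delay
        out += np.roll(sig, -shift)         # align, then sum
    return out / N_MICS

# A tone arriving from broadside (0 deg) hits every mic in phase,
# so steering to 0 deg reproduces it unchanged.
t = np.arange(1024) / FS
tone = np.sin(2 * np.pi * 440 * t)
mics = [tone.copy() for _ in range(N_MICS)]
beam = delay_and_sum(mics, 0.0)
print(np.allclose(beam, tone))  # True: in-phase signals add coherently
```

Signals arriving from other directions stay misaligned after steering, so they partially cancel in the average, which is what suppresses background noise while following a moving speaker.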

  • Speech Enhancement

Speech enhancement and echo cancellation are essential for good mobile communication. In particular, they should neither distort speech nor interrupt the conversation.
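One classic speech enhancement method, shown here purely as an illustration (not necessarily the lab's approach), is magnitude spectral subtraction: estimate the noise spectrum from a noise-only segment and subtract it from each frame, keeping a small spectral floor to limit distortion.

```python
import numpy as np

# Hedged sketch of single-frame magnitude spectral subtraction.
# The signal, noise, and floor value are illustrative choices.

def spectral_subtract(noisy, noise_est, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from one frame."""
    spec = np.fft.rfft(noisy)
    mag = np.abs(spec) - np.abs(np.fft.rfft(noise_est))
    mag = np.maximum(mag, floor * np.abs(spec))   # spectral floor
    return np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=len(noisy))

rng = np.random.default_rng(1)
n = 512
clean = np.sin(2 * np.pi * 50 * np.arange(n) / n)   # tone at an exact DFT bin
noise = 0.3 * rng.normal(size=n)
enhanced = spectral_subtract(clean + noise, noise)

# Enhancement should bring the frame closer to the clean signal.
err_before = np.mean(noise ** 2)
err_after = np.mean((enhanced - clean) ** 2)
print(err_after < err_before)
```

The spectral floor is the knob that trades noise suppression against speech distortion: too aggressive a subtraction produces the "musical noise" artifacts that interrupt natural-sounding communication.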

Super Human-Machine

  • Haptic Communication

A haptic man-machine interface aims to augment the physical environment around us through haptic feedback, extending our senses (a sixth sense or superhuman sense).

Machine Learning

  • Chinese-English MixASR
Our deep LSTM-based ASR ranked 2nd overall and 1st on English in the extended submission of the OC16 Chinese-English MixASR Challenge.

  • Reference GPU Server for Machine Learning Research

Our workhorse: Keras + TensorFlow + Ubuntu 16.04 on a Tyan FT77C-B7079 4U 8-bay server with 8 ASUS ROG Strix GeForce GTX 1080 GPUs.

  • Language Recognition

Our gated DNN system for the NIST 2015 language recognition i-vector machine learning challenge was designed to solve the language clustering and out-of-set detection issues simultaneously. It achieves a relative performance gain of up to 51% over the baseline cosine distance scoring (CDS) system provided by the organizer.
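For context, the CDS baseline mentioned above can be sketched as follows: each language is represented by a mean i-vector, and a test i-vector is scored against every language by cosine similarity. The i-vectors below are random stand-ins, not NIST challenge data, and the language labels are arbitrary examples.

```python
import numpy as np

# Sketch of cosine distance scoring (CDS) for language recognition.
# All vectors and language names here are synthetic placeholders.

def cds_scores(test_ivec, lang_means):
    """Cosine similarity of one test i-vector against each language mean."""
    t = test_ivec / np.linalg.norm(test_ivec)
    return {lang: float(t @ (m / np.linalg.norm(m)))
            for lang, m in lang_means.items()}

rng = np.random.default_rng(2)
dim = 400                                  # common i-vector dimensionality
lang_means = {lang: rng.normal(size=dim) for lang in ("eng", "cmn", "spa")}
test_ivec = lang_means["cmn"] + 0.1 * rng.normal(size=dim)  # near "cmn"

scores = cds_scores(test_ivec, lang_means)
best = max(scores, key=scores.get)
print(best)  # "cmn": the closest language mean wins
```

CDS has no notion of out-of-set languages or of clusters of confusable languages, which is exactly the gap the gated DNN system targets.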

  • Speaker Recognition

Example of FA-DNN outputs for speaker recognition: (a) original speaker i-vectors, (b) purified speaker i-vectors.

  • Factor Analysis Neural Networks

Although deep neural networks (DNNs) are very powerful, they can still be easily affected by noise. We have developed a new factor analysis DNN (FA-DNN) structure and training algorithm that successfully separates wanted signals from noise.
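The factor-analysis intuition behind this separation can be shown with a linear toy model (the actual FA-DNN is a trained network, not this): model each observation as a wanted component lying in a low-dimensional signal subspace plus broadband noise, and "purify" it by projecting onto that subspace.

```python
import numpy as np

# Linear toy illustrating signal/noise factor separation. The subspace,
# dimensions, and noise level are assumptions for illustration only.

rng = np.random.default_rng(3)
dim, sig_rank = 20, 3
V = np.linalg.qr(rng.normal(size=(dim, sig_rank)))[0]  # signal subspace basis

signal = V @ rng.normal(size=sig_rank)    # wanted component (in the subspace)
noise = 0.1 * rng.normal(size=dim)        # unwanted full-dimensional noise
observed = signal + noise

purified = V @ (V.T @ observed)           # project onto the signal subspace
print(np.linalg.norm(purified - signal) < np.linalg.norm(observed - signal))
```

The projection keeps the signal intact (it already lies in the subspace) while discarding every noise component orthogonal to it, which is the same separation the "purified" FA-DNN outputs above depict.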

  • Handwriting Recognition

Example of FA-DNN outputs for handwritten digit recognition: (a) original digits, (b) purified digits.


  • Acoustic Communication/Networking

Sound is a natural, powerful, and universal medium for wireless communication. We would therefore like to build an acoustic communication/networking system that transmits messages directly through the air.
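A minimal way to send bits through the air with sound is audio-band frequency-shift keying (FSK): one tone per bit value. The sketch below is a hedged illustration of that general technique; the frequencies, symbol length, and decoding-by-DFT-bin are assumed example choices, not the lab's actual modulation scheme.

```python
import numpy as np

# Illustrative audio FSK modem. FS, F0, F1, and SYM are example values
# chosen so each tone lands exactly on a DFT bin of one symbol.

FS = 16000                 # sample rate (Hz)
F0, F1 = 1000, 2000        # tone for bit 0 / bit 1 (Hz)
SYM = 800                  # samples per symbol (50 ms)

def modulate(bits):
    """Emit one pure tone per bit."""
    t = np.arange(SYM) / FS
    return np.concatenate(
        [np.sin(2 * np.pi * (F1 if b else F0) * t) for b in bits])

def demodulate(audio):
    """Pick the stronger of the two tones in each symbol."""
    k0 = round(F0 * SYM / FS)              # DFT bin of each tone
    k1 = round(F1 * SYM / FS)
    bits = []
    for i in range(0, len(audio), SYM):
        spec = np.abs(np.fft.rfft(audio[i:i + SYM]))
        bits.append(1 if spec[k1] > spec[k0] else 0)
    return bits

sent = [1, 0, 1, 1, 0]
received = demodulate(modulate(sent))
print(received == sent)  # True: a lossless channel round-trips the bits
```

A real over-the-air system would additionally need synchronization, error correction, and robustness to room acoustics, which is where the research lies.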

  • Indoor Navigation

Indoor navigation and its applications have recently become a hot research topic. We want to combine air beacons and internet information retrieval to build an indoor navigation system for Location-Based Services (LBS).

  • Cross-Platform Acoustic Communication

Sound is a universal medium for broadcasting or exchanging messages between different platforms, for example, between iOS/Android smart TVs, tablets, and mobile phones.