

Our Focus:
(1) Spoken Language Understanding (e.g., Speech Recognition/Synthesis and Natural Language Processing)
(2) Multimedia Signal Processing (e.g., Speech, Audio, Text, Image and Video Signal Processing)
(3) Machine Learning (e.g., Deep Learning)

Spoken Language Understanding

  • Spoken Dialog System

*Demo: Voice-Assisted Car Navigation System
Speech is the most natural, powerful and universal medium for human-machine communication. Our focus is to develop all the modules necessary for a spoken dialog system, including robust speech, speaker and language recognition and natural speech synthesis.

  • Realtime Multilingual ASR

*Demo: YouTube
An Android app with a real-time multilingual (mixed Chinese/English) speech recognition server, built on:
1) deep neural networks, 2) WebSocket/HTTP and 3) GStreamer.
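
A minimal sketch of a client for such a server, assuming a WebSocket endpoint that accepts raw PCM chunks and pushes back partial hypotheses as text frames. The URI, the 0.25 s chunking and the end-of-stream marker are illustrative assumptions, not the lab's actual protocol.

```python
# Minimal sketch of a streaming client for a WebSocket-based ASR server.
# Endpoint URI, chunking scheme and end-of-stream marker are assumptions.
import asyncio
import wave

import websockets  # pip install websockets


async def stream_file(path: str, uri: str = "ws://localhost:8080/asr") -> None:
    """Send 16 kHz mono 16-bit PCM in ~0.25 s chunks, print partial results."""
    async with websockets.connect(uri) as ws:
        with wave.open(path, "rb") as wav:
            chunk_frames = wav.getframerate() // 4         # 0.25 s of audio
            while True:
                data = wav.readframes(chunk_frames)
                if not data:
                    break
                await ws.send(data)                        # raw PCM bytes
                # Server is assumed to push partial hypotheses as text frames.
                print(await ws.recv())
        await ws.send(b"")                                 # assumed end-of-stream marker


if __name__ == "__main__":
    asyncio.run(stream_file("test_zh_en.wav"))
```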

  • Automatic Radio Transcriber

*Demo: YouTube or Google Drive *Online: Server
Automatic Broadcast Radio Subtitle Generation (audio to *.srt) using Deep Neural Networks. This is a real-time multilingual (mixed Chinese/English) speech recognizer. Our aim is to study semi-supervised/unsupervised training.
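
A minimal sketch of the final step of this pipeline: formatting recognized (start, end, text) segments as a SubRip *.srt file. The segment times and texts below are dummy placeholders, not recognizer output.

```python
# Minimal sketch of audio-to-*.srt subtitle writing. Segments are dummy data.
def srt_time(t: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    h, rem = divmod(int(t * 1000), 3600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def write_srt(segments, path="out.srt"):
    """segments: iterable of (start_sec, end_sec, text) tuples."""
    with open(path, "w", encoding="utf-8") as f:
        for i, (start, end, text) in enumerate(segments, 1):
            f.write(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n\n")


write_srt([(0.0, 2.5, "大家好 welcome to the program"),
           (2.5, 5.0, "今天我們談 deep learning")])
```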

  • Large-Scale Radio Speech Corpus

The Radio Speech Corpus will consist of more than 1,000 hours of audio recordings provided by Taiwan's National Education Radio (NER) archive. This corpus provides a rich resource for research in speech processing and automatic speech recognition (ASR). It will be publicly released in the near future.

  • Chinese-English MixTTS

*Demo: Google Drive
To approach the goal of establishing an End-to-End speech synthesis system, we propose to use character-level recurrent neural networks (RNNs) to directly convert input character sequences into latent linguistic feature vectors.
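
A minimal Keras sketch of such a character-level RNN front end, mapping a character sequence to one latent linguistic feature vector per character. The vocabulary size, dimensions and the bidirectional-GRU choice are illustrative assumptions, not the lab's actual MixTTS architecture.

```python
# Sketch of a character-level RNN encoder for TTS front-end features.
# Vocabulary size, dimensions and layer choices are assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

VOCAB = 8000      # e.g. Chinese characters + Latin letters + punctuation
LATENT = 256      # latent linguistic feature dimension

char_ids = keras.Input(shape=(None,), dtype="int32")       # variable-length text
x = layers.Embedding(VOCAB, 128, mask_zero=True)(char_ids)
x = layers.Bidirectional(layers.GRU(LATENT // 2, return_sequences=True))(x)
encoder = keras.Model(char_ids, x, name="char_rnn_encoder")

# One fake mixed-language sentence, already mapped to integer character IDs.
batch = np.random.randint(1, VOCAB, size=(1, 20))
print(encoder(batch).shape)   # (1, 20, 256): one latent vector per character
```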

  • Automatic Sentiment Information Extraction
Automatic sentiment information extraction from social network articles has many essential applications. Following the valence-arousal space framework, we developed a neural network (NN) model that predicts the valence-arousal ratings of Chinese words.
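
A minimal Keras sketch of such a dimensional-sentiment regressor: a small feed-forward network maps a word embedding to a (valence, arousal) pair. The embedding dimension, network size, rating scale and the use of pretrained word vectors as input are illustrative assumptions, not the lab's actual model.

```python
# Sketch of an NN regressor from word embeddings to (valence, arousal).
# Dimensions, rating scale and training data are dummy assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

EMB_DIM = 300                                  # assumed word-vector size

model = keras.Sequential([
    layers.Input(shape=(EMB_DIM,)),
    layers.Dense(128, activation="relu"),
    layers.Dense(2)                            # [valence, arousal]
])
model.compile(optimizer="adam", loss="mse")

# Dummy training data: 1000 word vectors with ratings on a 1-9 scale.
x = np.random.randn(1000, EMB_DIM).astype("float32")
y = np.random.uniform(1, 9, size=(1000, 2)).astype("float32")
model.fit(x, y, epochs=2, batch_size=32, verbose=0)
print(model.predict(x[:1]))                    # predicted (valence, arousal)
```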

  • Smart Home User Interface

*Demo: SmartTV - Voice and Gesture Control
Smart Home has recently become a hot research and development (R&D) topic. Our focus is to build a spoken-dialog-based Smart Housekeeper and to make speech, speaker and language recognition more robust and useful in daily life.

  • DAISY Digital Talking Book Player

We have developed a speech synthesis-enabled DAISY player. It is small, portable and designed for use by people with "print disabilities", including blindness, impaired vision, and dyslexia.

  • Speech Summarization, Analysis and Organization

Multimedia data (on YouTube, in live broadcasts, ...) are increasing dramatically. We therefore need an efficient way to browse, use and archive such speech data through speech/speaker recognition, story summarization and organization.

  • Computer-Assisted Language Learning

Second language learning is increasingly important in today's global village. We would therefore like to build a computer-assisted language learning system that can simultaneously detect pronunciation errors, speech prosody deviations and dialogue act mistakes.

Multimedia Signal Processing



  • Audio Event Tokenizer

To automatically extract speech segments from an audio recording, a deep convolutional neural network is applied. An audio tokenizer is then built to cut input audio files into speech, music and other segments and to output *.srt files.
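
A minimal sketch of the segmentation step of such a tokenizer: per-frame class decisions (standing in for CNN outputs) are merged into contiguous speech/music/other segments. The 10 ms frame hop and the label set are illustrative assumptions.

```python
# Sketch of merging frame-level CNN decisions into audio event segments.
# Frame hop and labels are assumptions; frame decisions below are dummies.
from itertools import groupby

HOP_SEC = 0.010                       # assumed frame hop of the CNN classifier


def frames_to_segments(frame_labels):
    """Merge per-frame labels into (start_sec, end_sec, label) segments."""
    segments, t = [], 0.0
    for label, run in groupby(frame_labels):
        n = sum(1 for _ in run)
        segments.append((t, t + n * HOP_SEC, label))
        t += n * HOP_SEC
    return segments


# Dummy frame decisions standing in for CNN outputs over 0.9 s of audio.
frames = ["music"] * 30 + ["speech"] * 50 + ["other"] * 10
for start, end, label in frames_to_segments(frames):
    print(f"{start:6.2f} -> {end:6.2f}  {label}")
```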

  • Distant Speech Recognition

The performance of conventional automatic speech recognition (ASR) systems degrades dramatically when the microphone is far away from the speakers. To alleviate this issue, an eight-element omni-directional electret condenser microphone array and a recurrent neural network-based algorithm are used to resist distortion from background noise, overlapping speech and reverberation.
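
A minimal numpy sketch of delay-and-sum beamforming, a classical front end for such an eight-microphone array. The linear geometry, 4 cm spacing and steering angle are illustrative assumptions; the lab's RNN-based enhancement stage is not shown.

```python
# Sketch of delay-and-sum beamforming for an 8-mic linear array.
# Geometry, spacing and steering angle are assumptions.
import numpy as np

FS = 16000                 # sample rate (Hz)
C = 343.0                  # speed of sound (m/s)
SPACING = 0.04             # assumed 4 cm spacing between the 8 mics


def delay_and_sum(mics: np.ndarray, angle_deg: float) -> np.ndarray:
    """mics: (8, n_samples) array; steer toward angle_deg off broadside."""
    n_ch, n = mics.shape
    out = np.zeros(n)
    for ch in range(n_ch):
        # Time-of-arrival difference for this channel, in samples.
        tau = ch * SPACING * np.sin(np.radians(angle_deg)) / C
        shift = int(round(tau * FS))
        out += np.roll(mics[ch], -shift)       # align, then average
    return out / n_ch


signals = np.random.randn(8, FS)               # 1 s of dummy 8-channel audio
enhanced = delay_and_sum(signals, angle_deg=30.0)
```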

  • Acoustic Event Detection

Sound event detection is essential for advanced smart-home applications. We would like to build a system that integrates several Kinect One sensors for elderly care, baby monitoring and, especially, home security.

  • Microphone Array

Microphone arrays are key to the success of mobile phone, Smart-TV and Smart-Home applications. In particular, a good microphone array should not only remove background noise but also allow a speaker to move freely to any position.


  • Speech Enhancement

Speech enhancement and echo cancellation are essential for good mobile communication. In particular, they should neither distort speech nor interrupt communication.
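
A minimal numpy sketch of magnitude spectral subtraction, one of the simplest classical speech enhancement methods (not the lab's own algorithm). Treating the leading 0.25 s as noise-only and the 5% spectral floor are illustrative choices.

```python
# Sketch of magnitude spectral subtraction for speech enhancement.
# Noise estimate (leading 0.25 s) and spectral floor are assumptions.
import numpy as np

FS, N_FFT, HOP = 16000, 512, 256


def enhance(noisy: np.ndarray) -> np.ndarray:
    frames = [noisy[i:i + N_FFT] for i in range(0, len(noisy) - N_FFT, HOP)]
    spectra = np.array([np.fft.rfft(f * np.hanning(N_FFT)) for f in frames])
    mag, phase = np.abs(spectra), np.angle(spectra)

    noise_mag = mag[: FS // 4 // HOP].mean(axis=0)       # noise-only lead-in
    clean_mag = np.maximum(mag - noise_mag, 0.05 * mag)  # subtract, keep floor

    # Overlap-add the enhanced frames back to a waveform.
    out = np.zeros(len(noisy))
    for i, spec in enumerate(clean_mag * np.exp(1j * phase)):
        out[i * HOP:i * HOP + N_FFT] += np.fft.irfft(spec, N_FFT)
    return out


enhanced = enhance(np.random.randn(FS))          # 1 s of dummy noisy audio
```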

Super Human-Machine Interface

  • Haptic Communication

A haptic man-machine interface aims to augment the physical environment around us through haptic feedback, extending our senses (a sixth or superhuman sense).


Machine Learning


  • Chinese-English MixASR
Our deep LSTM-based ASR ranked 2nd overall and 1st on English in the extended submission of the OC16 Chinese-English MixASR Challenge.

  • Reference GPU Server for Machine Learning Research

Our workhorse: Keras + TensorFlow + Ubuntu 16.04 on a Tyan FT77C-B7079 4U 8-bay server with 8 × ASUS ROG Strix GeForce GTX 1080 GPUs.

  • Language Recognition

We built a gated DNN system for the NIST 2015 Language Recognition i-Vector Machine Learning Challenge. It was designed to solve the language clustering and out-of-set detection issues simultaneously, and it achieves a relative performance gain of up to 51% over the baseline cosine distance scoring (CDS) system provided by the organizer.
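
A minimal numpy sketch of the CDS baseline mentioned above: each language is modeled by the mean of its length-normalized i-vectors, and a test i-vector is scored by cosine similarity against every language model. The dimensions and data are dummy placeholders.

```python
# Sketch of cosine distance scoring (CDS) for language recognition.
# i-vector dimension, language count and data are dummy assumptions.
import numpy as np

DIM, N_LANG = 400, 10                        # i-vector size, #languages


def unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


rng = np.random.default_rng(0)
# Dummy development data: 100 length-normalized i-vectors per language,
# averaged into one unit-length model vector per language.
lang_means = unit(np.stack([unit(rng.normal(size=(100, DIM))).mean(axis=0)
                            for _ in range(N_LANG)]))

test_ivec = unit(rng.normal(size=DIM))
scores = lang_means @ test_ivec              # cosine similarity per language
print("best language:", int(np.argmax(scores)))
```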


  • Speaker Recognition

Example of FA-DNN outputs for speaker recognition: (a) original speaker i-vectors, (b) purified speaker i-vectors.

  • Factor Analysis Neural Networks

Although deep neural networks (DNNs) are very powerful, they can still be easily affected by noise. We have developed a new factor analysis DNN (FA-DNN) structure and training algorithm that successfully separates wanted signals from noise.
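
A conceptual Keras sketch of the factor-analysis idea: the input is decomposed into a "wanted signal" branch and a "noise" branch whose sum reconstructs the input. This is only an illustration under our own assumptions, not the lab's actual FA-DNN structure or training algorithm.

```python
# Conceptual sketch of a signal/noise factor decomposition network.
# Architecture, dimensions and loss are assumptions, not the real FA-DNN.
from tensorflow import keras
from tensorflow.keras import layers

DIM = 400                                             # feature (e.g. i-vector) size

inp = keras.Input(shape=(DIM,))
h = layers.Dense(512, activation="relu")(inp)
signal = layers.Dense(DIM, name="signal_factor")(h)   # purified component
noise = layers.Dense(DIM, name="noise_factor")(h)     # nuisance component
recon = layers.Add()([signal, noise])                 # input = signal + noise

model = keras.Model(inp, recon)
model.compile(optimizer="adam", loss="mse")           # reconstruction loss
model.summary()
```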


  • Handwriting Recognition

Example of FA-DNN outputs for handwritten digit recognition: (a) original digits, (b) purified digits.


Wireless Communication

  • Acoustic Communication/Networking

Sound is the most natural, powerful and universal medium for wireless communication. We would therefore like to build an acoustic communication/networking system that transmits messages directly through the air.
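
A minimal numpy sketch of the transmit side of such a system: a byte string is encoded as audio-frequency FSK tones that could be played through a speaker. The tone frequencies, symbol length and framing are illustrative assumptions, not the lab's actual modulation scheme.

```python
# Sketch of audio FSK encoding for acoustic communication.
# Frequencies, symbol duration and framing are assumptions.
import numpy as np

FS = 44100                      # sample rate (Hz)
SYM_SEC = 0.05                  # 50 ms per bit
F0, F1 = 4000.0, 6000.0         # tone for bit 0 / bit 1


def encode(payload: bytes) -> np.ndarray:
    """Map each bit (LSB first) to a pure tone and concatenate."""
    bits = [(byte >> i) & 1 for byte in payload for i in range(8)]
    t = np.arange(int(FS * SYM_SEC)) / FS
    tones = [np.sin(2 * np.pi * (F1 if b else F0) * t) for b in bits]
    return np.concatenate(tones).astype(np.float32)


waveform = encode(b"hello")     # play via any audio API to transmit "hello"
print(len(waveform) / FS, "seconds of audio")
```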

  • Indoor Navigation

Indoor navigation and its applications have recently become a hot research topic. We want to combine Air-Beacons and Internet information retrieval to build an indoor navigation system for Location-Based Services (LBS).

  • Cross-Platform Acoustic Communication

Sound is a universal medium for broadcasting or exchanging messages between different platforms, for example between iOS/Android smart TVs, tablets and mobile phones.