Speech Recognition

Speech recognition technology estimates the words spoken in an input speech signal. We focus on developing and refining this core technology using pattern recognition and machine learning techniques, with particular emphasis on creating technologies that remain robust to changes in the acoustic environment.

Zero-Latency Streaming Speech Recognition

Highly accurate, low-latency streaming speech recognition that completes the recognition process at the end of a spoken sentence

Typically, speech recognition searches for the best combination of letters and words once an utterance has ended. Accuracy can be improved further by looking slightly ahead in the signal to guide the search, but such look-ahead is incompatible with the rhythm of a spoken dialogue system. In natural conversation, interruptions, back-channels, and nods occur in response to the content of the other party's speech, so any delay between the end of an utterance and the recognition result significantly hinders the generation of a natural response. We are conducting research and development on streaming speech recognition that recognizes speech content in real time and completes the recognition process, accurately and without delay, at the end of the utterance.
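
As a concrete illustration, the sketch below shows chunk-synchronous greedy CTC decoding, one common way to emit hypotheses while audio is still arriving rather than waiting for the utterance to end. The encoder interface, blank index, and chunking are illustrative assumptions, not the specific system described above.

```python
import torch

BLANK = 0  # assumed CTC blank index

@torch.no_grad()
def stream_decode(encoder, chunks):
    """Emit token IDs as audio chunks arrive instead of waiting for the end."""
    hypothesis, prev = [], BLANK
    for chunk in chunks:                     # chunk: (1, time, feat) features
        log_probs = encoder(chunk)           # (1, frames, vocab) CTC posteriors
        for label in log_probs[0].argmax(dim=-1).tolist():
            # Standard CTC collapse: drop blanks and repeated labels.
            if label != BLANK and label != prev:
                hypothesis.append(label)     # token is available immediately
            prev = label
    return hypothesis  # complete as soon as the final chunk is processed
```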

Relevant Publications:

HC-CTC

Feature representation learning method that makes it possible to construct high-level (highly abstract) information by assembling low-level (less abstract) information

By conditioning coarse-grained (e.g., word-level) predictions on fine-grained (e.g., phoneme-level) predictions, the model explicitly learns the process by which word sequences are generated. We expect feature extraction for word prediction to be learned effectively by progressively raising the abstraction level of the linguistic information, mirroring the conversion from speech sounds to phonemes, phonemes to words, and words to text.
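
The sketch below is a minimal PyTorch rendering of this idea: a phoneme-level CTC branch feeds its posteriors into an upper encoder, so the word-level CTC branch is conditioned on the lower-level predictions. Layer sizes, vocabularies, and the exact conditioning scheme are assumptions for illustration, not the published architecture.

```python
import torch
import torch.nn as nn

class HierCTC(nn.Module):
    def __init__(self, feat=80, hid=256, n_phone=50, n_word=5000):
        super().__init__()
        self.lower = nn.LSTM(feat, hid, 2, batch_first=True)  # low-level encoder
        self.phone_head = nn.Linear(hid, n_phone)             # phoneme CTC branch
        # The upper encoder sees hidden states plus phoneme posteriors,
        # conditioning coarse predictions on fine ones.
        self.upper = nn.LSTM(hid + n_phone, hid, 2, batch_first=True)
        self.word_head = nn.Linear(hid, n_word)               # word CTC branch

    def forward(self, x):
        h, _ = self.lower(x)
        phone_logits = self.phone_head(h)
        cond = torch.cat([h, phone_logits.softmax(-1)], dim=-1)
        g, _ = self.upper(cond)
        word_logits = self.word_head(g)
        # Train with the sum of two CTC losses (nn.CTCLoss), one per level.
        return phone_logits, word_logits
```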

Relevant Publications:

Mask-CTC

Feature representation learning method that can alleviate conditional independence constraints in Connectionist Temporal Classification (CTC)

We propose Mask-CTC, a feature representation learning method that relaxes the conditional independence assumption of CTC by taking long-term context into account through multi-task learning of CTC and mask prediction. Mask prediction also introduces a mechanism (dynamic length prediction) that can robustly compensate for substitution, insertion, and deletion errors. Furthermore, a feature representation that is aware of long-term context is advantageous for anticipating upcoming speech, and has been shown to be effective for low-latency, high-accuracy streaming speech recognition. This research has been conducted in collaboration with Prof. Shinji Watanabe at Carnegie Mellon University.
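
A minimal sketch of Mask-CTC-style inference, assuming a CTC encoder and a masked-prediction decoder as separate components: low-confidence tokens in the greedy CTC output are masked and then refilled using context on both sides. The special token IDs and threshold are placeholders, and only a single refinement pass is shown where the actual method refines iteratively.

```python
import torch

BLANK, MASK, THRESH = 0, 1, 0.95  # assumed special IDs and confidence threshold

@torch.no_grad()
def mask_ctc_decode(ctc_model, mlm_decoder, feats):
    probs = ctc_model(feats).softmax(-1)             # (1, frames, vocab)
    conf, ids = probs[0].max(-1)                     # per-frame best label
    # Collapse the frame-level alignment: drop blanks and repeated labels.
    prev = torch.cat([ids.new_full((1,), -1), ids[:-1]])
    keep = (ids != BLANK) & (ids != prev)
    tokens, conf = ids[keep], conf[keep]
    low = conf < THRESH
    tokens[low] = MASK                               # hide unreliable positions
    refined = mlm_decoder(tokens.unsqueeze(0), feats).argmax(-1)[0]
    tokens[low] = refined[low]                       # refill masked slots only
    return tokens
```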

Relevant Publications:

Speech Recognition with Metacognitive Capabilities 

Computational implementation of metacognition, the ability to know whether or not one knows something

Our goal is to develop a pattern recognition system that achieves high performance even for unknown inputs by means of multi-stream pattern recognition. This approach prepares multiple complementary recognition systems and, using performance monitoring that plays the role of the metacognitive function, selects among them according to the nature of the input data. In this way, we aim to create a pattern recognizer that is not solely reliant on data collection and can deliver robustly high performance. This research is being conducted in collaboration with Professor Hynek Hermansky at Johns Hopkins University.
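
As a rough sketch of the idea, the monitor below scores each stream by the average entropy of its frame-level posteriors, a common confidence proxy, and selects the stream it trusts most. The entropy-based monitor and the stream interfaces are illustrative assumptions, not the project's actual monitoring method.

```python
import torch

def posterior_entropy(log_probs):
    """Mean entropy of per-frame posteriors; lower suggests a better match."""
    p = log_probs.exp()
    return -(p * log_probs).sum(-1).mean()

@torch.no_grad()
def select_stream(streams, feats):
    """Run every recognition stream and keep the one the monitor trusts most."""
    outputs = [model(feats).log_softmax(-1) for model in streams]
    scores = [posterior_entropy(o) for o in outputs]
    best = min(range(len(streams)), key=lambda i: scores[i])
    return best, outputs[best]
```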

Relevant Publications:

Domain Adaptation and Generalization of Language Models

Methods to accurately acquire and effectively use feature representations related to vocabulary and context common to a domain

When large training texts are unavailable for the domain of interest, adapting a language model built from large amounts of out-of-domain text with a small amount of target-domain text (domain adaptation) can improve the model's performance. In this research, we investigate domain adaptation techniques for recurrent neural network language models, specifically targeting improved recognition of multi-party dialogue speech. Our main focus is on methods that accurately acquire and effectively use feature representations of the vocabulary and context common to the domain, as well as methods for efficiently incorporating auxiliary information as input to the neural network.
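
A minimal sketch of both ingredients, assuming a PyTorch LSTM language model: auxiliary domain information enters as an embedding concatenated to the word embedding, and adaptation consists of briefly fine-tuning out-of-domain weights on the small in-domain corpus. The architecture, domain-ID scheme, and checkpoint name are hypothetical.

```python
import torch
import torch.nn as nn

class DomainRNNLM(nn.Module):
    def __init__(self, vocab=10000, emb=256, hid=512, n_domains=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, emb)
        self.dom_emb = nn.Embedding(n_domains, emb)  # auxiliary domain input
        self.rnn = nn.LSTM(2 * emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, words, domain_id):
        w = self.word_emb(words)                         # (B, T, emb)
        d = self.dom_emb(domain_id)[:, None].expand_as(w)
        h, _ = self.rnn(torch.cat([w, d], dim=-1))
        return self.out(h)                               # next-word logits

# Adaptation: start from out-of-domain weights, then fine-tune briefly on the
# small target-domain text, typically with a reduced learning rate.
model = DomainRNNLM()
# model.load_state_dict(torch.load("large_domain_lm.pt"))  # hypothetical checkpoint
optim = torch.optim.SGD(model.parameters(), lr=0.1)
```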

Relevant Publications:

Robust Feature Extraction (Disentangling)

Neural networks for separating and extracting complex mixtures of information, such as speech content and speaker identity

Speech conveys not only content information (what is said) but also speaker information (who is speaking). In general, speech recognition (recognition of content) is not robust to differences in speaker, and conversely, speaker recognition is not robust to differences in content. To address this, researchers have proposed acoustic features that are robust to speaker differences, such as RASTA-PLP and bottleneck features (BNF), which achieve high speech recognition performance. Interestingly, these features also improve speaker recognition accuracy, even though the speaker information should have been discarded. To unravel this mystery, we are exploring the use of neural networks to separate and extract the complex mixture of information in speech, including speech content and speaker identity.
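
One common disentangling recipe is sketched below under the assumption of an adversarial setup: the encoder is trained against a speaker classifier through gradient reversal, so its content features shed speaker information while the content head is trained with the recognition loss. This illustrates the general technique, not necessarily the method used in this study.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x
    @staticmethod
    def backward(ctx, grad):
        return -grad  # flip gradients so the encoder fools the speaker head

class Disentangler(nn.Module):
    def __init__(self, feat=80, hid=256, n_tokens=500, n_speakers=100):
        super().__init__()
        self.encoder = nn.GRU(feat, hid, batch_first=True)
        self.content_head = nn.Linear(hid, n_tokens)    # trained with ASR loss
        self.speaker_head = nn.Linear(hid, n_speakers)  # adversarial branch

    def forward(self, x):
        h, _ = self.encoder(x)
        content = self.content_head(h)                  # per-frame token logits
        speaker = self.speaker_head(GradReverse.apply(h.mean(dim=1)))
        return content, speaker
```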

Relevant Publications:

Lombard Speech Recognition

Investigating the impact of the Lombard effect on speech recognition performance

When speaking in a noisy environment, people tend to raise their voices and speak at a higher pitch, a phenomenon known as the Lombard effect. In this study, we investigate how the Lombard effect affects speech recognition performance. Typically, when evaluating speech recognition in noisy environments, researchers take a dry (noise-free, anechoic) source recording of the speech, convolve it with the impulse response of the target environment, and superimpose noise to simulate real-world conditions. However, because dry recordings do not capture the Lombard effect, they may not accurately reflect recognition performance on speech actually produced in noise. To address this, we created a Lombard speech corpus containing recordings of speech under various noise types and noise levels.
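
The conventional simulation pipeline described above reduces to a few lines; the sketch below convolves a dry recording with a room impulse response and mixes in noise at a target SNR. Array inputs and sampling details are assumed, and this is precisely the procedure whose realism the Lombard corpus is meant to test.

```python
import numpy as np
from scipy.signal import fftconvolve

def simulate(dry, rir, noise, snr_db):
    """Reverberate `dry` with `rir`, then mix in `noise` at `snr_db` dB SNR."""
    wet = fftconvolve(dry, rir)[: len(dry)]   # apply the room impulse response
    noise = noise[: len(wet)]
    # Scale the noise so the speech-to-noise power ratio matches snr_db.
    p_sig = np.mean(wet ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
    return wet + gain * noise
```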

Relevant Publications: