Speech Recognition
Speech recognition technology estimates the spoken word sequence from an input speech signal. We focus on developing and refining this core technology using pattern recognition and machine learning techniques, with particular emphasis on technologies that remain robust to environmental changes.
Zero-Latency Streaming Speech Recognition
Highly accurate, low-latency streaming speech recognition that completes the recognition process at the end of a spoken sentence
Typically, speech recognition searches for the best combination of letters and words once an utterance has ended. While this yields highly accurate recognition, performance can be improved further by using a small amount of future context (look-ahead) to guide the search. However, the resulting delay is incompatible with rhythmic conversation in a spoken dialogue system. In natural conversation, interruptions, back-channels, and nods occur in response to the content of the other party's speech, and any delay between the end of the speech and the recognition result hinders the generation of a timely, natural response. We are conducting research and development on streaming speech recognition that recognizes speech content in real time and completes the recognition accurately, without delay, at the end of the speech.
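The trade-off the paragraph describes can be made concrete with a toy simulation. Everything below is an illustrative assumption (frame counts, chunk sizes, the decoder itself), not the actual system: it only shows why a decoder that peeks at future frames delivers its final result after the speech has ended.

```python
# Toy sketch (not the actual system): look-ahead frames buy acoustic
# context at the cost of emission latency in streaming ASR.

def streaming_decode(num_frames, chunk_size, lookahead):
    """Return, for each chunk, the frame index at which its partial
    result can be emitted.  A decoder with look-ahead must wait for
    `lookahead` future frames to arrive before decoding a chunk, so
    even the final result is delayed by that amount."""
    emit_times = []
    for start in range(0, num_frames, chunk_size):
        end = min(start + chunk_size, num_frames)
        emit_times.append(end + lookahead)
    return emit_times

num_frames = 100  # e.g. 100 frames x 10 ms = 1 s of speech
for la in (0, 5, 20):
    emit = streaming_decode(num_frames, chunk_size=10, lookahead=la)
    delay = emit[-1] - num_frames
    print(f"lookahead={la:2d} frames -> final result {delay} frames after speech ends")
```

With `lookahead=0` the final result is available exactly when the speech ends, which is the behavior the zero-latency work aims for while keeping look-ahead-level accuracy.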
Relevant Publications:
Huaibo Zhao, Shinya Fujie, Tetsuji Ogawa, Jin Sakuma, Yusuke Kida, Tetsunori Kobayashi, ``Conversation-oriented ASR with multi-look-ahead CBS architecture,'' Proc. 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2023), June 2023.
HC-CTC
Feature representation learning method that makes it possible to construct high-level (highly abstract) information by assembling low-level (less abstract) information
By conditioning coarse-grained (more abstract) predictions on fine-grained (less abstract) predictions, the process of generating word sequences is learned explicitly. We expect feature extraction for word prediction to be learned effectively by progressively raising the abstraction level of linguistic information, mirroring the conversion from speech sounds to phonemes, to words, to text.
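The hierarchical conditioning can be sketched structurally as a two-stage pipeline. The "frames", lexicon, and both predictors below are toy assumptions standing in for network branches; the point is only the direction of conditioning: the high-level (word) prediction sees the low-level (phoneme) hypothesis as input.

```python
# Illustrative sketch only: a two-level pipeline in the spirit of
# hierarchically conditioned CTC.  Lookup tables stand in for the
# intermediate and final network branches.

PHONE_OF_FRAME = {"f1": "h", "f2": "e", "f3": "l", "f4": "l", "f5": "o"}
WORD_OF_PHONES = {("h", "e", "l", "l", "o"): "hello"}

def predict_phonemes(frames):
    # Low-level prediction: frames -> phonemes (stands in for an
    # intermediate CTC branch in the lower encoder layers).
    return tuple(PHONE_OF_FRAME[f] for f in frames)

def predict_words(frames, phonemes):
    # High-level prediction conditioned on the low-level output:
    # the word branch sees both the acoustic input and the phoneme
    # hypothesis, mirroring the conditioning in HC-CTC.
    return WORD_OF_PHONES.get(phonemes, "<unk>")

frames = ["f1", "f2", "f3", "f4", "f5"]
phones = predict_phonemes(frames)
word = predict_words(frames, phones)
print(phones, "->", word)   # ('h', 'e', 'l', 'l', 'o') -> hello
```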
Mask-CTC
Feature representation learning method that can alleviate conditional independence constraints in Connectionist Temporal Classification (CTC)
We propose Mask-CTC, a feature representation learning method that relaxes the conditional independence assumption of CTC by taking long-term context into account through multi-task learning of CTC and mask prediction. Mask prediction also introduces a mechanism (dynamic length prediction) that robustly compensates for substitution, insertion, and deletion errors. Furthermore, the resulting context-aware feature representation is advantageous for anticipating upcoming speech and has proven effective for low-latency, high-accuracy streaming speech recognition. This research is conducted in collaboration with Prof. Shinji Watanabe at Carnegie Mellon University.
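The inference idea behind Mask-CTC can be sketched in a few lines. The token confidences and the mask-filling "predictor" below are placeholders (in the real system a conditional masked language model refines the masked positions); only the mask-then-fill control flow is the point.

```python
# Minimal sketch of Mask-CTC-style inference under toy assumptions:
# confidences and the mask-filling model below are placeholders.

MASK = "<mask>"

def mask_low_confidence(tokens, confidences, threshold=0.9):
    """Keep confident greedy-CTC tokens; replace the rest with <mask>."""
    return [t if c >= threshold else MASK for t, c in zip(tokens, confidences)]

def fill_masks(tokens, predictor):
    """Fill masked positions conditioned on the confident context
    (stands in for the conditional masked language model)."""
    return [predictor(tokens, i) if t == MASK else t
            for i, t in enumerate(tokens)]

# Toy example: greedy CTC output with per-token confidence.
ctc_tokens = ["th", "e", "c", "a", "t"]
ctc_conf   = [0.99, 0.95, 0.40, 0.97, 0.98]   # "c" is uncertain

masked = mask_low_confidence(ctc_tokens, ctc_conf)
# Placeholder predictor: a trained network would refine the masked
# token from the surrounding high-confidence tokens.
refined = fill_masks(masked, predictor=lambda toks, i: "c")
print(masked)   # ['th', 'e', '<mask>', 'a', 't']
print(refined)  # ['th', 'e', 'c', 'a', 't']
```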
Relevant Publications:
Huaibo Zhao, Yosuke Higuchi, Tetsuji Ogawa, Tetsunori Kobayashi, ``An investigation of enhancing CTC model for triggered attention-based streaming ASR,'' Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference 2021 (APSIPA2021), pp.477-483, Dec. 2021. [URL] [Scopus]
Yosuke Higuchi, Hirofumi Inaguma, Shinji Watanabe, Tetsuji Ogawa, Tetsunori Kobayashi, ``Improved Mask-CTC for non-autoregressive end-to-end ASR,'' Proc. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2021), pp.8363-8367, June 2021. [DOI]
Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, Tetsunori Kobayashi, ``Mask CTC: Non-autoregressive end-to-end ASR with CTC and mask predict,'' Proc. The 21st Annual Conference of the International Speech Communication Association (INTERSPEECH2020), pp.3655-3659, Oct. 2020. [DOI]
Speech Recognition with Metacognitive Capabilities
Computerized implementation of metacognitive functions: the ability to know whether or not one knows something
Our goal is to develop a pattern recognition system that achieves high performance even for unknown inputs by means of multi-stream pattern recognition: multiple complementary recognition systems are selected according to the nature of the input, using performance monitoring that corresponds to the metacognitive function. In this way we aim to build a robust, high-performance pattern recognizer that does not rely solely on data collection. This research is conducted in collaboration with Professor Hynek Hermansky at Johns Hopkins University.
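One simple form of performance monitoring is to compare how confident each parallel stream is about its own output. The sketch below is a hedged illustration, not the published method: the posteriors are invented, and real monitors (e.g. autoencoder reconstruction error or divergence-based measures) are considerably more involved.

```python
# Hedged sketch: stream selection via a posterior-entropy performance
# monitor.  The posteriors below are invented toy values.
import math

def entropy(posterior):
    """Shannon entropy of a class posterior: a low entropy suggests a
    confident, likely well-matched recognition stream."""
    return -sum(p * math.log(p) for p in posterior if p > 0.0)

def select_stream(stream_posteriors):
    """Pick the index of the stream whose posterior is most confident."""
    return min(range(len(stream_posteriors)),
               key=lambda i: entropy(stream_posteriors[i]))

# Two parallel recognizers scoring the same input over 3 classes:
clean_stream = [0.90, 0.05, 0.05]   # sharp posterior -> confident
noisy_stream = [0.40, 0.35, 0.25]   # flat posterior  -> uncertain
best = select_stream([clean_stream, noisy_stream])
print("selected stream:", best)   # selected stream: 0
```

The monitor plays the role of metacognition: the system "knows that it does not know" when every stream's posterior is flat, which can also trigger fallback behavior rather than a forced guess.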
Relevant Publications:
Tetsuji Ogawa, Harish Mallidi, Emmanuel Dupoux, Jordan Cohen, Naomi Feldman, Hynek Hermansky, ``A new efficient measure for accuracy prediction and its application to multistream-based unsupervised adaptation,'' Proc. ICPR2016, pp.2222-2227, Dec. 2016. [DOI] [Scopus]
Sri Harish Mallidi, Tetsuji Ogawa, Hynek Hermansky, ``Uncertainty estimation of DNN classifiers,'' Proc. ASRU2015, pp.283-288, Dec. 2015. [DOI] [Scopus]
Domain Adaptation and Generalization of Language Models
Methods to accurately acquire and effectively use feature representations related to vocabulary and context common to a domain
When large training texts are unavailable in the domain of interest, performance can be improved by adapting a language model built from large texts in other domains using a small amount of target-domain text (domain adaptation). In this research, we investigate domain adaptation techniques for recurrent neural network language models, specifically targeting improved recognition of multi-person dialogue speech. Our main focus is on methods that accurately acquire and effectively use feature representations of the vocabulary and context common to a domain, as well as methods for efficiently incorporating auxiliary information as input to the neural network.
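A common and simple way to incorporate auxiliary information, sketched below purely for illustration, is to append a domain attribute vector to each input of the network. The domain inventory, embedding values, and dimensions here are assumptions, not the published configuration.

```python
# Illustrative only: feeding a domain attribute to an RNN language
# model by appending a one-hot domain vector to each word embedding.

DOMAINS = ["lecture", "meeting", "broadcast"]   # assumed domain set

def domain_onehot(domain):
    return [1.0 if d == domain else 0.0 for d in DOMAINS]

def augment_embedding(word_embedding, domain):
    """Concatenate a domain-shared word embedding with a
    domain-specific one-hot vector, so the LM can learn both
    domain-shared and domain-specific behavior from one input."""
    return list(word_embedding) + domain_onehot(domain)

emb = [0.2, -0.1, 0.7]                      # toy 3-dim word embedding
x = augment_embedding(emb, "meeting")
print(x)   # [0.2, -0.1, 0.7, 0.0, 1.0, 0.0]
```

Replacing the one-hot vector with a learned domain embedding lets the model generalize to unseen domains by interpolating between domain attributes, which is the direction of the multi-source domain generalization work cited below.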
Relevant Publications:
Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Hiroto Ashikawa, Tetsunori Kobayashi, Tetsuji Ogawa, ``Multi-source domain generalization using domain attributes for recurrent neural network language models,'' IEICE Trans. Inf. & Syst., vol.E105-D, no.1, pp.150-160, Jan. 2022. [DOI] [Scopus]
Tsuyoshi Morioka, Naohiro Tawara, Tetsuji Ogawa, Atsunori Ogawa, Tomoharu Iwata, Tetsunori Kobayashi, ``Language model domain adaptation via recurrent neural network with domain-shared and domain-specific representations,'' Proc. ICASSP2018, pp.6084-6088, April 2018. [DOI] [Scopus]
Hiroto Ashikawa, Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Tetsunori Kobayashi, Tetsuji Ogawa, ``Exploiting end of sentences and speaker alternations in recurrent neural network-based language modeling for multiparty conversations,'' Proc. APSIPA2017, Dec. 2017. [DOI] [Scopus] [Poster Book Prizes]
Robust Feature Extraction (Disentangling)
Neural networks for separating and extracting complex mixtures of information such as speech content and speaker
Speech conveys not only content information (what is said) but also speaker information (who is speaking). In general, speech recognition (recognizing the content) is not robust to differences in speaker, and conversely, speaker recognition is not robust to differences in content. To address this, researchers have proposed acoustic features robust to speaker differences, such as RASTA-PLP and bottleneck features (BNF), which achieve high speech recognition performance. Interestingly, these features also improve speaker recognition accuracy, even though the speaker information is supposed to have been removed. To unravel this mystery, we are exploring neural networks that disentangle the complex mixture of information in speech, including spoken content and speaker identity.
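The adversarial training used in the publications below typically relies on gradient reversal: a speaker classifier is attached to the feature extractor, but its gradient is flipped before reaching the extractor, so the extractor learns to discard what the classifier needs. The framework-free sketch below only illustrates that one operation; real systems implement it inside an autodiff framework.

```python
# Sketch of the gradient reversal trick used in speaker-adversarial
# training, written framework-free for illustration only.

class GradientReversal:
    """Identity in the forward pass; scales gradients by -lambda in
    the backward pass, so the feature extractor is pushed to REMOVE
    information the speaker classifier finds useful."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                               # features pass through unchanged

    def backward(self, grad_from_speaker_classifier):
        return [-self.lam * g for g in grad_from_speaker_classifier]

grl = GradientReversal(lam=0.5)
features = [0.3, -1.2, 0.8]
assert grl.forward(features) == features       # unchanged forward
grad = grl.backward([0.2, -0.4, 1.0])
print(grad)   # [-0.1, 0.2, -0.5]
```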
Relevant Publications:
Yosuke Higuchi, Naohiro Tawara, Tetsunori Kobayashi, Tetsuji Ogawa, ``Speaker adversarial training of DPGMM-based feature extractor for zero-resource languages,'' Proc. INTERSPEECH2019, pp.266-270, Sept. 2019. [DOI] [Scopus]
Taira Tsuchiya, Naohiro Tawara, Tetsunori Kobayashi, Tetsuji Ogawa, ``Speaker invariant feature extraction for zero-resource languages with adversarial training,'' Proc. ICASSP2018, pp.2381-2385, April 2018. [DOI] [Scopus]
Lombard Speech Recognition
Investigating the impact of the Lombard effect on speech recognition performance
When speaking in a noisy environment, people commonly raise their voices and speak at a higher pitch, a phenomenon known as the Lombard effect. In this study, we investigate how the Lombard effect impacts speech recognition performance. When evaluating speech recognition systems in noisy environments, researchers typically take a dry-source recording of the speech, convolve it with the impulse response of the target environment, and superimpose noise to simulate real-life conditions. However, since dry-source recordings do not capture the Lombard effect, they may not accurately reflect recognition performance on speech actually produced in noise. To address this issue, we created a Lombard speech corpus containing recordings of speech under various noise types and noise levels.
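The simulation recipe the paragraph describes can be sketched directly. The signals below are tiny toy lists rather than real audio, and in practice one would use FFT-based convolution on sampled waveforms; only the convolve-then-add-noise-at-target-SNR procedure is the point.

```python
# Sketch of the standard noisy-speech simulation recipe: convolve a
# dry recording with a room impulse response, then add noise scaled
# to a target signal-to-noise ratio (SNR).  Toy signals only.
import math

def convolve(x, h):
    """Plain FIR convolution of signal x with impulse response h."""
    y = [0.0] * (len(x) + len(h) - 1)
    for i, xi in enumerate(x):
        for j, hj in enumerate(h):
            y[i + j] += xi * hj
    return y

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio is `snr_db`."""
    p_s = sum(s * s for s in speech) / len(speech)
    p_n = sum(n * n for n in noise[:len(speech)]) / len(speech)
    scale = math.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]

dry = [0.0, 1.0, 0.5, -0.5, -1.0, 0.0]       # toy dry-source speech
rir = [1.0, 0.3]                              # toy impulse response
noise = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1]
reverberant = convolve(dry, rir)
noisy = add_noise_at_snr(reverberant, noise, snr_db=10)
print(len(reverberant), len(noisy))
```

Because the dry source was produced in quiet, this pipeline cannot reproduce Lombard articulation, which is exactly why a corpus of speech actually recorded under noise is needed for faithful evaluation.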
Relevant Publications:
Tetsuji Ogawa, Takanobu Nishiura, Takeshi Yamada, Norihide Kitaoka, and Tetsunori Kobayashi, ``Development and evaluation of Japanese Lombard speech corpus,'' Proc. Internoise2011, Sept. 2011. [Scopus] [Invited talk in Special Session]
Tetsuji Ogawa, Tetsunori Kobayashi, ``Influence of Lombard effect: accuracy analysis of simulation-based assessments of noisy speech recognition systems for various recognition conditions,'' IEICE Trans. Inf. & Syst., vol.E92-D, no.11, pp.2244-2252, Nov. 2009. [IEICE] [Scopus]