Speaker Recognition
Speaker recognition determines the identity of the speaker in an input speech signal (i.e., who is speaking). When the technology also identifies when each person speaks (i.e., who spoke and when), it is referred to as speaker diarization. We are working on enhancing these technologies by employing pattern recognition and machine learning techniques.
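As a minimal illustration of closed-set speaker identification, a test utterance's embedding can be compared against each enrolled speaker's embedding by cosine similarity, choosing the most similar one. The sketch below uses invented 3-dimensional toy vectors and names; it is not our system, which uses learned high-dimensional embeddings.

```python
import numpy as np

def identify_speaker(enrolled, test_emb):
    """Closed-set identification: return the enrolled speaker whose
    embedding is most similar (cosine) to the test embedding."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(enrolled, key=lambda spk: cos(enrolled[spk], test_emb))

# Toy 3-dimensional "embeddings" for illustration only; real systems
# use learned vectors with hundreds of dimensions.
enrolled = {
    "alice": np.array([1.0, 0.1, 0.0]),
    "bob":   np.array([0.0, 1.0, 0.2]),
}
test = np.array([0.9, 0.2, 0.0])
print(identify_speaker(enrolled, test))  # -> alice
```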
Speaker Identification Using Crowdsourcing
Methodologies to improve the accuracy of speaker identification by utilizing crowdsourcing
Our current research explores crowdsourcing methodologies for efficiently annotating speech data, which simplifies the development of speaker recognition systems and improves the accuracy of speaker identification. To this end, we use Tutti, a framework for running crowdsourcing tasks on Amazon Mechanical Turk.
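One simple way crowd annotations can be aggregated is by majority vote over the workers' judgments on each item. The sketch below is a generic illustration with invented labels and function names; it is not the Tutti API.

```python
from collections import Counter

def majority_label(worker_labels):
    """Aggregate several workers' judgments on one item by majority vote,
    returning the winning label and its share of the votes."""
    counts = Counter(worker_labels)
    label, n = counts.most_common(1)[0]
    return label, n / len(worker_labels)

# Three workers judge whether two audio clips come from the same speaker.
votes = ["same", "same", "different"]
print(majority_label(votes))  # winning label 'same' with 2/3 of the votes
```

The vote share can serve as a rough confidence score, e.g., to route low-agreement items to additional workers.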
Relevant Publications:
Yuta Ide, Susumu Saito, Teppei Nakano, Tetsuji Ogawa, ``Can humans correct errors from system? Investigating error tendencies in speaker identification using crowdsourcing,'' Proc. The 23rd Annual Conference of the International Speech Communication Association (INTERSPEECH2022), Sept. 2022. [DOI] [Scopus]
Susumu Saito, Yuta Ide, Teppei Nakano, Tetsuji Ogawa, ``VocalTurk: Exploring feasibility of crowdsourced speaker identification,'' Proc. The 22nd Annual Conference of the International Speech Communication Association (INTERSPEECH2021), pp.1723-1727, Aug. 2021. [DOI] [Scopus]
Robust Feature Representation Learning for Speaker Identification
Techniques aimed at disentangling the complex mixture of information about speech content and speaker identity
Speech carries both the content of the utterance (i.e., what is being said) and information about the speaker (i.e., who is speaking). Typically, speech recognition technology struggles with variation in speaker identity, while speaker recognition technology struggles with variation in speech content. We are therefore developing techniques that disentangle this complex mixture of content and speaker information, enabling speaker identification even from short utterances.
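To illustrate the idea of disentanglement in a toy setting, suppose each utterance embedding is a speaker component plus a content component lying in a known subspace; projecting that subspace out leaves only the speaker information. This is a linear caricature (real systems learn the separation, e.g., through phoneme-invariant training), and all vectors below are invented.

```python
import numpy as np

# Toy setup: each utterance embedding mixes a speaker component with a
# content (phoneme) component along a known direction. In practice the
# content subspace would have to be estimated; here it is simply given.
content_dir = np.array([0.0, 0.0, 1.0])

def remove_content(emb, content_basis):
    """Project the embedding onto the orthogonal complement of the
    content subspace, leaving (ideally) only speaker information."""
    B = np.atleast_2d(content_basis)   # rows span the content subspace
    P = B.T @ np.linalg.pinv(B.T)      # orthogonal projector onto it
    return emb - P @ emb

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

spk_a = np.array([1.0, 0.0, 0.0])      # "speaker A" component
u1 = spk_a + 0.9 * content_dir         # speaker A saying one thing
u2 = spk_a - 0.8 * content_dir         # speaker A saying another

print(cos(u1, u2))                     # low: content masks the speaker
print(cos(remove_content(u1, content_dir),
          remove_content(u2, content_dir)))  # 1.0: same speaker recovered
```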
Relevant Publications:
Naohiro Tawara, Atsunori Ogawa, Tomoharu Iwata, Marc Delcroix, Tetsuji Ogawa, ``Frame-level phoneme-invariant speaker embedding for text-independent speaker recognition on extremely short utterances,'' Proc. 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2020), pp.6799-6803, May 2020. [DOI] [Scopus]
Modeling for Speaker Clustering
Relevant Publications:
Naohiro Tawara, Tetsuji Ogawa, Shinji Watanabe, Atsushi Nakamura, Tetsunori Kobayashi, ``A sampling-based speaker clustering using utterance-oriented Dirichlet process mixture model and its evaluation on large-scale data,'' APSIPA Trans. Signal & Info. Process., vol.4, Oct. 2015. [DOI] [Scopus]
Naohiro Tawara, Tetsuji Ogawa, Tetsunori Kobayashi, ``A comparative study of spectral clustering for i-vector-based speaker clustering under noisy conditions,'' Proc. ICASSP2015, pp.2041-2045, April 2015. [DOI] [Scopus]
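Spectral clustering of i-vector-style utterance representations can be sketched generically as: build a cosine affinity matrix over the utterance vectors, form the normalized graph Laplacian, embed each utterance via the Laplacian's leading eigenvectors, and run k-means in that spectral space. The code below is a self-contained numpy illustration on invented toy data, not the system evaluated in the papers above.

```python
import numpy as np

def spectral_cluster(X, k, iters=50):
    """Spectral clustering: cosine affinity -> normalized Laplacian ->
    k smallest eigenvectors -> k-means in the spectral space."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(Xn @ Xn.T, 0.0, None)     # cosine affinity; negatives cut to 0
    np.fill_diagonal(A, 0.0)
    d = np.maximum(A.sum(axis=1), 1e-12)
    Dinv = np.diag(d ** -0.5)
    L = np.eye(len(X)) - Dinv @ A @ Dinv  # normalized graph Laplacian
    _, vecs = np.linalg.eigh(L)           # eigenvalues in ascending order
    E = vecs[:, :k]                       # spectral embedding of utterances
    E = E / np.maximum(np.linalg.norm(E, axis=1, keepdims=True), 1e-12)
    # Plain k-means with deterministic farthest-point initialization.
    centers = [E[0]]
    for _ in range(1, k):
        dist2 = np.min([((E - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(E[int(np.argmax(dist2))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmin(((E[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = E[labels == j].mean(axis=0)
    return labels

# Toy "i-vectors": two well-separated speakers, three utterances each.
rng = np.random.default_rng(1)
spk1 = np.array([1.0, 0.0]) + 0.05 * rng.standard_normal((3, 2))
spk2 = np.array([0.0, 1.0]) + 0.05 * rng.standard_normal((3, 2))
labels = spectral_cluster(np.vstack([spk1, spk2]), k=2)
print(labels)  # first three utterances share one label, last three the other
```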