Research Projects

Xi-vector Speaker Embedding

We present a Bayesian formulation for deep speaker embedding, in which the xi-vector is the Bayesian counterpart of the x-vector that takes frame-level uncertainty into account. It offers a straightforward extension to the now widely used x-vector by integrating the uncertainty modeling of the i-vector. Hence, we refer to the embedding as the xi-vector, pronounced /zai/ vector.
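As a minimal sketch of the idea, the pooling below performs Gaussian posterior inference over frames: each frame contributes a point estimate and a precision (inverse uncertainty), so uncertain frames are down-weighted. Function and variable names are illustrative, and the standard-normal prior is an assumption, not necessarily the paper's exact configuration.

```python
import torch

def xi_pooling(z, log_prec, prior_mean=None, prior_log_prec=None):
    """Gaussian posterior-inference pooling (sketch).

    z:        (T, D) frame-wise point estimates from the encoder
    log_prec: (T, D) frame-wise log-precisions (uncertainty estimates)
    Returns the posterior mean phi, used in place of plain mean pooling.
    """
    T, D = z.shape
    if prior_mean is None:
        prior_mean = torch.zeros(D)       # standard-normal prior (assumed)
    if prior_log_prec is None:
        prior_log_prec = torch.zeros(D)   # unit prior precision (assumed)

    prec = log_prec.exp()                 # (T, D) diagonal precisions
    prior_prec = prior_log_prec.exp()

    # Posterior precision = prior precision + sum of frame precisions
    post_prec = prior_prec + prec.sum(dim=0)
    # Posterior mean: precision-weighted average of the frames and the prior
    phi = (prior_prec * prior_mean + (prec * z).sum(dim=0)) / post_prec
    return phi

# Usage: frames with lower predicted precision contribute less to phi.
z = torch.randn(200, 512)
log_prec = torch.randn(200, 512)
phi = xi_pooling(z, log_prec)             # (512,)
```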

[Paper: SPL] [3rd-Party Code]

ASVspoof 2021

Like all other biometric systems, automatic speaker verification (ASV) is vulnerable to spoofing, primarily replay, speech synthesis, and voice conversion attacks. These vulnerabilities call for spoofing countermeasures, also known as presentation attack detection (PAD) systems. The ASVspoof challenge initiative was created to foster research on anti-spoofing and to provide common platforms for the assessment and comparison of spoofing countermeasures. I am proud to play a part in organizing the ASVspoof series of challenges.

[Challenge website: ASVspoof.org]

[The ASVspoof Papers: Attacker, Defender, t-DCF]
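For reference, the tandem detection cost function (t-DCF) mentioned above takes, in its simplified form, a weighted sum of the countermeasure's miss and false-alarm rates: t-DCF(tau) = C0 + C1 * Pmiss(tau) + C2 * Pfa(tau), where the constants fold in the ASV operating point, priors, and costs. The sketch below computes its minimum over thresholds; the cost constants and scores are placeholders, not values from any evaluation plan.

```python
import numpy as np

def min_tdcf(bona_scores, spoof_scores, C0, C1, C2):
    """Minimum of the simplified t-DCF over countermeasure thresholds.

    t-DCF(tau) = C0 + C1 * Pmiss_cm(tau) + C2 * Pfa_cm(tau).
    C0, C1, C2 are placeholders here; see the t-DCF paper for how they
    are derived from the ASV error rates, priors, and costs.
    """
    thresholds = np.sort(np.concatenate([bona_scores, spoof_scores]))
    p_miss = np.array([(bona_scores < t).mean() for t in thresholds])
    p_fa = np.array([(spoof_scores >= t).mean() for t in thresholds])
    return (C0 + C1 * p_miss + C2 * p_fa).min()

# Toy usage with synthetic scores and made-up cost constants.
rng = np.random.default_rng(0)
bona = rng.normal(1.0, 1.0, 1000)     # bona fide countermeasure scores
spoof = rng.normal(-1.0, 1.0, 1000)   # spoofed countermeasure scores
print(min_tdcf(bona, spoof, C0=0.01, C1=1.0, C2=1.5))
```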

Neural i-vector

Our attempt to boost the performance of i-vector embedding with deep learning. The i-vector is widely used in speaker recognition, speech recognition, speech synthesis, and other speech applications. Our results show that the neural i-vector outperforms existing i-vector variants by a wide margin, highlighting the importance of speaker-informative short-term features and a speaker-informative dictionary.
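As background for what the neural components replace, here is a sketch of the classical i-vector point estimate computed from Baum-Welch statistics; in the neural i-vector, the features and frame alignments behind these statistics come from a speaker-discriminative network, while the posterior formula stays the same. Shapes and names are illustrative.

```python
import numpy as np

def extract_ivector(N, F, T, Sigma):
    """Classical i-vector point estimate from Baum-Welch statistics (sketch).

    The mean supervector is modeled as m + T w; the i-vector is the
    posterior mean of w given the statistics.

    N:     (C,)     zeroth-order stats (soft counts per component)
    F:     (C, D)   centered first-order stats
    T:     (C*D, R) total-variability matrix (the "dictionary")
    Sigma: (C, D)   diagonal covariances of the components
    """
    C, D = F.shape
    R = T.shape[1]
    T = T.reshape(C, D, R)
    precision = 1.0 / Sigma                      # (C, D)

    # Posterior precision: L = I + sum_c N_c T_c' Sigma_c^{-1} T_c
    L = np.eye(R)
    for c in range(C):
        Tc = T[c]                                # (D, R)
        L += N[c] * Tc.T @ (precision[c, :, None] * Tc)

    # Posterior mean: w = L^{-1} sum_c T_c' Sigma_c^{-1} F_c
    b = np.zeros(R)
    for c in range(C):
        b += T[c].T @ (precision[c] * F[c])
    return np.linalg.solve(L, b)
```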

[Paper: Odyssey 2020] [Code: Pytorch]

GPU i-vector

Unleashing the untapped potential of i-vectors through GPU acceleration! We achieve a 25-fold speed-up over the CPU baseline, which allows the exploration of ideas that were hitherto impractical. In particular, we found it beneficial to update the universal background model (UBM) and re-compute the frame alignments while training the i-vector extractor.
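The hot spot being accelerated is the accumulation of Baum-Welch statistics. Below is a rough PyTorch sketch of the batched, GPU-friendly formulation, assuming a diagonal-covariance UBM; names are illustrative and this is not the released code.

```python
import math
import torch

def baum_welch_stats(X, means, covs, weights):
    """Accumulate Baum-Welch statistics for all frames at once (sketch).

    X:       (T, D) feature frames
    means:   (C, D) UBM component means
    covs:    (C, D) diagonal covariances
    weights: (C,)   component weights
    The per-frame loop of a CPU implementation becomes one batched
    log-density evaluation, which is where the speed-up comes from.
    """
    diff = X[:, None, :] - means[None, :, :]             # (T, C, D)
    log_dens = -0.5 * ((diff ** 2 / covs).sum(-1)        # (T, C)
                       + covs.log().sum(-1)
                       + X.shape[1] * math.log(2 * math.pi))
    log_post = torch.log_softmax(log_dens + weights.log(), dim=1)
    post = log_post.exp()                                # frame alignments
    N = post.sum(dim=0)                                  # (C,)   zeroth order
    F = post.T @ X - N[:, None] * means                  # (C, D) centered first order
    return N, F
```

Re-computing the frame alignments during extractor training amounts to re-running this accumulation with the updated UBM parameters.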

[Paper: Interspeech 2019] [Code: Pytorch]

CORAL+

Domain adaptation in action! It is common for the domain (e.g., language, demographics) in which a speaker recognition system is deployed to differ from the one in which it was trained. CORAL+ was designed to bridge this gap: it is an unsupervised adaptation algorithm that learns from a small amount of unlabeled in-domain data.
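For intuition, the basic CORAL recipe whitens out-of-domain data and re-colors it with the in-domain covariance; CORAL+ applies a regularized variant of this idea to the PLDA model parameters rather than to raw embeddings. The sketch below shows only the basic transform, not the CORAL+ algorithm itself.

```python
import numpy as np

def coral_transform(X_out, X_in, eps=1e-6):
    """Classic CORAL: align out-of-domain covariance to in-domain (sketch).

    X_out: (N, D) out-of-domain embeddings (labeled, used for training)
    X_in:  (M, D) unlabeled in-domain embeddings
    Mean handling here (center, then restore) is a simplification.
    """
    def sqrtm(C):
        # Symmetric matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(C)
        return (vecs * np.sqrt(np.clip(vals, 0, None))) @ vecs.T

    def inv_sqrtm(C):
        vals, vecs = np.linalg.eigh(C)
        return (vecs / np.sqrt(np.clip(vals, eps, None))) @ vecs.T

    C_out = np.cov(X_out, rowvar=False) + eps * np.eye(X_out.shape[1])
    C_in = np.cov(X_in, rowvar=False) + eps * np.eye(X_in.shape[1])
    # Whiten with the out-of-domain covariance, re-color with in-domain
    mu = X_out.mean(0)
    return (X_out - mu) @ inv_sqrtm(C_out) @ sqrtm(C_in) + mu
```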

[Paper: ICASSP 2019] [Presentation: Slide] [3rd-Party Code]

Attention in speaker embedding - What does it learn?

The attention mechanism has proven to be a powerful representation learning technique in deep speaker embedding. It is intuitive to conjecture that frames receiving higher attention weights correspond to certain phonetic classes (e.g., vowels) that are more effective for discriminating among speakers. Another line of thought suggests that the attention weights may simply reflect a speech versus non-speech distinction.
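A typical setup in which this question arises is attentive statistics pooling, sketched below (a common design, not necessarily the exact architecture studied in the paper); the learned frame weights are what we inspect.

```python
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    """Attentive statistics pooling (sketch).

    A small network scores each frame; the softmax-normalized scores
    weight the mean and standard deviation over time, so plotting them
    against phone or speech/non-speech labels shows which frames the
    embedding relies on.
    """
    def __init__(self, feat_dim, bottleneck=128):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, bottleneck),
            nn.Tanh(),
            nn.Linear(bottleneck, 1),
        )

    def forward(self, h):                                # h: (B, T, D)
        w = torch.softmax(self.attention(h), dim=1)      # (B, T, 1)
        mean = (w * h).sum(dim=1)                        # (B, D)
        var = (w * h.pow(2)).sum(dim=1) - mean.pow(2)
        std = var.clamp_min(1e-8).sqrt()
        return torch.cat([mean, std], dim=1), w.squeeze(-1)

# Usage: embedding plus the per-frame attention weights for analysis.
pool = AttentiveStatsPooling(256)
emb, weights = pool(torch.randn(4, 300, 256))            # emb: (4, 512)
```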

[Paper: SLT 2018]

Project RedDots

Text-dependent speaker recognition has long been regarded as the solution to short-duration voice authentication, i.e., speaker recognition under conditions where test utterances are short and of variable phonetic content. The ultimate solution lies in the efficient segregation and factorization of spoken utterances into their constituent components, which remains a challenging problem.

[Paper: Interspeech 2015, Platform] [Dataset: RedDots]