Research
Research Interests
Model Compression via Knowledge Distillation/Pruning/Quantization
Self-Supervised Learning in Speech and Audio-Visual Modality
SpeechLLM
[1] Kim, S., Cho, S., Bae, S., Jang, K., & Yun, S., “Multi-Task Corrupted Prediction for Learning Robust Audio-Visual Speech Representation,” To appear in Proc. ICLR, 2025.
[2] Kim, M., Jang, K., & Kim, H., “Improving Cross-Lingual Phonetic Representation of Low-Resource Languages Through Language Similarity Analysis,” To appear in Proc. ICASSP, IEEE, 2025.
[3] Kim, S.*, Jang, K.*, Bae, S., Kim, H., & Yun, S., “Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition,” in Proc. SLT Workshop, IEEE, 2024.
(To be published. See you in Macau.)
Our paper demonstrates the effectiveness of enhancing visual information for the audio-visual speech recognition (AVSR) task. We propose three loss functions related to the temporal dynamics of video data, as shown in the figure.
We adopt a cross-modal attention structure when fusing the two modalities, so that each modality references the information of the other through the attention mechanism.
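As a rough illustration (a minimal sketch, not our released implementation; module and dimension names are hypothetical), cross-modal attention fusion in PyTorch could look like this:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Each modality attends to the other: audio queries attend to
    video keys/values, and vice versa."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.audio_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio, video):
        # audio, video: (batch, time, dim) frame-level features
        a_fused, _ = self.audio_to_video(query=audio, key=video, value=video)
        v_fused, _ = self.video_to_audio(query=video, key=audio, value=audio)
        # One simple fusion choice: residual add, then channel-wise concat
        return torch.cat([audio + a_fused, video + v_fused], dim=-1)
```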
[4] Kim, H., Jang, K., & Kim, H., “One-Class Learning with Adaptive Centroid Shift for Audio Deepfake Detection,” in Proc. Interspeech, ISCA, 2024. (Oral) [Paper]
[5] Jang, K., Kim, S., & Kim, H., “STaR: Distilling Speech Temporal Relation for Lightweight Speech Self-Supervised Learning Models,” in Proc. ICASSP, IEEE, 2024. (Best Student Paper Award.) 😄🎉
Our paper proposes distilling the speech temporal relation rather than the direct representations of the teacher model. Three objectives are explored: the average attention map, the layer-wise temporal Gram matrix (TGM), and the intra-layer TGM. The TGM is computed by aggregating over the channel (embedding) dimension, so the resulting matrix has shape sequence length by sequence length.
After further analysis, the STaR loss is defined as the combination of layer-wise TGM and intra-layer TGM distillation. Distilled from HuBERT Base, our STaRHuBERT outperforms both ARMHuBERT and DPHuBERT.
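For illustration, here is a minimal sketch of how a TGM can be computed in PyTorch; the normalization and the layer pairing below are simplifying assumptions for readability, and the paper defines the exact form:

```python
import torch

def temporal_gram(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """x, y: (batch, time, channels). Aggregating over the channel
    dimension yields a (batch, time, time) temporal Gram matrix."""
    return torch.bmm(x, y.transpose(1, 2)) / x.size(-1)

x_l = torch.randn(2, 100, 768)     # representation at layer l
x_next = torch.randn(2, 100, 768)  # representation at layer l+1

intra_tgm = temporal_gram(x_l, x_l)         # intra-layer TGM
layerwise_tgm = temporal_gram(x_l, x_next)  # layer-wise (cross-layer) TGM
# Distillation then matches student TGMs to teacher TGMs, e.g. with an L2 loss.
```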
[6] Jang, K.*, Kim, S.*, Yun, S. Y., & Kim, H., “Recycle-and-Distill: Universal Compression Strategy for Transformer-based Speech SSL Models with Attention Map Reusing and Masking Distillation,” in Proc. Interspeech, ISCA, 2023.
(Also presented at NeurIPS 2023 Workshop: Self-Supervised Learning Theory and Practice. [Link])
Our paper proposes distilling speech self-supervised learning models through a novel masking distillation, which treats masked and unmasked frames separately.
In addition, we apply attention map reusing to further reduce the number of parameters and MACs in the student model: by reusing attention maps from an earlier layer, we can omit the attention map calculations, along with the key and query projections.
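A rough PyTorch sketch of both ideas, with hypothetical names (the per-frame loss weights and projection details are assumptions, not the released implementation):

```python
import torch
import torch.nn as nn

def masking_distillation_loss(student, teacher, mask, w_masked=1.0, w_unmasked=1.0):
    """student, teacher: (batch, time, dim); mask: (batch, time) bool,
    True where the input frames were masked. Masked and unmasked frames
    are distilled separately, each with its own weight."""
    frame_l2 = (student - teacher).pow(2).mean(dim=-1)
    return w_masked * frame_l2[mask].mean() + w_unmasked * frame_l2[~mask].mean()

class ReusedAttention(nn.Module):
    """Attention layer that reuses a previous layer's attention map:
    no query/key projections are needed, only value and output."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, dim // num_heads
        self.v_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, reused_attn):
        # x: (batch, time, dim); reused_attn: (batch, heads, time, time)
        b, t, _ = x.shape
        v = self.v_proj(x).view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = torch.matmul(reused_attn, v)  # apply the reused attention map
        return self.out_proj(out.transpose(1, 2).reshape(b, t, -1))
```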
[7] Lee, Y.*, Jang, K.*, Goo, J., Jung, Y., & Kim, H., “FitHuBERT: Going Thinner and Deeper for Knowledge Distillation of Speech Self-Supervised Learning,” in Proc. Interspeech, ISCA, 2022.
Our paper suggests reducing the channel (embedding) dimension of the student model rather than reducing the number of Transformer layers. We also break down the bottleneck structures in the Transformer, since the model processes short 1D inputs.
Keeping the number of layers the same, we implement layer-wise knowledge distillation, which distills the intermediate layers. A time-reduction layer is also adopted to shorten the sequence length.
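One common way to realize a time-reduction layer is to concatenate adjacent frames and project them back to the model dimension; a minimal sketch under that assumption (the paper's exact realization may differ):

```python
import torch
import torch.nn as nn

class TimeReduction(nn.Module):
    """Halves (or more generally divides by `stride`) the sequence length
    by concatenating adjacent frames and projecting back to `dim`."""
    def __init__(self, dim=768, stride=2):
        super().__init__()
        self.stride = stride
        self.proj = nn.Linear(dim * stride, dim)

    def forward(self, x):
        # x: (batch, time, dim); drop trailing frames not divisible by stride
        b, t, d = x.shape
        t = (t // self.stride) * self.stride
        x = x[:, :t].reshape(b, t // self.stride, d * self.stride)
        return self.proj(x)  # sequence is now `stride` times shorter
```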
[Domestic Conference]
[1] Jang, K., Shin, B., Kim, G., & Kim, H., “Analysis of Speech Tasks Advantageous for Screening Alzheimer-type Dementia and Mild Cognitive Impairment,” in Proc. Conference on Electronics and Information Communications (CEIC), IEEE, 2023. (Best Paper Award.) 😄🎉
Our paper explores which speech tasks play a crucial role in diagnosing Alzheimer's disease and mild cognitive impairment. The results show that training a classifier using only speech recorded during the mini-mental state examination leads to better results than utilizing all speech tasks, including free talking and sentence repetition.
[2] Goo, J., Jung, M., Jang, K., & Kim, H., “Unsupervised Domain Adaptation of Speech Recognition for Pseudo-military Environment,” in Proc. Korea Institute of Military Science and Technology, 2022.
Projects
[1] Compressing Speech Self-Supervised Learning Models for Automatic Speech Recognition.
May 2024 ~ Dec 2024.
[2] Development of Speech Sample Collection, Analysis, and Machine Learning-based Diagnosis Technology for Cognitive Disorders.
Jan 2021 ~ Dec 2023. With Jeonbuk University.
[3] Development of Speech Recognition/Synthesis Module Prototype.
Sep 2020 ~ Feb 2021. With Com2us.