Speech and Language AI
Understanding the Role of Self-attention for Efficient Speech Recognition [paper]
Kyuhong Shim, Jungwook Choi, Wonyong Sung
ICLR 2022 Spotlight
In this paper, we reveal that self-attention in speech recognition models performs two distinct roles: phonetic and linguistic localization. In particular, we show that phonetic localization in the lower layers extracts phonologically meaningful features from speech and reduces the phonetic variance in the utterance, preparing it for linguistic localization in the upper layers. From this understanding, we find that attention maps can be reused across layers as long as their localization capability is preserved. We propose a layer-wise attention map reuse technique and achieve a 2x speedup without performance loss.
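The core mechanism can be pictured in a few lines: a layer skips its own query-key computation and instead applies an attention map produced by an earlier layer to its own value projections. Below is a minimal PyTorch sketch of this idea; class and argument names are illustrative, not from the paper.

```python
import torch
import torch.nn as nn

class ReuseAttention(nn.Module):
    """Self-attention that can optionally reuse a precomputed attention map."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)

    def _split(self, t):
        B, T, _ = t.shape
        return t.view(B, T, self.num_heads, self.head_dim).transpose(1, 2)

    def forward(self, x, reused_attn=None):
        B, T, D = x.shape
        v = self._split(self.v_proj(x))
        if reused_attn is None:
            q, k = self._split(self.q_proj(x)), self._split(self.k_proj(x))
            attn = (q @ k.transpose(-2, -1) / self.head_dim ** 0.5).softmax(dim=-1)
        else:
            attn = reused_attn  # reuse: no Q/K projection, no score matmul
        y = (attn @ v).transpose(1, 2).reshape(B, T, D)
        return self.out(y), attn
```

Layers grouped together then share one map, e.g. `y1, attn = layer1(x)` followed by `y2, _ = layer2(y1, reused_attn=attn)`, which is where the speedup comes from.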
Leveraging Adapter for Parameter-Efficient ASR Encoder [paper]
Kyuhong Shim, Jinkyu Lee, Hyunjae Kim
INTERSPEECH 2024
We propose a novel architecture for efficient speech recognition that combines parameter sharing with a dedicated adapter module for Conformer-based ASR encoders. The proposed model reduces parameters by approximately 50% and computation by approximately 20%.
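A plausible minimal sketch of the combination, with illustrative names: one block's weights are tied across all depths (parameter sharing), while each depth keeps its own lightweight bottleneck adapter to recover depth-specific behavior. A standard Transformer layer stands in for the Conformer block here.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Small bottleneck adapter: down-project, nonlinearity, up-project."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))  # residual adapter

class SharedEncoder(nn.Module):
    """One shared block applied at every depth, plus per-depth adapters."""
    def __init__(self, dim: int, num_heads: int, num_layers: int):
        super().__init__()
        # stand-in for a Conformer block; its weights are shared across depths
        self.shared_block = nn.TransformerEncoderLayer(
            dim, nhead=num_heads, batch_first=True)
        self.adapters = nn.ModuleList(Adapter(dim) for _ in range(num_layers))

    def forward(self, x):
        for adapter in self.adapters:
            x = adapter(self.shared_block(x))
        return x
```

Parameter count then grows as one shared block plus L small adapters instead of L full blocks.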
Knowledge Distillation from Non-streaming to Streaming ASR Encoder using Auxiliary Non-streaming Layer [paper]
Kyuhong Shim, Jinkyu Lee, Simyung Chang, Kyuwoong Hwang
INTERSPEECH 2023
To improve the performance of streaming ASR, we propose a layer-to-layer knowledge distillation (KD) from the non-streaming teacher's encoder to the streaming student's encoder, bridged by an auxiliary non-streaming layer.
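The objective can be sketched as matching the student's per-layer hidden states to the teacher's. A minimal version of such a layer-to-layer loss (tensor names and the MSE choice are assumptions) looks like:

```python
import torch.nn.functional as F

def layerwise_kd_loss(student_hiddens, teacher_hiddens):
    """MSE between corresponding encoder layer outputs.

    student_hiddens / teacher_hiddens: lists of [B, T, D] tensors,
    one per encoder layer, assumed time-aligned.
    """
    loss = 0.0
    for s, t in zip(student_hiddens, teacher_hiddens):
        loss = loss + F.mse_loss(s, t.detach())  # teacher is frozen
    return loss / len(student_hiddens)
```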
Efficient Deep Learning Training and Inference
InfiniPot: Infinite Context Processing on Memory-Constrained LLMs [paper]
Minsoo Kim, Kyuhong Shim, Jungwook Choi*, Simyung Chang*
EMNLP 2024
Handling long input contexts remains a significant challenge for Large Language Models (LLMs), particularly in resource-constrained environments such as mobile devices. We address this limitation with InfiniPot, a novel KV cache control framework that enables pre-trained LLMs to efficiently manage extensive sequences within fixed memory constraints, without requiring additional training.
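To make the idea concrete, the sketch below shows the generic shape of budget-constrained KV cache control: once the cache exceeds a fixed budget, only the most important entries are kept. The importance score here is a placeholder, not InfiniPot's actual compression criterion.

```python
import torch

def evict_kv(keys, values, scores, budget: int):
    """Keep only the `budget` most important cache entries.

    keys/values: [T, D] cached keys and values; scores: [T] importance
    per cached token (e.g., accumulated attention; a stand-in here).
    """
    T = keys.shape[0]
    if T <= budget:
        return keys, values
    keep = scores.topk(budget).indices.sort().values  # preserve token order
    return keys[keep], values[keep]
```

Because eviction runs whenever the cache fills, memory stays bounded no matter how long the input grows.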
Crayon: Customized On-Device LLM via Instant Adapter Blending and Edge-Server Hybrid Inference [paper]
Jihwan Bang*, Juntae Lee*, Kyuhong Shim, Seunghan Yang, Simyung Chang
ACL 2024
Customizing large language models (LLMs) for user-specified tasks is becoming increasingly important, yet the performance of on-device LLMs is inherently constrained by the limitations of small-scale models. To overcome these restrictions, we propose Crayon, a novel approach to on-device LLM customization that 1) instantly blends base adapters into a customized adapter without extra training, and 2) adopts device-server hybrid inference.
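The training-free blending step (1) can be sketched as a weighted combination of a pool of pre-trained base adapters, with blend weights derived from the user's data. The function below is a schematic; its names and the weighting rule are not from the paper.

```python
import torch

def blend_adapters(base_adapters, blend_weights):
    """Combine base adapter state dicts into one customized adapter.

    base_adapters: list of {param_name: tensor} dicts with matching shapes.
    blend_weights: 1-D tensor, one weight per base adapter (sums to 1).
    """
    blended = {}
    for name in base_adapters[0]:
        stacked = torch.stack([a[name] for a in base_adapters])   # [N, ...]
        w = blend_weights.view(-1, *([1] * (stacked.dim() - 1)))  # broadcast
        blended[name] = (w * stacked).sum(dim=0)
    return blended
```

Since only a weighted sum is computed, customization is instant on-device; harder queries can then fall back to the server model.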
Multimodal (Speech, Language, and Vision) AI
Visually Guided Decoding: Gradient-free Hard Prompt Inversion with Language Models [paper]
Donghoon Kim, Minji Bae, Kyuhong Shim*, Byonghyo Shim*
ICLR 2025
We introduce Visually Guided Decoding (VGD), a gradient-free approach that leverages LLMs and CLIP-based guidance to generate coherent and semantically aligned prompts for text-to-image (T2I) generative models.
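A schematic of one decoding step under this recipe: the LLM proposes top-k next tokens, each candidate is scored by CLIP image-text similarity, and the two signals are mixed without any gradient computation. The function and weighting details here are assumptions.

```python
import torch

def vgd_step(lm_logits, clip_scores, alpha: float = 1.0, top_k: int = 50):
    """Pick the next token by mixing LM likelihood with CLIP guidance.

    lm_logits:   [V] next-token logits from the language model.
    clip_scores: [V] CLIP similarity of "prompt-so-far + token" vs. the
                 target image, assumed precomputed for the candidates.
    """
    log_probs = lm_logits.log_softmax(dim=-1)
    cand = lm_logits.topk(top_k).indices              # fluent candidates only
    combined = torch.full_like(lm_logits, float("-inf"))
    combined[cand] = log_probs[cand] + alpha * clip_scores[cand]
    return combined.argmax()
```

Only forward passes of the LM and CLIP are needed and no embedding is optimized, which is what makes the approach gradient-free.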
Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP [paper]
Eunji Kim, Kyuhong Shim, Simyung Chang, Sungroh Yoon
EMNLP 2024 Findings
We propose a framework of Semantic Token Reweighting to build Interpretable text embeddings (SToRI), facilitating interpretive analysis of vision tasks through natural language. SToRI can be used for few-shot image classification, data-driven insight extraction, and image retrieval tailored to user preferences.
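One way to picture the reweighting, as a sketch under assumptions (CLIP's actual text readout uses the EOS token; a weighted pooling stands in for it here): per-token features are pooled with user-controlled emphasis weights, so stressed words contribute more to the final embedding.

```python
import torch

def reweighted_text_embedding(token_feats, weights):
    """Pool token features into one text embedding with semantic weights.

    token_feats: [T, D] per-token features from the text encoder.
    weights:     [T] non-negative emphasis per token (e.g., 2.0 to
                 stress a word, 1.0 neutral); normalized before pooling.
    """
    w = weights / weights.sum()
    emb = (w.unsqueeze(-1) * token_feats).sum(dim=0)  # [D]
    return emb / emb.norm()                           # unit-normalize, CLIP-style
```

Raising a token's weight makes the embedding, and hence classification or retrieval, lean toward that concept, which is what makes the embedding controllable and its components interpretable.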