Se Jin Park

I'm a final-year Ph.D. student at KAIST, Integrated Vision and Language Lab (IVLLab), advised by Professor Yong Man Ro. My research focuses on advancing multimodal human-AI interactions—integrating audio, vision, and text—based on Large Language Models. Specifically, I have worked on multimodal integration, unbounded generation, generation fidelity, and full-duplex behaviors for realistic and engaging human-AI dialogues.

Email: jinny960812@kaist.ac.kr

CV | Google Scholar | LinkedIn

Work Experience

Research Scientist Intern at Meta (Meta GenAI, Llama Speech)

Seattle, WA (June 2025 - Present). Supervised by Jinxi Guo and Naoyuki Kanda.

Student Researcher at Google DeepMind (Foundational Research Unit)

Remote (Oct 2024 - Dec 2024). Supervised by Julian Salazar and Keisuke Kinoshita. Paper: Long-Form Speech Generation with Spoken Language Models [paper | demo | data]

Student Researcher at Google DeepMind (Foundational Research Unit)

Mountain View, CA (July 2024 - Oct 2024). Supervised by Julian Salazar and Aren Jansen.

Publications

<C: Conference, J: Journal, P: Preprint, ': Primary Contributors, *: Equal Contributors>
2025
[P5] SemProTokenizer: Streamlined Dual-Branch Speech Tokenizer for Spoken Language Models. Se Jin Park, Bella Godiva, Jeonghun Yeo, Junil Won, and Yong Man Ro. Under Review
[C11] Long-Form Speech Generation with Spoken Language Models. Se Jin Park', Julian Salazar', Aren Jansen, Keisuke Kinoshita, Yong Man Ro, and RJ Skerry-Ryan. International Conference on Machine Learning (ICML), Oral Presentation, 2025, [paper | demo | data]
[C10] MMS-LLaMA: Efficient LLM-based Audio-Visual Speech Recognition with Minimal Multimodal Speech Tokens. Jeonghun Yeo*, Hyeongseop Rha*, Se Jin Park, and Yong Man Ro. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL Findings), 2025, [paper | code]
[P4] Empathetic Response in Audio-Visual Conversations Using Emotion Preference Optimization and MambaCompressor. Yeonju Kim, Se Jin Park, and Yong Man Ro. arXiv Preprint, 2025, [paper]
[P3] AV-EmoDialog: Chat with Audio-Visual Users Leveraging Emotional Cues. Se Jin Park, Yeonju Kim, Hyeongseop Rha, Bella Godiva, and Yong Man Ro. arXiv Preprint, 2025, [paper]

2024
[C9] Efficient Training for Multilingual Visual Speech Recognition: Pre-training with Discretized Visual Speech Representation. Minsu Kim*, Jeonghun Yeo*, Se Jin Park, Hyeongseop Rha, and Yong Man Ro. The Association for Computing Machinery's Annual Conference on Multimedia (ACMMM), 2024, [paper]
[C8] Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation. Se Jin Park*, Chae Won Kim*, Hyeongseop Rha, Minsu Kim, Joanna Hong, Jeonghun Yeo, and Yong Man Ro. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), Oral Presentation, 2024, [paper | data | demo]. Received Outstanding Paper Award
[C7] AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation. Jeongsoo Choi*, Se Jin Park*, Minsu Kim*, and Yong Man Ro. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Highlight Presentation, 2024, [paper | demo | code]
[C6] Persona Extraction Through Semantic Similarity For Emotional Support Conversation Generation. Seunghee Han, Se Jin Park, Chae Won Kim, and Yong Man Ro. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, [paper]
[C5] Exploring Phonetic Context in Lip Movement for Authentic Talking Face Generation. Se Jin Park, Minsu Kim, Jeongsoo Choi, and Yong Man Ro. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, [paper | demo]
[C4] Reprogramming Audio-driven Talking Face Synthesis into Text-driven. Jeongsoo Choi, Minsu Kim, Se Jin Park, and Yong Man Ro. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, [paper | demo]

2023
[C3] Intuitive Multilingual Audio-Visual Speech Recognition with a Single-Trained Model. Joanna Hong, Se Jin Park, and Yong Man Ro. Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP Findings), 2023, [paper]
[P2] DF-3DFace: One-to-Many Speech Synchronized 3D Facial Animation with Diffusion. Se Jin Park, Joanna Hong, Minsu Kim, and Yong Man Ro. arXiv Preprint, 2023, [paper]

2022
[C2] SyncTalkFace: Talking Face Generation with Precise Lip-syncing via Audio-Lip Memory. Se Jin Park, Minsu Kim, Joanna Hong, Jeongsoo Choi, and Yong Man Ro. AAAI Conference on Artificial Intelligence (AAAI), Oral Presentation, 2022, [paper]
[P1] Test-time Adaptation for Real Image Denoising via Meta-transfer Learning. Agus Gunawan, Muhammad Adi Nugroho, and Se Jin Park. arXiv Preprint, 2022, [paper]

2021
[C1] Multi-Modality Associative Bridging Through Memory: Speech Sound Recollected From Face Video. Minsu Kim*, Joanna Hong*, Se Jin Park, and Yong Man Ro. IEEE/CVF International Conference on Computer Vision (ICCV), 2021, [paper]
[J1] Speech Reconstruction with Reminiscent Sound via Visual Voice Memory. Joanna Hong, Minsu Kim, Se Jin Park, and Yong Man Ro. IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), 2021, [paper]
[J2] Cromm-vsr: Cross-modal Memory Augmented Visual Speech Recognition. Minsu Kim, Joanna Hong, Se Jin Park, and Yong Man Ro. IEEE Transactions on Multimedia (TMM), 2021, [paper]

Honors & Awards

Outstanding Paper Award (2024)
Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL)
National Government Fellowship (2022-Present)
KAIST Fellowship (2020-2022)
National Science and Engineering Scholarship for Undergraduate Students (2015-2020)

Academic Services

Program Committee & Conference Reviewer

AAAI Conference on Artificial Intelligence (AAAI) (2023, 2024, 2025)
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024, 2025)
IEEE/CVF International Conference on Computer Vision (ICCV) (2025)
Audio-Visual Generation and Learning Workshop (ECCV) (2024)
The Association for Computing Machinery's Annual Conference on Multimedia (ACMMM) (2025)
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2025)

Journal Reviewer

IEEE Transactions on Multimedia
IEEE Transactions on Affective Computing
IEEE Transactions on Image Processing
Journal of Natural Language Processing
Neural Processing Letters

Education

Korea Advanced Institute of Science and Technology (KAIST), South Korea (2022 - Present)

Ph.D. in Electrical Engineering (advisor: Prof. Yong Man Ro)

Korea Advanced Institute of Science and Technology (KAIST), South Korea (2020 - 2022)

M.S. in Electrical Engineering (advisor: Prof. Yong Man Ro)

Korea Advanced Institute of Science and Technology (KAIST), South Korea (2015 - 2020)

B.S. in Electrical Engineering

Nanjing International School (NIS), China (2009 - 2015)

International Baccalaureate (IB) Diploma

Paparoa Street School, New Zealand (2006 - 2007)

Teaching Experience

- EE474 Introduction to Multimedia, KAIST (2022 Spring, 2023 Spring, 2024 Spring, 2025 Spring)
  - Lecturer for the programming class: 2023 Spring, 2024 Spring, 2025 Spring
- EE305 Introduction to Electronics Design Lab, KAIST (2022 Fall, 2023 Fall)

Skills

Programming Languages

Python, C, and MATLAB

Frameworks

PyTorch, JAX/Flax, and TensorFlow

Languages

Korean, English, and Chinese
