Chia-Wen is currently a research scientist on the ByteDance Intelligent Editing Team, led by Longyin Wen and Xiaohui Shen. He is a core contributor to the Vidi project for long-form video understanding, which spans spatial-temporal localization, video highlighting, and question answering, and processes joint visual and audio inputs.
Chia-Wen earned his Ph.D. at Georgia Tech, where he joined the Robotics Perception and Learning (RIPL) lab, led by Dr. Zsolt Kira, in 2017. His Ph.D. research focused on deep learning for vision-and-language (VL) tasks, and his dissertation centered on efficient and effective ways to integrate external knowledge into VL models and tasks. Prior to joining Georgia Tech, he worked with Dr. Yu-Chiang Frank Wang on 3D shape generation and pose estimation. Before that, he earned his Master's and Bachelor's degrees in Electrical Engineering at National Taiwan University.
(10/25-30/2025) 🏖️ Attend ICCV-25 in Honolulu
(07/23/2025) 📜 D-Attn paper accepted to ICCV-25
(07/16/2025) 📹 Vidi for long-form video understanding
(02/26/2024) Research scientist at ByteDance
(12/15/2023) 👨‍🎓 Graduate from Georgia Tech
(11/29/2023) 🎓 Pass Ph.D. Defense
(07/10/2023) Code release for HAAV [GitHub]
(06/18-22/2023) Attend CVPR-23 in Vancouver
(04/18/2023) 🎓 Pass Ph.D. Proposal Exam
(02/27/2023) 📜 HAAV paper accepted to CVPR-23
ByteDance Intelligent Editing Team (core contributor)
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.
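To make the temporal retrieval (TR) task concrete, here is a minimal sketch of how predicted time ranges are typically scored against ground truth with temporal IoU. The function names and the recall-style protocol are illustrative assumptions for exposition, not the Vidi codebase.

```python
# Illustrative sketch of temporal-retrieval scoring (not the Vidi code):
# a prediction is a (start, end) time range in seconds for a text query,
# and quality is measured by temporal IoU against the ground-truth range.

def temporal_iou(pred, gt):
    """IoU between two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of queries whose predicted range overlaps GT above thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: one query localized at [12.0, 34.5] s vs. GT [10.0, 30.0] s.
print(temporal_iou((12.0, 34.5), (10.0, 30.0)))  # ~0.73
```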
[ICCV 2025] Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
We propose decomposed attention (D-Attn), a large vision-and-language model (LVLM) that achieves linear computational complexity in the vision modality along with stronger VL capability.
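To illustrate the decomposition idea, below is a minimal single-head PyTorch sketch: visual-to-visual attention is diagonalized so each visual token attends only to itself (linear in the number of visual tokens), while text tokens keep causal self-attention and cross-attend to visual tokens. The shared projections and the plain-average merge are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def d_attn_sketch(vis, txt, w_q, w_k, w_v):
    """Single-head sketch of decomposed attention.

    vis: (B, Nv, D) visual embeddings; txt: (B, Nt, D) text embeddings.
    w_q, w_k, w_v: (D, D) projections (assumed shared across streams).
    """
    # V-to-V is diagonalized: softmax over a single key (the token itself)
    # just returns that token's value, so the cost is O(Nv), not O(Nv^2).
    v_out = vis @ w_v

    q_t = txt @ w_q
    # T-to-T: standard causal self-attention among text tokens.
    t2t = F.scaled_dot_product_attention(q_t, txt @ w_k, txt @ w_v,
                                         is_causal=True)
    # T-to-V: text queries cross-attend to visual keys/values.
    t2v = F.scaled_dot_product_attention(q_t, vis @ w_k, vis @ w_v)

    # The paper derives a principled weighting to merge the two text
    # streams; a plain average is used here purely for illustration.
    return v_out, 0.5 * (t2t + t2v)

# Usage with random tensors:
B, Nv, Nt, D = 2, 256, 32, 64
vis, txt = torch.randn(B, Nv, D), torch.randn(B, Nt, D)
w = [torch.randn(D, D) / D**0.5 for _ in range(3)]
v_out, t_out = d_attn_sketch(vis, txt, *w)
print(v_out.shape, t_out.shape)  # (2, 256, 64) (2, 32, 64)
```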
[ICLR 2021] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, Peter Vajda
We propose Unbiased Teacher, which combines teacher-student training with focal loss to reduce pseudo-labeling bias in semi-supervised object detection.
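The core training loop can be sketched in a few lines: the teacher is an exponential moving average (EMA) of the student, its confident detections become pseudo-labels for unlabeled images, and focal loss replaces cross-entropy in the classification head to counter class imbalance. The threshold and helper names below are illustrative assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the
    # student's, which stabilizes the pseudo-labels the teacher produces.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes, scores, labels, thresh=0.7):
    # Keep only high-confidence teacher detections as pseudo-labels;
    # the student is then trained on them with focal loss (e.g.,
    # torchvision.ops.sigmoid_focal_loss) instead of cross-entropy,
    # which down-weights easy background examples and reduces bias.
    keep = scores > thresh
    return boxes[keep], labels[keep]
```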
Feb'24 - Present
Core contributor to the Vidi project for long-form video understanding, spanning spatial-temporal localization, video highlighting, and question answering over joint visual and audio inputs.
Lead model architecture design, training codebase development, and model and data scaling.
Work with Longyin Wen and Xiaohui Shen.
Aug '22 - Aug '23
Role: Research intern
Advisor: Chunyuan Li; Collaborator: Jianwei Yang
Project: Knowledge augmentation for large-scale pre-trained vision-and-language models.
May '21 - Aug '21
Role: Applied scientist intern
Advisor: Yuting Zhang
Project: Leveraging cross-modal pre-trained models for vision-and-language tasks.
May '20 - Aug '20
Role: Research intern
Advisor: Zeki Yalniz
Project: Learned data augmentation for self-supervised learning.
Jan '17 - Aug '17
Role: Research assistant
Advisor: Yu-Chiang Frank Wang
Project: 3D object reconstruction and pose estimation via generative models.
Aug '17 - Dec '23
Major: Robotics, working on computer vision and deep learning
Advisor: Dr. Zsolt Kira
Research interests: vision and language; knowledge augmentation; learning with less supervision.
Sep '13 - Jun '15
Major: Electrical Engineering
Advisor: Prof. Ren C. Luo
Thesis title: Robot Integrated 3D Object Recognition and Fetching System for Factory Automation
Sep '09 - Jun '13
Major: Electrical Engineering