Chia-Wen is currently a research scientist on the ByteDance Intelligent Editing Team, led by Longyin Wen and Xiaohui Shen. He is a core contributor to the Vidi project for long-form video understanding, which spans spatial-temporal localization, video highlighting, and question answering, and processes joint visual and audio inputs.
Chia-Wen earned his Ph.D. at Georgia Tech, where he joined the Robotics Perception and Learning (RIPL) lab, led by Dr. Zsolt Kira, in 2017. His Ph.D. research focused on deep learning for vision-and-language (VL) tasks, and his dissertation centered on efficient and effective ways to integrate external knowledge into VL models and tasks. Prior to joining Georgia Tech, he worked with Dr. Yu-Chiang Frank Wang on 3D shape generation and pose estimation. Before that, he earned his Master's and Bachelor's degrees in Electrical Engineering at National Taiwan University.
(10/25-30/2025) 🏖️ Attend ICCV-25 in Honolulu
(07/23/2025) 📜 D-Attn paper accepted to ICCV-25
(07/16/2025) 📹 Vidi for long-form video understanding
(02/26/2024) Research scientist at ByteDance
(12/15/2023) 👨‍🎓 Graduate from Georgia Tech
(11/29/2023) 🎓 Pass Ph.D. Defense
(07/10/2023) Code release for HAAV [GitHub]
(06/18-22/2023) Attend CVPR-23 in Vancouver
(04/18/2023) 🎓 Pass Ph.D. Proposal Exam
(02/27/2023) 📜 HAAV paper accepted to CVPR-23
ByteDance Intelligent Editing Team (core contributor)
We introduce Vidi, a family of Large Multimodal Models (LMMs) for a wide range of video understanding and editing (VUE) scenarios. The first release focuses on temporal retrieval (TR), i.e., identifying the time ranges in input videos corresponding to a given text query.
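To make the temporal retrieval (TR) task concrete, here is a minimal sketch of how predicted time ranges are typically scored against ground truth with temporal IoU. The function names and the recall-style protocol are illustrative assumptions for exposition, not the Vidi codebase.

```python
# Illustrative sketch of temporal-retrieval scoring (not the Vidi code):
# a prediction is a (start, end) time range in seconds for a text query,
# and quality is measured by temporal IoU against the ground-truth range.

def temporal_iou(pred, gt):
    """IoU between two (start, end) time ranges in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at_iou(preds, gts, thresh=0.5):
    """Fraction of queries whose predicted range overlaps GT above thresh."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)

# Example: one query localized at [12.0, 34.5] s vs. GT [10.0, 30.0] s.
print(temporal_iou((12.0, 34.5), (10.0, 30.0)))  # ~0.73
```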
[ICCV 2025] Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
We propose decomposed attention (D-Attn), a large vision-and-language model (LVLM) that achieves linear computational complexity in the vision modality along with stronger VL capability.
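To illustrate the decomposition idea, below is a minimal single-head PyTorch sketch: visual-to-visual attention is diagonalized so each visual token attends only to itself (linear in the number of visual tokens), while text tokens keep causal self-attention and cross-attend to visual tokens. The shared projections and the plain-average merge are simplifying assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def d_attn_sketch(vis, txt, w_q, w_k, w_v):
    """Single-head sketch of decomposed attention.

    vis: (B, Nv, D) visual embeddings; txt: (B, Nt, D) text embeddings.
    w_q, w_k, w_v: (D, D) projections (assumed shared across streams).
    """
    # V-to-V is diagonalized: softmax over a single key (the token itself)
    # just returns that token's value, so the cost is O(Nv), not O(Nv^2).
    v_out = vis @ w_v

    q_t = txt @ w_q
    # T-to-T: standard causal self-attention among text tokens.
    t2t = F.scaled_dot_product_attention(q_t, txt @ w_k, txt @ w_v,
                                         is_causal=True)
    # T-to-V: text queries cross-attend to visual keys/values.
    t2v = F.scaled_dot_product_attention(q_t, vis @ w_k, vis @ w_v)

    # The paper derives a principled weighting to merge the two text
    # streams; a plain average is used here purely for illustration.
    return v_out, 0.5 * (t2t + t2v)

# Usage with random tensors:
B, Nv, Nt, D = 2, 256, 32, 64
vis, txt = torch.randn(B, Nv, D), torch.randn(B, Nt, D)
w = [torch.randn(D, D) / D**0.5 for _ in range(3)]
v_out, t_out = d_attn_sketch(vis, txt, *w)
print(v_out.shape, t_out.shape)  # (2, 256, 64) (2, 32, 64)
```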
[ICLR 2021] Yen-Cheng Liu, Chih-Yao Ma, Zijian He, Chia-Wen Kuo, Kan Chen, Peizhao Zhang, Bichen Wu, Zsolt Kira, Peter Vajda
We propose Unbiased Teacher, which combines teacher-student training with focal loss to reduce pseudo-labeling bias in semi-supervised object detection.
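The core training loop can be sketched in a few lines: the teacher is an exponential moving average (EMA) of the student, its confident detections become pseudo-labels for unlabeled images, and focal loss replaces cross-entropy in the classification head to counter class imbalance. The threshold and helper names below are illustrative assumptions, not the released implementation.

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    # Teacher weights track an exponential moving average of the
    # student's, which stabilizes the pseudo-labels the teacher produces.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes, scores, labels, thresh=0.7):
    # Keep only high-confidence teacher detections as pseudo-labels;
    # the student is then trained on them with focal loss (e.g.,
    # torchvision.ops.sigmoid_focal_loss) instead of cross-entropy,
    # which down-weights easy background examples and reduces bias.
    keep = scores > thresh
    return boxes[keep], labels[keep]
```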
Feb'24 - Present
Core contributor to the Vidi project for long-form video understanding, spanning spatial-temporal localization, video highlighting, and question answering over joint visual and audio inputs.
Lead model architecture design, training codebase development, and model and data scaling.
Work with Longyin Wen and Xiaohui Shen.
Aug '22 - Aug '23
Role: Research intern
Advisor: Chunyuan Li; Collaborator: Jianwei Yang
Project: Knowledge augmentation for large-scale pre-trained vision-and-language models.
May '21 - Aug '21
Role: Applied scientist intern
Advisor: Yuting Zhang
Project: Leveraging cross-modal pre-trained models for vision-and-language tasks.
May '20 - Aug '20
Role: Research intern
Advisor: Zeki Yalniz
Project: Learned data augmentation for self-supervised learning.
Jan '17 - Aug '17
Role: Research assistant
Advisor: Yu-Chiang Frank Wang
Project: 3D object reconstruction and pose estimation via generative models.
Aug '17 - Dec '23
Major: Robotics, working on computer vision and deep learning
Advisor: Dr. Zsolt Kira
Research interests: vision and language; knowledge augmentation; learning with less supervision.
Sep '13 - Jun '15
Major: Electrical Engineering
Advisor: Prof. Ren C. Luo
Thesis title: Robot Integrated 3D Object Recognition and Fetching System for Factory Automation
Sep '09 - Jun '13
Major: Electrical Engineering