Dongxu Li (last updated Feb 25, 2022)
dongxuli1005 [at] gmail [dot] com
About
I am a researcher working on multimodal generative AI.
I obtained my Ph.D. and Bachelor's degrees from The Australian National University (ANU), both in computing. My main research interests are vision-and-language and multimodal representation learning.
Selected Recent Publications
(See Google Scholar for a complete list of publications)
[preprint 24'] "Aria: An Open Multimodal Native Mixture-of-Experts Model"
Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li
TL;DR - An open multimodal native mixture-of-experts model with open weights and codebase, handling a wide range of multimodal and language tasks.
[NeurIPS 24'] "LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding"
Haoning Wu, Dongxu Li*, Bei Chen, Junnan Li (*primary senior contribution)
TL;DR - A comprehensive benchmark for long-form video understanding using referring reasoning questions, revealing a significant gap between open-source and proprietary models.
[NeurIPS 23'] "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing"
Dongxu Li, Junnan Li, Steven Hoi.
TL;DR - The first diffusion model with built-in multimodal control, offering a 20x speedup over DreamBooth and enabling zero-shot subject-driven generation and editing.
[NeurIPS 23'] "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning"
Wenliang Dai*, Junnan Li*, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. (* equal contribution)
TL;DR - An instruction-tuned multimodal foundation model that substantially outperforms BLIP-2 and the larger Flamingo models, and achieves state-of-the-art performance when finetuned on individual downstream tasks.
[ICML 23'] "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models"
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.
TL;DR - BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
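For illustration, below is a minimal zero-shot VQA sketch using a BLIP-2 checkpoint through Hugging Face Transformers; the image path and question are placeholders, and this is only one possible way to run the model (LAVIS, listed below, provides an equivalent interface).
```python
# Minimal BLIP-2 zero-shot VQA sketch via Hugging Face Transformers.
# "example.jpg" and the question are illustrative placeholders.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the processor and model (add torch_dtype=torch.float16 on GPU to save memory).
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

image = Image.open("example.jpg").convert("RGB")
prompt = "Question: what is shown in the image? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)

# Generate an answer with the frozen image encoder + Q-Former + frozen LLM pipeline.
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```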
[ACL Demo 23'] "LAVIS: A library for language-vision intelligence"
Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven CH Hoi.
TL;DR - A one-stop open-source deep learning library for language-vision research and applications, offering comprehensive support for a wide range of tasks, datasets, and state-of-the-art models (a short usage sketch follows this entry).
9K+ stars, 900+ forks on GitHub, 500K+ PyPI downloads.
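A minimal image-captioning sketch with LAVIS, assuming the library is installed from PyPI; the image path is an illustrative placeholder.
```python
# Minimal LAVIS usage sketch: caption an image with a pre-trained BLIP model.
# "example.jpg" is an illustrative placeholder path.
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
raw_image = Image.open("example.jpg").convert("RGB")

# Load the model together with its matching image preprocessor.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

# Preprocess the image and generate a caption.
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)
print(model.generate({"image": image}))
```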
[ICML 22'] "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation"
Junnan Li, Dongxu Li, Caiming Xiong, Steven CH Hoi.
TL;DR - A new vision-language foundation model, trained on self-bootstrapped captions. SoTA on 7 downstream image/video-text understanding and generation tasks.
[CVPR 22'] "Align and Prompt: Video-and-Language Pre-training with Entity Prompts"
Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven CH Hoi.
TL;DR - A new video-language pre-training technique, capturing fine-grained video information. SoTA on 4 text-to-video retrieval and videoQA datasets.