Dongxu Li (last updated Feb 25, 2022)

dongxuli1005 [at] gmail [dot] com


About

I am a researcher working on multimodal generative AI. 

I obtained my Ph.D. and Bachelor's degrees from The Australian National University (ANU), both in computing. My main research interests are vision-and-language and multimodal representation learning.

Selected Recent Publications

(See Google Scholar for a complete list of publications)

[preprint 24'] "Aria: An Open Multimodal Native Mixture-of-Experts Model

Dongxu Li, Yudong Liu, Haoning Wu, Yue Wang, Zhiqi Shen, Bowen Qu, Xinyao Niu, Guoyin Wang, Bei Chen, Junnan Li

[paper] [🤗model] [code]


TL;DR - An open multimodal native mixture-of-experts model, pre-trained from scratch on a mixture of language and multimodal data, with strong performance across language, vision, and long multimodal (video and document) understanding tasks.

[NeurIPS 24'] "LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding

Haoning Wu, Dongxu Li*, Bei Chen, Junnan Li  (*primary senior contribution)

[paper] [code]


TL;DR - A comprehensive benchmark for long-form video understanding using referring reasoning questions, revealing a significant gap between open-source and proprietary models.

[NeurIPS 23'] "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing

Dongxu Li, Junnan Li, Steven Hoi.

[paper] [code]


TL;DR - The first diffusion model with built-in multimodal control, offering a 20x speedup over DreamBooth and enabling zero-shot subject-driven generation and editing.

[NeurIPS 23'] "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai*, Junnan Li*, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. (* equal contribution)

[paper] [code]


TL;DR - An instruction-tuned multimodal foundation model that substantially outperforms BLIP-2 and the larger Flamingo models, and achieves state-of-the-art performance when finetuned on individual downstream tasks.

[ICML 23'] "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi.

[paper] [code] [blog]


TL;DR - BLIP-2 leverages frozen pre-trained image encoders and large language models (LLMs) by training a lightweight, 12-layer Transformer encoder in between them, achieving state-of-the-art performance on various vision-language tasks. Most notably, BLIP-2 improves upon Flamingo, an 80 billion parameter model, by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters.
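For the mechanism, here is a minimal, illustrative PyTorch sketch of the frozen-encoder/frozen-LLM bridging idea. The actual Q-Former is a BERT-style encoder with cross-attention layers trained in two stages, so the module, dimensions, and layer choices below are simplifying assumptions, not the released implementation.

```python
# Illustrative sketch only: a small trainable module bridging a frozen image encoder
# and a frozen LLM, in the spirit of BLIP-2's Q-Former. The use of nn.TransformerDecoder
# and the specific dimensions are simplifications, not the paper's exact design.
import torch
import torch.nn as nn

class LightweightBridge(nn.Module):
    def __init__(self, num_queries=32, dim=768, llm_dim=2560, num_layers=12):
        super().__init__()
        # A fixed set of learned query tokens that extract visual information.
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=12, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=num_layers)
        # Project query outputs into the frozen LLM's embedding space (soft prompts).
        self.proj = nn.Linear(dim, llm_dim)

    def forward(self, image_feats):
        # image_feats: (B, num_patches, dim), produced by a frozen image encoder.
        q = self.queries.expand(image_feats.size(0), -1, -1)
        q = self.blocks(q, image_feats)   # queries cross-attend to image features
        return self.proj(q)               # (B, num_queries, llm_dim) soft visual prompts


# Only the bridge is trained; the image encoder and LLM stay frozen.
bridge = LightweightBridge()
dummy_feats = torch.randn(2, 257, 768)    # dummy patch features (shapes assumed)
prompts = bridge(dummy_feats)             # fed as prefix embeddings to the frozen LLM
print(prompts.shape)                      # torch.Size([2, 32, 2560])
```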

[ACL Demo 23'] "LAVIS: A library for language-vision intelligence

Dongxu Li, Junnan Li, Hung Le, Guangsen Wang, Silvio Savarese, Steven CH Hoi.

[paper] [code] [blog]


TL;DR - A one-stop open-source deep learning library for language-vision research and applications, offering comprehensive support for a wide range of tasks, datasets, and state-of-the-art models (see the usage sketch below).


9K+ stars, 900+ forks on GitHub, 500K+ PyPI downloads.
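A brief usage sketch in the spirit of the LAVIS README follows; the model name, loader signature, and preprocessing steps may differ across library versions, so treat this as an assumption-laden example rather than authoritative documentation.

```python
# Usage sketch (based on the LAVIS README; model names/versions are assumptions).
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a pre-trained captioning model together with its matching preprocessors.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip_caption", model_type="base_coco", is_eval=True, device=device
)

raw_image = Image.open("example.jpg").convert("RGB")
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Generate a caption for the image.
print(model.generate({"image": image}))
```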

[ICML 22'] "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation

Junnan Li, Dongxu Li, Caiming Xiong, Steven CH Hoi. 

[paper] [code] [blog] [demo] [zhihu]


TL;DR - A new vision-language foundation model, trained on self-bootstrapped captions. SoTA on 7 downstream image/video-text understanding and generation tasks.

[CVPR 22'] "Align and Prompt: Video-and-Language Pre-training with Entity Prompts"  

Dongxu Li, Junnan Li, Hongdong Li, Juan Carlos Niebles, Steven CH Hoi.

[paper] [code] [zhihu]


TL;DR - A new video-language pre-training technique, capturing fine-grained video information. SoTA on 4 text-to-video retrieval and videoQA datasets.