Find the full list & our latest work at [Google Scholar]

Representative Papers

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model.

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou.

CVPR 2024. [arxiv] [project page] [code]

10K+ GitHub Stars.

All in One: Exploring Unified Video-Language Pre-training. 

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

CVPR 2023. [arxiv] [code]

The 1st unified video-language large-scale pretrained model.

Egocentric Video-Language Pretraining. (a.k.a. EgoVLP)

Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou.

NeurIPS 2022. [arxiv] [project page] [code] 

Spotlight, 1.7% acceptance rate. The 1st work to pioneer vision-language foundation model for egocentric video.

 

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. 

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

ICCV 2023. [arxiv] [project page] [code]

A pioneering efficient video diffusion model: 10 minutes of training on 1 GPU.

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang.

CVPR 2017. [arxiv] [project page] [code]

Oral presentation, acceptance rate 2.6%, best student paper nomination.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs. 

Zheng Shou, Dongang Wang, Shih-Fu Chang.

CVPR 2016. [arxiv] [code]

A pioneering work that proposes the first deep learning framework for temporal action localization in video.

Full List

2024

Bootstrapping SparseFormers from Vision Foundation Models. 

Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

VideoLLM-online: Towards Large Video-Language Model for Streaming Video

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] [website]

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives

Grauman et al. [ Singapore Site Author List: Joya Chen, Jia-Wei Liu, Xinzhu Fu, Chenan Song, Mike Zheng Shou. ]

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] Oral

Adding Universal Compatibility of Plugins for Upgraded Diffusion Model

Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream

Jingtao Sun, Yaonan Wang, Mingtao Feng, Yulan Guo, Ajmal Saeed Mian, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

Tune-An-Ellipse: CLIP Has Potential to Find What You Want

Jinheng Xie, Songhe Deng, Bing Li, Haozhe Liu, Yawen Huang, Yefeng Zheng, Jürgen Schmidhuber, Bernard Ghanem, Linlin Shen, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] Highlight, Acceptance Rate 2.8%

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing

Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis

Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence

Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Dechau Tang.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

ViT-Lens: Towards Omni-modal Representations

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

AssistGUI: Task-Oriented Desktop Graphical User Interface Automation. 

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens. 

Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2024. [arxiv] [code]

What is the successor of Vision Transformer? We present an efficient paradigm with a better accuracy-throughput tradeoff on both image and video tasks.

2023

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.

Muhammet Furkan ILASLAN, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Hwee Lim, Mike Zheng Shou.

Empirical Methods in Natural Language Processing (EMNLP), 2023. 

VisorGPT: Learning Visual Prior via Generative Pre-Training. 

Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models. 

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

XAGen: 3D Expressive Human Avatars Generation. 

Eric Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Jiashi Feng, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models. 

Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Object-centric Learning with Cyclic Walks between Parts and Whole. 

Ziyu Wang, Mike Zheng Shou, Mengmi Zhang.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation. 

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The 1st open-source video diffusion model at https://github.com/showlab/Tune-A-Video

Revisiting Vision Transformer from the View of Path Ensemble. 

Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv] Oral Presentation, Acceptance Rate 1.58%

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion. 

Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

Without training at all, condition image diffusion models with box or scribble control

UniVTG: Towards Unified Video-Language Temporal Grounding. 

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Jinpeng Wang, Rui Yan, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The 1st work to unify the video temporal grounding task-specific labels: moment retrieval (interval), highlight detection (curve) and video summarization (point)

Too Large; Data Reduction for Vision-Language Pre-Training. 

Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

TL;DR -- compress the existing large vision-language pre-training dataset into a small, high-quality set

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video. 

Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Eric Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023. [project page]

From a single video input, reconstruct a high-fidelity dynamic human-object-scene, so that the input video can be paused at any time and viewed from any direction

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone. 

Shraman Pramanick, Yale Song, Sayan Nag, Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The new version of our EgoVLP for egocentric video-language pre-training

Label-Efficient Online Continual Object Detection in Streaming Video. 

Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023.

We examine a more realistic and challenging problem -- Label-Efficient Online Continual Object Detection in video streams

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models. 

Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The 1st work to leverage a pre-trained image diffusion model to boost open-vocabulary semantic segmentation.

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization.

Xue Xizhe, Dongdong Yu, Lingqiao Liu, Yu Liu, Ying Li, Zehuan Yuan, Ping Song, Mike Zheng Shou.

ACM Multimedia (ACM MM), 2023. [arxiv] 

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering. 

Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

A video question answering system that can assist humans in daily activities by analyzing long-form videos with diverse and complex events.

All in One: Exploring Unified Video-Language Pre-training. 

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

The first unified video-language large-scale pretrained model, outperforming state-of-the-art methods on 9 datasets such as those for text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning.

Making Vision Transformers Efficient from A Token Sparsification View. 

Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

Position-guided Text Prompt for Vision Language Pre-training. 

Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

Affordance Grounding from Demonstration Video to Target Image. 

Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval. 

Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv] 

PV3D: A 3D Generative Model for Portrait Video Generation

Eric Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2023. [arxiv] [demo]

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task

Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou.

The AAAI Conference on Artificial Intelligence (AAAI), 2023. [arxiv] Oral Presentation. The first continual learning work for VQA

2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant

Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou.

Findings of Empirical Methods in Natural Language Processing (EMNLP), 2022. [arxiv] Towards personal AI assistant on AR glass: language + vision + "point at"

Egocentric Video-Language Pretraining. (a.k.a. EgoVLP)

Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou.

Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022. [arxiv] Spotlight, 1.7% acceptance rate. The first work to pioneer vision-language foundation model for egocentric video

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes.

Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022. [arxiv] Make NeRF 100x faster in modelling dynamic scenes 

AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant

Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv] Towards personal AI assistant on AR glass: language + vision + "point at"

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Text-based Retrieval

Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv]

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video

David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv] [news 机器之心]

Ego4D: Around the World in 3,000 Hours of Egocentric Video

Grauman et al. [ Singapore Site Author List: Eric Zhongcong Xu, Ruijie Tao, Yunyi Zhu, Haizhou Li, Mike Zheng Shou. ]

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv] Oral presentation, best paper finalist.

Object-aware Video-language Pre-training for Retrieval

Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv]

Unified Transformer Tracker for Object Tracking

Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv]

2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation

Mike Zheng Shou, Stan W. Lei, Deepti Ghadiyaram, Weiyao Wang, Matt Feiszli.

International Conference on Computer Vision (ICCV), 2021. [arxiv]

The first large-scale taxonomy-free event segmentation benchmark. A stepping stone to addressing long-form video understanding. We organised a workshop called LOVEU at CVPR'21 along with competitions built upon this dataset. The competitions attracted 20+ participants!

Channel Augmented Joint Learning for Visible-Infrared Recognition

Mang Ye, Weijian Ruan, Bo Du, Mike Zheng Shou.

International Conference on Computer Vision (ICCV), 2021.

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection

Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li.

ACM Multimedia, 2021. Oral presentation. [AVA challenge report]

Leverages video + audio to detect active speakers. Secured 3rd place in the AVA challenge organised by Google Research at CVPR'21 ActivityNet.

On Pursuit of Designing Multi-modal Transformer for Video Grounding

Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou.

Empirical Methods in Natural Language Processing (EMNLP), 2021. Oral presentation.

The first work to explore how to design a video-language Transformer for temporally grounding a textual query in long video.

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization

Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. [arxiv]

Localize action in space and time. Core technique of the 1st-place entry in the AVA challenge organised by Google Research at CVPR'20 ActivityNet.

Earlier Publications

SF-Net: Single-Frame Supervision for Temporal Action Localization.

Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou.

European Conference on Computer Vision (ECCV), 2020. Spotlight, acceptance rate top 5%. [arxiv]

A new form of weak supervision, comparable results to its fully-supervised counterpart with much cheaper annotation cost.

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.

Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [arxiv]

A video model that learns discriminative motion cues directly from compressed video - fast & accurate.

AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang.

European Conference on Computer Vision (ECCV), 2018. [arxiv]

Online Detection of Action Start in Untrimmed, Streaming Videos.

Zheng Shou*, Junting Pan*, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavi Giró-i-Nieto, Shih-Fu Chang.

European Conference on Computer Vision (ECCV), 2018. [arxiv]

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [arxiv] Oral presentation, acceptance rate 2.6%, best student paper nomination.

ConvNet Architecture Search for Spatiotemporal Feature Learning.

Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri.

Technical Report, 2017. [arxiv] [github]

An open-source Res3D video backbone model that can support many video applications.

Single Shot Temporal Action Detection.

Tianwei Lin, Xu Zhao, Zheng Shou.

ACM Multimedia, 2017. [paper] [challenge report]

Won first place in both the Temporal Action Proposal track and the Temporal Action Localization track at the ActivityNet Challenge.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.

Zheng Shou, Dongang Wang, Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [arxiv]

A pioneering work that proposes the first deep learning framework for temporal action localization in video.