Show Lab @ NUS - Publications

Find the full list & our latest work at [Google Scholar]

Representative Papers

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou.

ICLR 2025. [arxiv] [code]

A pioneering work that unifies multimodal understanding and generation, combining autoregressive and discrete diffusion modeling in one transformer.

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou.

IJCV 2024. [arxiv] [project page] [code]

Efficient video diffusion foundation model. 1K+ GitHub Stars.

ShowUI: One Vision-Language-Action Model for GUI Visual Agent.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou.

CVPR 2025. [arxiv] [code] [model]

A pioneering end-to-end vision-language-action model for GUI agents.

The model has been downloaded over 230,000 times.

Egocentric Video-Language Pretraining. (a.k.a. EgoVLP)

Kevin Qinghong Lin, Alex Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Zhongcong Xu, Difei Gao, Rongcheng Tu, Wenzhe Zhao, Weijie Kong, Chengfei Cai, Hongfa Wang, Dima Damen, Bernard Ghanem, Wei Liu, Mike Zheng Shou.

NeurIPS 2022. [arxiv] [project page] [code]

Spotlight, 1.7% acceptance rate. The first work to build a vision-language foundation model for egocentric video.

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model.

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou.

CVPR 2024. [arxiv] [project page] [code]

Human-centric long video generation. 10K+ GitHub Stars.

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

ICCV 2023. [arxiv] [project page] [code]

A pioneering efficient video diffusion model. Trains in 10 minutes on 1 GPU.

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang.

CVPR 2017. [arxiv] [project page] [code]

Oral presentation, acceptance rate 2.6%, best student paper nomination.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.

Zheng Shou, Dongang Wang, Shih-Fu Chang.

CVPR 2016. [arxiv] [code]

A pioneering work that proposes the first deep learning framework for temporal action localization in video.

Full List

2026

ReGRPO: Reflection-Augmented Policy Optimization for Tool-Using Agents

Binjie Zhang, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

WorldWander: Bridging Egocentric and Exocentric Worlds in Video Generation

Quanjian Song, Yiren Song, Kelly Peng, Yuan Gao, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

Yuchao Gu, Guian Fang, Yuxin Jiang, Weijia Mao, Song Han, Han Cai, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

PAI-Studio: Cinematic Video Background Replacement with Camera-Aware Motion

Heyuan Gao, Bangxun Tang, Yiren Song, Guian Fang, Zijian He, Jie Yang, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

GameWorlds: Towards Standardized and Verifiable Evaluation of Multimodal Game Agents

Mingyu Ouyang, Siyuan Hu, Kevin Qinghong Lin, Hwee Tou Ng, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

Parallelized Autoregressive Decoding for Omni-Modal Dense Video Captioning

Wenzheng Zeng, Siyi Jiao, Chen Gao, Hwee Tou Ng, Mike Zheng Shou

The 19th European Conference on Computer Vision (ECCV), 2026.

A Survey on Foundations and Frontiers of Multimodal Agentic Frameworks: Techniques and Applications

Neel Mokaria, Rishie Raj, Dheeraj Baiju, Xiaoqian Shen, Shraman Pramanick, Kevin Qinghong Lin, Arda Senocak, Mike Zheng Shou, Philip Torr, Mohamed Elhoseiny, Yapeng Tian, Ruohan Gao, Salman Khan, Sayan Nag, Sanjoy Chowdhury, Dinesh Manocha

Transactions on Machine Learning Research (TMLR), 2026.

Diffusion Models in Robotics: A Survey

Xiaokang Liu, Kevin Yuchen Ma, Chen Gao, Mike Zheng Shou

International Journal of Computer Vision (IJCV), 2026.

FedIGA: Improving Global Model Aggregation for Federated Self-Supervised Skeletal Action Recognition

Binqian Xu, Xiangbo Shu, Chun-Mei Feng, Basura Fernando, Mike Zheng Shou

IEEE Transactions on Multimedia (TMM), 2026.

Semantic-Contact Fields for Category-Level Generalizable Tactile Tool Manipulation

Kevin Yuchen Ma, Heng Zhang, Weisi Lin, Mike Zheng Shou and Yan Wu

Robotics: Science and Systems (RSS), 2026.

Olaf-World: Orienting Latent Actions for Video World Modeling

Yuxin Jiang, Yuchao Gu, Ivor Tsang, Mike Zheng Shou

The Forty-Third International Conference on Machine Learning (ICML), 2026.

Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

Yanzhe Chen, Kevin Yuchen Ma, Qi Lv, Yiqi Lin, Zechen Bai, Chen Gao, Mike Zheng Shou

The Forty-Third International Conference on Machine Learning (ICML), 2026.

Code2Video: A Code-centric Paradigm for Educational Video Creation

Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou

The Forty-Third International Conference on Machine Learning (ICML), 2026.

The Consistency Critic: Correcting Inconsistencies in Generated Images via Reference-Guided Attentive Alignment

Ziheng Ouyang, Yiren Song, Yaoli Liu, Shihao Zhu, Qibin Hou, Ming-Ming Cheng, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers

Yiqing Shi, Yiren Song, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

Demo2Tutorial: From Human Experience to Multimodal Software Tutorials

Zechen Bai, Zhiheng Chen, Yiqi Lin, Kevin Qinghong Lin, Difei Gao, Xiangwu Guo, WANG XIN, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

RobotSeg: A Model and Dataset for Segmenting Robots in Image and Video

Haiyang Mei, Huang Qiming, Hai Ci, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. Oral Presentation

FocusUI: Efficient UI Grounding via Position-Preserving Visual Token Selection

Mingyu Ouyang, Kevin Qinghong Lin, Mike Zheng Shou, Hwee Tou Ng

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Yiren Song, Cheng Liu, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

P-Flow: Prompting Visual Effects Generation

Rui Zhao, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

The Image as Its Own Reward: Reinforcement Learning with Adversarial Reward for Image Generation

Weijia Mao, Hao Chen, Zhenheng Yang, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026.

Diffusion-Driven Self-Supervised Learning for Shape Reconstruction and Pose Estimation

Jingtao Sun, Yaonan Wang, Mingtao Feng, Chao Ding, Mike Zheng Shou and Ajmal Mian

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

Cross-Embodiment Dexterous Hand Articulation Generation via Morphology-Aware Learning

Heng Zhang, Kevin Yuchen Ma, Mike Zheng Shou, Weisi Lin and Yan Wu

The IEEE International Conference on Robotics and Automation (ICRA), 2026.

VideoMind: A Chain-of-LoRA Agent for Temporal-Grounded Video Reasoning

Ye Liu, Kevin Qinghong Lin, Chang Wen Chen and Mike Zheng Shou

The Fourteenth International Conference on Learning Representations (ICLR), 2026.

Draw-In-Mind: Rebalancing Designer-Painter Roles in Unified Multimodal Models Benefits Image Editing

Ziyun Zeng, David Junhao Zhang, Wei Li and Mike Zheng Shou

The Fourteenth International Conference on Learning Representations (ICLR), 2026.

TPDiff: Temporal Pyramid Video Diffusion Model

Lingmin Ran and Mike Zheng Shou

The Fourteenth International Conference on Learning Representations (ICLR), 2026.

D-AR: Diffusion via Autoregressive Models

Ziteng Gao and Mike Zheng Shou

The Fourteenth International Conference on Learning Representations (ICLR), 2026.

SAM-I2V++: Efficiently Upgrading SAM for Promptable Video Segmentation

Haiyang Mei, Pengyu Zhang and Mike Zheng Shou

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

Language-4D Cross-Boosting for Generalized Zero-Shot 6DoF Tracking and 3D Reconstruction

Jingtao Sun, Yaonan Wang^, Jiawen Zhao, Yike Zhang, Min Liu and Mike Zheng Shou

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.

OptMark: Robust Multi-bit Diffusion Watermarking via Inference Time Optimization

Jiazheng Xing, Hai Ci, Hongbin Xu, Hangjie Yuan, Yong Liu, Mike Zheng Shou

The 40th Annual AAAI Conference on Artificial Intelligence, 2026.

2025

PANDA: Towards Generalist Video Anomaly Detection via Agentic AI Engineer

Zhiwei Yang, Chen Gao, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

OmniConsistency: Learning Style-Agnostic Consistency from Paired Stylization Data

Yiren Song, Cheng Liu, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

DOTA: Distributional Test-time Adaptation of Vision-Language Models

Zongbo Han, Jialong Yang, Guangyu Wang, Junfan Li, Qianli Xu, Mike Zheng Shou, Changqing Zhang

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

Improved Native Unified Multimodal Models

Jinheng Xie, Zhenheng Yang, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

macOSWorld: A Multilingual Interactive Benchmark for GUI Agents

Pei Yang, Hai Ci, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

CoFFT: Chain of Foresight-Focus Thought for Visual Language Models

Xinyu Zhang, Yuxuan Dong, Lingling Zhang, Chengyou Jia, Zhuohang Dang, Basura Fernando, Jun Liu, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Jiaqi WANG, Kevin Qinghong Lin, James Cheng, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

Sparse Image Synthesis via Joint Latent and RoI Flow

Ziteng Gao, Jay Zhangjie Wu, Mike Zheng Shou

The Thirty-Ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025.

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models with Human Feedback

Henry Hengyuan Zhao, Wenqi Pei, Yifei Tao, Haiyang Mei, Mike Zheng Shou

Findings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025.

ColonNeRF: High-Fidelity Neural Reconstruction of Long Colonoscopy

Yufei Shi, Beijia Lu, Jia-Wei Liu, Ming Li, Si Yong Yeo, Mike Zheng Shou

Neurocomputing, 2025.

Open-world Weakly-Supervised Object Localization

Jinheng Xie, Zhaochuan Luo, Rouyi Li, Yawen Huang, Haozhe Liu, Yuexiang Li, Yefeng Zheng, Yang Zhang, Linlin Shen, Mike Zheng Shou

Pattern Recognition (PR), 2025. [arxiv]

GUI-Narrator: Detecting and Captioning Computer GUI Actions

Qinchen Wu, Difei Gao, Kevin Qinghong Lin, Zhuoyu Wu, Mike Zheng Shou

ACM Multimedia (ACM MM), 2025. [arxiv] [website]

Can I Trust You? Advancing GUI Task Automation with Action Trust Score

Haiyang Mei, Difei Gao, Xiaopeng Wei, Xin Yang, Mike Zheng Shou

ACM Multimedia (ACM MM), 2025. [arxiv]

Factorized Learning for Temporally Grounded Video Language Models

Wenzheng Zeng, Difei Gao, Mike Zheng Shou*, Hwee Tou Ng*

International Conference on Computer Vision (ICCV), 2025. [arxiv]

Balanced Image Stylization with Style Matching Score

Yuxin Jiang, Liming Jiang, Shuai Yang, Jia-Wei Liu, Ivor Tsang, Mike Zheng Shou

International Conference on Computer Vision (ICCV), 2025. [arxiv] [website]

LayerTracer: Cognitive-Aligned Layered SVG Synthesis via Diffusion Transformer

Yiren Song, Danze Chen, Mike Zheng Shou

International Conference on Computer Vision (ICCV), 2025. [arxiv] [website] Oral Presentation, acceptance rate 3.3%

DiffSim: Taming Diffusion Models for Evaluating Visual Similarity

Yiren Song, Xiaokang Liu, Mike Zheng Shou

International Conference on Computer Vision (ICCV), 2025. [arxiv] [website]

Impossible Videos

Zechen Bai, Hai Ci, Mike Zheng Shou

Forty-second International Conference on Machine Learning (ICML), 2025. [arxiv] [website]

WMAdapter: Adding WaterMark Control to Latent Diffusion Models

Hai Ci, Yiren Song, Pei Yang, Jinheng Xie, Mike Zheng Shou

Forty-second International Conference on Machine Learning (ICML), 2025. [arxiv]

LiveCC: Learn Streaming Video LLM with Speech Transcription at Scale

Joya Chen, Yiqi Lin, Ziyun Zeng, Wei Li, Zejun Ma, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv] [website]

DIFIX3D+: Improving 3D Reconstructions with Single-Step Diffusion Models

Zhangjie Wu, Yuxuan Zhang, Haithem Turki, Xuanchi Ren, Jun Gao, Mike Zheng Shou, Sanja Fidler, Zan Gojcic, Huan Ling

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

SAM-I2V: Upgrading SAM to Support Promptable Video Segmentation with Less than 0.2% Training Cost

Haiyang Mei, Pengyu Zhang, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Weijia Wu, Mingyu Liu, Zeyu Zhu, Feng Haoen, Xi Xia, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

IDProtector: An Adversarial Noise Encoder to Protect Against ID-Preserving Image Generation

Yiren Song, Pei Yang, Hai Ci, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

ReCapture: Generative Video Camera Controls for User-Provided Videos using Masked Video Fine-Tuning

David Junhao Zhang, Roni Paiss, Shiran Zada, Nikhil Karnad, David E. Jacobs, Yael Pritch, Inbar Mosseri, Mike Zheng Shou, Neal Wadhwa, Nataniel Ruiz

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles

Rui Zhao, Weijia Mao, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

ROICtrl: Boosting Instance Control for Visual Generation

Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

VLog: Video-Language Models by Generative Retrieval of Narration Vocabulary

Kevin Qinghong Lin, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

ShowUI: One Vision-Language-Action Model for GUI Visual Agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, Mike Zheng Shou

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. [arxiv]

Faster Diffusion Through Temporal Attention Decomposition

Haozhe Liu, Wentian Zhang, Jinheng Xie, Francesco Faccio, Mengmeng Xu, Tao Xiang, Mike Zheng Shou, Juan-Manuel Perez-Rua, Jürgen Schmidhuber

Transactions on Machine Learning Research, 2025. [arxiv] [code]

Grounding Multimodal Large Language Model in GUI World.

Weixian Lei, Difei Gao, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2025. [arxiv] [code]

Bridging Information Asymmetry in Text-video Retrieval: A Data-centric Approach.

Zechen Bai, Tianjun Xiao, Tong He, Pichao WANG, Zheng Zhang, Thomas Brox, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2025. [arxiv] [code]

MP-Mat: A 3D-and-Instance-Aware Matting Framework with Multiplane Representation.

Siyi Jiao, Wenzheng Zeng, Yerong Li, Huayu Zhang, Changxin Gao, Nong Sang, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2025. [arxiv] [code]

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation.

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2025. [arxiv] [code]

Image Watermarks are Removable using Controllable Regeneration from Clean Noise.

Yepeng Liu, Yiren Song, Hai Ci, Yu Zhang, Haofan Wang, Mike Zheng Shou, Yuheng Bu.

The International Conference on Learning Representations (ICLR), 2025. [arxiv] [code]

VG-TVP: Multimodal Procedural Planning via Visually Grounded Text-Video Prompting

Muhammet Furkan ILASLAN, Ali Koksal, Kevin Qinghong Lin, Burak Satar, Mike Zheng Shou, Qianli Xu

The 39th Annual AAAI Conference on Artificial Intelligence, 2025. [arxiv] [dataset]

Show-1: Marrying Pixel and Latent Diffusion Models for Text-to-Video Generation.

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, Mike Zheng Shou

International Journal of Computer Vision (IJCV), 2024. [arxiv] [code] [website]

2024

Apprenticeship-Inspired Elegance: Synergistic Knowledge Distillation Empowers Spiking Neural Networks for Efficient Single-Eye Emotion Recognition

Yang Wang, Haiyang Mei, Qirui Bao, Ziqi Wei, Mike Zheng Shou, Haizhou Li, Bo Dong and Xin Yang

Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI), 2024.

Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks

Zeyang Song, Jibin Wu, Malu Zhang, Mike Zheng Shou, Haizhou Li

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.

LOVA3: Learning to Visual Question Answering, Asking and Assessment.

Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code] [website]

VideoLLM-MoD: Efficient Video-Language Streaming with Mixture-of-Depths Vision Computation.

Shiwei Wu, Joya Chen, Kevin Qinghong Lin, Qimeng Wang, Yan Gao, Qianli Xu, Tong Xu, Yao Hu, Enhong Chen, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv]

One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos.

Zechen Bai, Tong He, Haiyang Mei, Pichao WANG, Ziteng Gao, Joya Chen, Lei Liu, Zheng Zhang, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code]

DoFIT: Domain-aware Federated Instruction Tuning with Alleviated Catastrophic Forgetting.

Binqian Xu, Xiangbo Shu, Haiyang Mei, Zechen Bai, Basura Fernando, Mike Zheng Shou, Jinhui Tang.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024.

Steganalysis on Digital Watermarking: Is Your Robustness a Maginot Line?

Pei Yang, Hai Ci, Yiren Song, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024.

Skinned Motion Retargeting with Dense Geometric Interaction Perception.

Zijie Ye, Jia-Wei Liu, Shikun Sun, Jia Jia, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code] [website]

Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning.

Jinpeng Wang, Linjie Li, Yiqi Lin, Min Li, Lijuan Wang, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code] [website]

Exocentric-to-Egocentric Video Generation.

Jia-Wei Liu, Weijia Mao, Zhongcong XU, Jussi Keppo, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024.

EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models.

Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code]

VideoGUI: A Benchmark for GUI Automation from Instructional Videos.

Kevin Qinghong Lin, Linjie Li, Difei Gao, Qinchen WU, Mingyi Yan, Zhengyuan Yang, Lijuan Wang, Mike Zheng Shou.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv] [code] [website] Spotlight

Visual Perception by Large Language Model’s Weights.

Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun.

Thirty-Eighth Conference on Neural Information Processing Systems (NeurIPS), 2024. [arxiv]

ProcessPainter: Learn Painting Process from Sequence Data.

Yiren Song, Shijie Huang, Chen Yao, Xiaojun Ye, Hai Ci, Jiaming Liu, Yuxuan Zhang, Mike Zheng Shou.

The 17th ACM SIGGRAPH Conference and Exhibition on Computer Graphics and Interactive Techniques in Asia (SIGGRAPH Asia), 2024. [arxiv] [code]

Parrot Captions Teach CLIP to Spot Text.

Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code] [website] Oral

MotionDirector: Motion Customization of Text-to-Video Diffusion Models.

Rui Zhao, Yuchao Gu, Jay Zhangjie Wu, David Junhao Zhang, Jia-Wei Liu, Weijia Wu, Jussi Keppo, Mike Zheng Shou.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code] [website] Oral

Learning Video Context as Interleaved Multimodal Sequences.

Kevin Qinghong Lin, Pengchuan Zhang, Difei Gao, Xide Xia, Joya Chen, Ziteng Gao, Jinheng Xie, Xuhong Xiao, Mike Zheng Shou.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code]

Genixer: Empowering Multimodal Large Language Model as a Powerful Data Generator.

Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code]

Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks.

David Junhao Zhang, Mutian Xu, Chuhui Xue, Wenqing Zhang, Xiaoguang Han, Song Bai, Mike Zheng Shou.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code]

Drag Anything: Motion Control for Anything using Entity Representation.

Weijia Wu, Zhuang Li, Yuchao Gu, Rui Zhao, Yefei He, David Junhao Zhang, Mike Zheng Shou, Yan Li, Tingting Gao, Zhang Di.

The 18th European Conference on Computer Vision (ECCV), 2024. [arxiv] [code] [website]

Bootstrapping SparseFormers from Vision Foundation Models.

Ziteng Gao, Zhan Tong, Kevin Qinghong Lin, Joya Chen, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

VideoLLM-online: Towards Large Video-Language Model for Streaming Video.

Joya Chen, Zhaoyang Lv, Shiwei Wu, Kevin Qinghong Lin, Chenan Song, Difei Gao, Jia-Wei Liu, Ziteng Gao, Dongxing Mao, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] [website]

Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives.

Grauman et al. [Singapore Site Author List: Joya Chen, Jia-Wei Liu, Xinzhu Fu, Chenan Song, Mike Zheng Shou.]

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] Oral

Adding Universal Compatibility of Plugins for Upgraded Diffusion Model.

Lingmin Ran, Xiaodong Cun, Jia-Wei Liu, Rui Zhao, Song Zijie, Xintao Wang, Jussi Keppo, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream.

Jingtao Sun, Yaonan Wang, Mingtao Feng, Yulan Guo, Ajmal Saeed Mian, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

Tune-An-Ellipse: CLIP Has Potential to Find What You Want.

Jinheng Xie, Songhe Deng, Bing Li, Haozhe Liu, Yawen Huang, Yefeng Zheng, Jürgen Schmidhuber, Bernard Ghanem, Linlin Shen, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code] Highlight, Acceptance Rate 2.8%

DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing.

Jia-Wei Liu, Yan-Pei Cao, Jay Zhangjie Wu, Weijia Mao, Yuchao Gu, Rui Zhao, Jussi Keppo, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis.

Yuchao Gu, Xintao Wang, Yixiao Ge, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence.

Yuchao Gu, Yipin Zhou, Bichen Wu, Licheng Yu, Jia-Wei Liu, Rui Zhao, Jay Zhangjie Wu, David Junhao Zhang, Mike Zheng Shou, Kevin Dechau Tang.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model.

Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Hanshu Yan, Jia-Wei Liu, Chenxu Zhang, Jiashi Feng, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

ViT-Lens: Towards Omni-modal Representations.

Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

AssistGUI: Task-Oriented Desktop Graphical User Interface Automation.

Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen WU, Weichen Zhang, WANG PEIYI, Xiangwu Guo, Hengxu Wang, Luowei Zhou, Mike Zheng Shou.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. [arxiv] [code]

SparseFormer: Sparse Visual Recognition via Limited Latent Tokens.

Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2024. [arxiv] [code]

What is the successor of Vision Transformer? We present an efficient paradigm with a better accuracy-throughput tradeoff on both image and video tasks.

2023

The Metaverse Data Deluge: What Can We Do About It?

Beng Chin Ooi, Gang Chen, Mike Zheng Shou, Kian-Lee Tan, Anthony Tung, Xiaokui Xiao, James Wei Luen Yip, Meihui Zhang.

The 39th IEEE International Conference on Data Engineering (ICDE), 2023.

STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition

Ming Li, Xiangyu Xu, Hehe Fan, Pan Zhou, Jun Liu, Jia-Wei Liu, Jiahe Li, Jussi Keppo, Mike Zheng Shou, Shuicheng Yan.

The IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

GazeVQA: A Video Question Answering Dataset for Multiview Eye-Gaze Task-Oriented Collaborations.

Muhammet Furkan ILASLAN, Chenan Song, Joya Chen, Difei Gao, Weixian Lei, Qianli Xu, Joo Hwee Lim, Mike Zheng Shou.

Empirical Methods in Natural Language Processing (EMNLP), 2023.

VisorGPT: Learning Visual Prior via Generative Pre-Training.

Jinheng Xie, Kai Ye, Yudong Li, Yuexiang Li, Kevin Qinghong Lin, Yefeng Zheng, Linlin Shen, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Mix-of-Show: Decentralized Low-Rank Adaptation for Multi-Concept Customization of Diffusion Models.

Yuchao Gu, Xintao Wang, Jay Zhangjie Wu, Yujun Shi, Yunpeng Chen, Zihan Fan, Wuyou Xiao, Rui Zhao, Shuning Chang, Weijia Wu, Yixiao Ge, Ying Shan, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

XAGen: 3D Expressive Human Avatars Generation.

Eric Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Jiashi Feng, Mike Zheng Shou.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

DatasetDM: Synthesizing Data with Perception Annotations Using Diffusion Models.

Weijia Wu, Yuzhong Zhao, Hao Chen, Yuchao Gu, Rui Zhao, Yefei He, Hong Zhou, Mike Zheng Shou, Chunhua Shen.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Object-centric Learning with Cyclic Walks between Parts and Whole.

Ziyu Wang, Mike Zheng Shou, Mengmi Zhang.

Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023. [arxiv]

Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation.

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arxiv]

The 1st open-source video diffusion model at https://github.com/showlab/Tune-A-Video

Revisiting Vision Transformer from the View of Path Ensemble.

Shuning Chang, Pichao Wang, Hao Luo, Fan Wang, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arxiv] Oral Presentation, Acceptance Rate 1.58%

BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion.

Jinheng Xie, Yuexiang Li, Yawen Huang, Haozhe Liu, Wentian Zhang, Yefeng Zheng, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arxiv]

Without training at all, condition image diffusion models with box or scribble control

UniVTG: Towards Unified Video-Language Temporal Grounding.

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Jinpeng Wang, Rui Yan, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arxiv]

The 1st work to unify the video temporal grounding task-specific labels: moment retrieval (interval), highlight detection (curve) and video summarization (point)

Too Large; Data Reduction for Vision-Language Pre-Training.

Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan Weixian Lei, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [arxiv]

TL;DR -- compress the existing large vision-language pre-training dataset into a small, high-quality set

HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video.

Jia-Wei Liu, Yan-Pei Cao, Tianyuan Yang, Eric Zhongcong Xu, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE/CVF International Conference on Computer Vision (ICCV), 2023. [project page]

From a single video input, reconstructs a high-fidelity dynamic human-object-scene representation, so that the input video can be paused at any time and viewed from any direction.

EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone.

Shraman Pramanick, Yale Song, Sayan Nag, Qinghong Lin, Hardik Shah, Mike Zheng Shou, Rama Chellappa, Pengchuan Zhang.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The new version of our EgoVLP for egocentric video-language pre-training

Label-Efficient Online Continual Object Detection in Streaming Video.

Jay Zhangjie Wu, David Junhao Zhang, Wynne Hsu, Mengmi Zhang, Mike Zheng Shou.

IEEE International Conference on Computer Vision (ICCV), 2023.

We examine a more realistic and challenging problem -- Label-Efficient Online Continual Object Detection in video streams

DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models.

Weijia Wu, Yuzhong Zhao, Mike Zheng Shou, Hong Zhou, Chunhua Shen.

IEEE International Conference on Computer Vision (ICCV), 2023. [arxiv]

The first work to leverage a pre-trained image diffusion model to boost open-vocabulary semantic segmentation.

Single-Stage Open-world Instance Segmentation with Cross-task Consistency Regularization.

Xue Xizhe, Dongdong Yu, Lingqiao Liu, Yu Liu, Ying Li, Zehuan Yuan, Ping Song, Mike Zheng Shou.

ACM Multimedia (ACM MM), 2023. [arxiv]

MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering.

Difei Gao, Luowei Zhou, Lei Ji, Linchao Zhu, Yi Yang, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

A video question answering system that can assist humans in daily activities by analyzing long-form videos with diverse and complex events.

All in One: Exploring Unified Video-Language Pre-training.

Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

The first unified video-language large-scale pretrained model, outperforming state-of-the-art methods on 9 datasets such as those for text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning.

Making Vision Transformers Efficient from A Token Sparsification View.

Shuning Chang, Pichao Wang, Ming Lin, Fan Wang, David Junhao Zhang, Rong Jin, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

Position-guided Text Prompt for Vision Language Pre-training.

Alex Jinpeng Wang, Pan Zhou, Mike Zheng Shou, Shuicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

Affordance Grounding from Demonstration Video to Target Image.

Joya Chen, Difei Gao, Kevin Qinghong Lin, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval.

Xudong Lin, Simran Tiwari, Shiyuan Huang, Manling Li, Mike Zheng Shou, Heng Ji, Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2023. [arxiv]

PV3D: A 3D Generative Model for Portrait Video Generation.

Eric Zhongcong Xu, Jianfeng Zhang, Jun Hao Liew, Wenqing Zhang, Song Bai, Jiashi Feng, Mike Zheng Shou.

The International Conference on Learning Representations (ICLR), 2023. [arxiv] [demo]

Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task.

Stan Weixian Lei, Difei Gao, Jay Zhangjie Wu, Yuxuan Wang, Wei Liu, Mengmi Zhang, Mike Zheng Shou.

The AAAI Conference on Artificial Intelligence (AAAI), 2023. [arxiv] Oral Presentation. The first continual learning work for VQA

2022

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant.

Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, Lingmin Ran, Mike Zheng Shou.

Findings of Empirical Methods in Natural Language Processing (EMNLP), 2022. [arxiv] Towards a personal AI assistant on AR glasses: language + vision + "point at"

Egocentric Video-Language Pretraining. (a.k.a. EgoVLP)

Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022. [arxiv] Spotlight, 1.7% acceptance rate. The first work to build a vision-language foundation model for egocentric video

DeVRF: Fast Deformable Voxel Radiance Fields for Dynamic Scenes.

Jia-Wei Liu, Yan-Pei Cao, Weijia Mao, Wenqiao Zhang, David Junhao Zhang, Jussi Keppo, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), 2022. [arxiv] Make NeRF 100x faster in modelling dynamic scenes

AssistQ: Affordance-centric Question-driven Task Completion for Egocentric Assistant.

Benita Wong, Joya Chen, You Wu, Stan Weixian Lei, Dongxing Mao, Difei Gao, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv] Towards a personal AI assistant on AR glasses: language + vision + "point at"

GEB+: A Benchmark for Generic Event Boundary Captioning, Grounding and Text-based Retrieval.

Yuxuan Wang, Difei Gao, Licheng Yu, Stan Weixian Lei, Matt Feiszli, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv]

MorphMLP: A Self-Attention Free, MLP-Like Backbone for Image and Video.

David Junhao Zhang, Kunchang Li, Yunpeng Chen, Yali Wang, Shashwat Chandra, Yu Qiao, Luoqi Liu, Mike Zheng Shou.

European Conference on Computer Vision (ECCV), 2022. [arxiv] [news 机器之心]

Ego4D: Around the World in 3,000 Hours of Egocentric Video.

Grauman et al. [ Singapore Site Author List: Eric Zhongcong Xu, Ruijie Tao, Yunyi Zhu, Haizhou Li, Mike Zheng Shou. ]

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv] Oral presentation, best paper finalist.

Object-aware Video-language Pre-training for Retrieval.

Alex Jinpeng Wang, Yixiao Ge, Guanyu Cai, Rui Yan, Xudong Lin, Ying Shan, Xiaohu Qie, Mike Zheng Shou.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv]

Unified Transformer Tracker for Object Tracking.

Fan Ma, Mike Zheng Shou, Linchao Zhu, Haoqi Fan, Yilei Xu, Yi Yang, Zhicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. [arxiv]

2021

Generic Event Boundary Detection: A Benchmark for Event Segmentation.

Mike Zheng Shou, Stan W. Lei, Deepti Ghadiyaram, Weiyao Wang, Matt Feiszli.

International Conference on Computer Vision (ICCV), 2021. [arxiv]

The first large-scale taxonomy-free event segmentation benchmark. A stepping stone to addressing long-form video understanding. We organised a workshop called LOVEU at CVPR'21 along with competitions built upon this dataset. The competitions attracted 20+ participants!

Channel Augmented Joint Learning for Visible-Infrared Recognition.

Mang Ye, Weijian Ruan, Bo Du, Mike Zheng Shou.

International Conference on Computer Vision (ICCV), 2021.

Is Someone Speaking? Exploring Long-term Temporal Features for Audio-visual Active Speaker Detection.

Ruijie Tao, Zexu Pan, Rohan Kumar Das, Xinyuan Qian, Mike Zheng Shou, Haizhou Li.

ACM Multimedia, 2021. Oral presentation. [AVA challenge report]

Leverages video + audio to detect active speakers. Secured 3rd place in the AVA challenge organised by Google Research at CVPR'21 ActivityNet.

On Pursuit of Designing Multi-modal Transformer for Video Grounding.

Meng Cao, Long Chen, Mike Zheng Shou, Can Zhang, Yuexian Zou.

Empirical Methods in Natural Language Processing (EMNLP), 2021. Oral presentation.

The first work to explore how to design a video-language Transformer for temporally grounding a textual query in long videos.

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization.

Junting Pan, Siyu Chen, Mike Zheng Shou, Yu Liu, Jing Shao, Hongsheng Li.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. [arxiv]

Localizes actions in space and time. Core technique behind the 1st-place entry in the AVA challenge organised by Google Research at CVPR'20 ActivityNet.

Earlier Publications

SF-Net: Single-Frame Supervision for Temporal Action Localization.

Fan Ma, Linchao Zhu, Yi Yang, Shengxin Zha, Gourab Kundu, Matt Feiszli, Zheng Shou.

European Conference on Computer Vision (ECCV), 2020. Spotlight, acceptance rate top 5%. [arxiv]

A new form of weak supervision, comparable results to its fully-supervised counterpart with much cheaper annotation cost.

DMC-Net: Generating Discriminative Motion Cues for Fast Compressed Video Action Recognition.

Zheng Shou, Xudong Lin, Yannis Kalantidis, Laura Sevilla-Lara, Marcus Rohrbach, Shih-Fu Chang, Zhicheng Yan.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019. [arxiv]

A video model that learns discriminative motion cues directly from compressed video - fast & accurate.

AutoLoc: Weakly-supervised Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Hang Gao, Lei Zhang, Kazuyuki Miyazawa, Shih-Fu Chang.

European Conference on Computer Vision (ECCV), 2018. [arxiv]

Online Detection of Action Start in Untrimmed, Streaming Videos.

Zheng Shou*, Junting Pan*, Jonathan Chan, Kazuyuki Miyazawa, Hassan Mansour, Anthony Vetro, Xavi Giró-i-Nieto, Shih-Fu Chang.

European Conference on Computer Vision (ECCV), 2018. [arxiv]

CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos.

Zheng Shou, Jonathan Chan, Alireza Zareian, Kazuyuki Miyazawa, Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [arxiv] Oral presentation, acceptance rate 2.6%, best student paper nomination.

ConvNet Architecture Search for Spatiotemporal Feature Learning.

Du Tran, Jamie Ray, Zheng Shou, Shih-Fu Chang, Manohar Paluri.

Technical Report, 2017. [arxiv] [github]

An open-source Res3D video backbone model that can support many video applications.

Single Shot Temporal Action Detection.

Tianwei Lin, Xu Zhao, Zheng Shou.

ACM Multimedia, 2017. [paper] [challenge report]

Won first place in both the Temporal Action Proposal track and the Temporal Action Localization track at the ActivityNet Challenge.

Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs.

Zheng Shou, Dongang Wang, and Shih-Fu Chang.

IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. [arxiv]

A pioneering work that proposes the first deep learning framework for temporal action localization in video.
