Featured & Latest 🌱🌷🍃

💎 FoFPred (Future Optical Flow Prediction): Predicting future optical flow turns out to be surprisingly powerful; it boosts both robot control 🤖 and video generation 🎥. I'm thrilled to share the release of 𝗙𝗼𝗙𝗣𝗿𝗲𝗱 (paper/code/checkpoint/interactive demo) 👉 𝗧𝗿𝘆 𝗶𝘁 𝘆𝗼𝘂𝗿𝘀𝗲𝗹𝗳! 👏 Huge shoutout to Kanchana Ranasinghe! See the post & video on LinkedIn / X. 😉

💎 Robotic VLA: VLAs can't just mimic expert trajectories — they need 𝗽𝗿𝗲𝗱𝗶𝗰𝘁𝗶𝘃𝗲 𝗺𝗼𝘁𝗶𝗼𝗻 𝗿𝗲𝗮𝘀𝗼𝗻𝗶𝗻𝗴. Our new work shows that jointly learning motion prediction via optical flow image diffusion gives 𝗥𝗼𝗯𝗼𝘁𝗶𝗰 𝗩𝗟𝗔𝘀 a superior ability to reason about what actions to take. The result: stronger, more reliable manipulation, with a 23% improvement in real-world performance. Great job, Yu! 👍
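The joint-training idea above can be sketched as a two-term objective: an action imitation loss plus an auxiliary flow-diffusion denoising loss. This is a minimal illustration, not the paper's actual implementation — the function and variable names (`joint_vla_loss`, `flow_weight`, the noise-prediction formulation) are assumptions for the sketch.

```python
import numpy as np

def joint_vla_loss(pred_actions, expert_actions,
                   pred_flow_noise, true_flow_noise, flow_weight=0.5):
    """Hypothetical joint objective: action imitation + auxiliary
    optical-flow diffusion loss (names illustrative, not the paper's API)."""
    # Standard behavior-cloning term on expert action trajectories
    action_loss = np.mean((pred_actions - expert_actions) ** 2)
    # Diffusion models are commonly trained to predict the injected noise;
    # here the model denoises future optical-flow images
    flow_loss = np.mean((pred_flow_noise - true_flow_noise) ** 2)
    return action_loss + flow_weight * flow_loss

# Toy tensors standing in for model outputs and targets
rng = np.random.default_rng(0)
a_pred, a_expert = rng.normal(size=(8, 7)), rng.normal(size=(8, 7))
n_pred, n_true = rng.normal(size=(8, 2, 64, 64)), rng.normal(size=(8, 2, 64, 64))
loss = joint_vla_loss(a_pred, a_expert, n_pred, n_true)
```

The point of the auxiliary term is that gradients from flow prediction shape the same representation the policy head uses, so the VLA is pushed to model *how the scene will move*, not just which action the expert took.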

💎 Video agent: The old passive video-perception setup just doesn't make sense anymore. Grabbing all visual info once, with fixed granularity and no query awareness, is inefficient and overloads the model. So we built Active Video Perception (AVP) — an agentic, evidence-seeking framework that treats a video as an interactive environment to be actively explored in a goal-directed manner. Check out my LinkedIn post. Excellent work, Ziyang!
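The contrast with passive perception can be made concrete with a toy loop: instead of ingesting every frame up front, the agent repeatedly picks the frame most relevant to the query, reads it, and accumulates evidence under a budget. This is a hypothetical sketch of the general idea only — `score_frame`, `read_frame`, and the greedy selection rule are illustrative assumptions, not the actual AVP framework.

```python
def active_video_perception(query, n_frames, score_frame, read_frame, budget=5):
    """Toy evidence-seeking loop: greedily inspect the unseen frame that a
    query-aware scorer ranks highest, until the inspection budget is spent."""
    evidence, seen = [], set()
    for _ in range(budget):
        candidates = [i for i in range(n_frames) if i not in seen]
        if not candidates:
            break
        # Query-aware selection: score only frames not yet inspected
        best = max(candidates, key=lambda i: score_frame(query, i))
        seen.add(best)
        evidence.append(read_frame(best))
    return evidence

# Toy usage: a scorer that thinks the answer lives near frame 12
evidence = active_video_perception(
    "find the red ball", n_frames=30,
    score_frame=lambda q, i: -abs(i - 12),  # illustrative relevance scorer
    read_frame=lambda i: f"frame-{i}",
)
```

Only 5 of the 30 frames are ever read, and which 5 depends entirely on the query — that query-conditioned, variable-granularity access is what the passive grab-everything setup cannot do.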