📫
honglu [dot] zhou [at] rutgers [dot] edu


Biography

I am a Research Scientist at Salesforce AI Research, Palo Alto, CA, USA. 

Previously, I was a Researcher in the Machine Learning Department at NEC Laboratories America, Inc.

I earned my Ph.D. in Computer Science from Rutgers University in 2023, where I was supervised by Professor Mubbasir Kapadia. I received a Bachelor of Engineering in Computer Science and Technology in 2017 and a Bachelor of Arts in TV Editing and Directing (Post-production of Television) in 2016, both from Communication University of China.

I did internships at Salesforce AI Research (Mentors: Juan Carlos Niebles and Roberto Martín-Martín), NEC Labs (Mentors: Asim Kadav and Farley Lai), and Google YouTube (Mentors: Wei-Hong Chuang and Hassan Akbari), where I had the opportunity to collaborate closely with DeepMind and Google Research.

Research Interests

📣 We will be hosting the Multimodal Algorithmic Reasoning Workshop @ CVPR 2025!

News

I’ll be at ICLR 2025 in Singapore this April. Let me know if you’d like to connect or catch up!

January 2025    BLIP-3-Video's HuggingFace checkpoints are available now (128 tokens, 32 tokens).

January 2025    Well, it doesn't make sense for VideoLLMs to rely solely on an image encoder for initial feature extraction. Check out the amazing Twitter thread from Tyler Zhu explaining our latest work, MERV (a VideoLLM with a multi-encoder representation of videos), and the new capabilities it has gained. 🥳

October 2024    We released xGen-MM-Vid (BLIP-3-Video): An Efficient Video-Language Model. Visit our website! 👈

September 2024    Our paper has been accepted to WACV 2025. Check out the paper: Domain-Guided Weight Modulation for Semi-Supervised Domain Generalization.

August 2024    We released xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations.

August 2024    We released xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. Visit our project page! 👈

August 2024    Dive into our new survey on video self-supervised learning. Explore our organized list of related works and be sure to star our repository. Your contributions are highly valued and greatly appreciated! 🤝 

July 2024        I am an invited panelist at Stanford AI4ALL's Career Panel.

July 2024         We will organize a new edition of the Multimodal Algorithmic Reasoning Workshop at NeurIPS 2024.

June 2024       I am an organizer for the CVPR 2024 Multimodal Algorithmic Reasoning Workshop.

Selected Projects

[arXiv 2025] MERV: Unifying Specialized Visual Encoders for Video Language Models

[arXiv 2024] xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs

[ECCV 2024 AI4VA] xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations

[arXiv 2024] xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

[CVPR 2024] Learning from Synthetic Human Group Activities

[CVPR 2024] Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos

[CVPR 2023] Procedure-Aware Pretraining for Instructional Video Understanding

[ECCV 2022] COMPOSER: Compositional Reasoning of Group Activity in Videos with Keypoint-Only Modality

[CIKM 2022 Best Paper Award] D-HYPR: Harnessing Neighborhood Modeling and Asymmetry Preservation for Digraph Representation Learning

[ICLR 2021] Hopper: Multi-hop Transformer for Spatiotemporal Reasoning

Love nature! Enjoy hiking, traveling, going to the beach or park, and spending time with family and friends.

Coffee, friends, and my sweet dog and cat are my sources of happiness.

Last Update: Feb 1, 2025