I’ll be at ICCV 2025 in Honolulu, Hawai'i this October. Let me know if you’d like to connect or catch up!
I did internships at Salesforce AI Research (Mentors: Juan Carlos Niebles and Roberto Martín-Martín), NEC Labs (Mentors: Asim Kadav and Farley Lai), and Google YouTube (Mentors: Wei-Hong Chuang and Hassan Akbari), where I had the opportunity to collaborate closely with DeepMind and Google Research.
Multimodal and Generative AI
Video Understanding
Embodiment and Robotics
Open to academic collaboration — contact me if you're interested.
September 2025 🚀🚀🚀 Introducing Strefer—our data engine for Video LLMs! 🎬🧩⏳ It empowers Video LLMs to better interpret specific space-time locations and respond to user queries containing space-time references—whether expressed through text, clicks, or gestures. Here are the videos (10-min version, 3-min version), my post on LinkedIn, and my walkthrough on X about this work.
August 2025 Contra4 is accepted by EMNLP 2025. Cheers! 🥂 Our post on LinkedIn. 🏆 Leaderboard & Dataset are available now. Can your multimodal models reason across modalities, contrastively?
August 2025 We will organize a new edition of the Multimodal Algorithmic Reasoning Workshop at NeurIPS 2025.
July 2025 I am an invited panelist at Stanford AI4ALL's Career Panel. Thank you for inviting me again!
June 2025 I am an organizer for the CVPR 2025 Multimodal Algorithmic Reasoning Workshop.
February 2025 ViUniT: Visual Unit Tests for More Robust Visual Programming is accepted by CVPR 2025. Congratulations, Artemis 👍👍👍! Here's my walkthrough. Also, check out this amazing blog post from Juan Carlos (Super cool stuff!!!!).
January 2025 Well, it doesn't make sense for Video LLMs to rely solely on one image encoder for initial feature extraction. Check out the amazing Twitter thread from Tyler Zhu explaining our latest work accepted by ICML 2025 – MERV, a Video LLM with a multi-encoder representation of videos – and the new capabilities it has gained. 🥳
January 2025 BLIP-3-Video's HuggingFace checkpoints are available now (128 tokens, 32 tokens).
October 2024 We released xGen-MM-Vid (BLIP-3-Video): An Efficient Video-Language Model. Visit our website! 👈
September 2024 Our paper, Domain-Guided Weight Modulation for Semi-Supervised Domain Generalization, is accepted by WACV 2025.
August 2024 We released xGen-VideoSyn-1: High-fidelity Text-to-Video Synthesis with Compressed Representations.
August 2024 We released xGen-MM (BLIP-3): A Family of Open Large Multimodal Models. Visit our project page! 👈
August 2024 Dive into our new survey on video self-supervised learning. Explore our organized list of related works and be sure to star our repository. Your contributions are highly valued and greatly appreciated! 🤝
July 2024 I am an invited panelist at Stanford AI4ALL's Career Panel.
July 2024 We will organize a new edition of the Multimodal Algorithmic Reasoning Workshop at NeurIPS 2024.
June 2024 I am an organizer for the CVPR 2024 Multimodal Algorithmic Reasoning Workshop.
More about me & Fun stuff...
Coffee, nature, family and friends, cuddles with my dog and cat — my sources of happiness.