Cultivate a supportive community where questions, ideas, and discussions are always welcome.
Share resources and opportunities to help members grow as researchers.
Organize computer vision events, sessions & workshops to inspire and educate.
All activity happens in the #computer-vision channel on Discord.
Stay tuned for announcements, discussion threads, and event invites.
Building an in-stealth startup in AI/ML research and enterprise services, whose mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs on multimodal data such as audio, text, and video with high throughput and low latency 💡 🧮
Past Presentations
Jianwei Yang - Principal Researcher at Microsoft Research
Jianwei's research focuses on the development of multimodal AI agents 🤖, a crucial step toward creating systems that can understand, reason, and interact with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations 👁️👂🧠 but also act adaptively to achieve goals within their environments 🎯
Jianwei explores this grand goal across three key dimensions. First, he investigates how to bridge the gap between core vision understanding and multimodal learning 🌉 through unified frameworks at various granularities. Next, he discusses connecting vision-language models with large language models (LLMs) 🗣️ to create intelligent conversational systems. Finally, Jianwei delves into recent advancements that extend multimodal LLMs into vision-language-action models 🦾, forming the foundation for general-purpose robotics policies.
Together, these endeavors lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital 💻 and physical 🌍 worlds.
Paper 📜, Code & Hugging Face 🤗 Models @ https://microsoft.github.io/Magma/
Leonard Bauersfeld - PhD student at Robotics & Perception Group
FPV drone racing is a sport where pilots navigate high-speed drones via onboard cameras.
In this session, we look at Swift, an AI system trained with deep reinforcement learning that raced against and defeated human world champions.
Madhuri Nagare - Camera Algorithm Engineer at Apple
Learn how the Texture Matching GAN (TMGAN) enhances image fidelity by denoising and sharpening while separating anatomy from texture with a Siamese network, reducing hallucination risks in AI applications.
Paper 📜 @ https://arxiv.org/abs/2312.13422
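The sketch below illustrates the Siamese texture-matching idea described above: a shared encoder embeds the denoised output and a reference patch with the desired texture, and the loss pulls the two embeddings together. The encoder architecture, embedding size, and loss form are illustrative assumptions, not the authors' TMGAN implementation.

```python
# Minimal sketch of a Siamese texture-matching loss in the spirit of TMGAN.
# The encoder and loss below are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureEncoder(nn.Module):
    """Small shared CNN that maps an image patch to a texture embedding."""
    def __init__(self, channels: int = 1, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling discards anatomical layout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x).flatten(1), dim=1)

def texture_matching_loss(encoder: TextureEncoder,
                          denoised: torch.Tensor,
                          reference: torch.Tensor) -> torch.Tensor:
    """Both branches share weights (Siamese): pull the texture embedding of the
    denoised patch toward that of a reference patch with the desired texture."""
    z_out, z_ref = encoder(denoised), encoder(reference)
    return 1.0 - F.cosine_similarity(z_out, z_ref, dim=1).mean()

# Example usage with random single-channel 64x64 patches (batch of 8).
enc = TextureEncoder()
loss = texture_matching_loss(enc, torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64))
```

Because the global pooling collapses spatial layout, the embedding is sensitive to texture statistics rather than anatomy, which is one plausible way to penalize hallucinated structure without constraining content.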
Mike Walmsley - Astrophysicist at the University of Toronto
In a presentation, Mike Walmsley framed astronomy as a fundamental computer vision problem centered on interpreting vast quantities of 2D image data to understand the universe. He identified the core challenge as a data bottleneck, noting that while expert annotation is the gold standard for training models, it is an unscalable process.
Walmsley's research cleverly utilizes the Zooniverse "Galaxy Zoo" project, a massive dataset labeled by citizen scientists, to train supervised foundation models. His investigation into neural scaling laws revealed a crucial insight for this domain: performance is currently more limited by the volume of available labeled data than by model size or architecture, underscoring the importance of data-centric AI.
By leveraging the learned latent space, his models enable powerful applications such as similarity searches and personalized anomaly detection. This system effectively creates a recommendation engine for astronomers to discover novel or scientifically interesting galaxies. As Walmsley pointed out, the future of the field lies in moving towards self-supervised and multimodal approaches, like AstroCLIP, to fully exploit the enormous, unlabeled datasets from next-generation telescopes.
Paper 📜 @ https://arxiv.org/abs/2404.02973
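A minimal sketch of the latent-space similarity search described above: given precomputed galaxy embeddings (for example from a model trained on Galaxy Zoo labels), nearest neighbours under cosine similarity act as a "galaxies like this one" recommendation. The array names and shapes are assumptions for illustration.

```python
# Minimal sketch of latent-space similarity search over galaxy embeddings.
# Embeddings are assumed to be precomputed; names and shapes are illustrative.
import numpy as np

def most_similar(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery embeddings closest to the query (cosine)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # cosine similarity to every galaxy
    return np.argsort(-scores)[:k]      # highest similarity first

# Example: 10,000 galaxies embedded in a 256-d latent space.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10_000, 256))
query = gallery[42]                     # "find galaxies that look like this one"
print(most_similar(query, gallery, k=5))
```

The same machinery supports personalized anomaly detection: galaxies whose best similarity to everything an astronomer has already flagged as ordinary stays low are surfaced as candidates worth a look.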
Aleksander Shtedritski - PhD Student at Oxford University
A groundbreaking framework for unsupervised pose estimation, dubbed SHIC (Shape-Image Correspondences), has been introduced to address a significant bottleneck in the field. Previously, the reliance on data-hungry supervised methods made dense pose estimation for non-human subjects, such as those in the animal kingdom, prohibitively expensive and impractical.
The core insight of the research is the use of powerful latent features from foundation models—specifically a fusion of DINO and Stable Diffusion features called SD-DINO—to establish initial, coarse correspondences. The framework reframes the difficult image-to-3D problem into a more tractable image-to-image matching task by rendering a single 3D template mesh from multiple viewpoints.
These zero-shot matches act as pseudo-ground truth, enabling knowledge distillation into a lightweight, feed-forward model. A key innovation lies in the refinement process, which incorporates strong geometric priors like a cycle consistency loss and an equivariance constraint for object symmetries. This technique refines the noisy initial matches to produce a smooth, accurate pixel-to-vertex mapping.
The results are highly significant. Trained on as few as 200 raw images and a template mesh, SHIC operates without any keypoint supervision and demonstrates performance that surpasses many fully supervised, state-of-the-art methods. This work effectively democratizes dense pose estimation for any deformable object, marking a major leap forward for robotics, animation, and computational biology.
Paper 📜, Demo & Code @ https://www.robots.ox.ac.uk/~vgg/research/shic/
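The snippet below sketches the cycle-consistency idea in isolation: match each image pixel to its nearest template vertex in feature space, match that vertex back to its nearest pixel, and penalize how far the round trip drifts. The features (e.g. SD-DINO) are assumed to be precomputed, and this hard nearest-neighbour version is an illustration of the constraint, not the authors' SHIC implementation, which would use a soft, differentiable assignment during training.

```python
# Minimal sketch of a cycle-consistency check for image-to-template matching.
# Features are assumed precomputed; hard argmax is used only for illustration.
import torch

def cycle_consistency_loss(img_feats: torch.Tensor,   # (P, D) one feature per pixel
                           tpl_feats: torch.Tensor,   # (V, D) one feature per vertex
                           pix_xy: torch.Tensor       # (P, 2) pixel coordinates
                           ) -> torch.Tensor:
    """Pixel -> nearest template vertex -> nearest pixel; penalize round-trip drift."""
    sim = torch.nn.functional.normalize(img_feats, dim=1) @ \
          torch.nn.functional.normalize(tpl_feats, dim=1).T   # (P, V) cosine similarity
    fwd = sim.argmax(dim=1)          # each pixel's best vertex
    back = sim.argmax(dim=0)         # each vertex's best pixel
    roundtrip = back[fwd]            # pixel -> vertex -> pixel
    return (pix_xy - pix_xy[roundtrip]).norm(dim=1).mean()

# Example with random features for a 32x32 image and a 500-vertex template.
P, V, D = 32 * 32, 500, 64
img_feats, tpl_feats = torch.randn(P, D), torch.randn(V, D)
ys, xs = torch.meshgrid(torch.arange(32), torch.arange(32), indexing="ij")
pix_xy = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()
print(cycle_consistency_loss(img_feats, tpl_feats, pix_xy))
```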
Nico Messikommer - M.Sc. in Robotics, Systems and Control from ETH Zurich
An AI-piloted drone, Swift, has outpaced human world champions in a head-to-head race, and pushing agile flight even further motivates a paradigm shift in computer vision: event-based cameras. Standard frame-based cameras are crippled by motion blur and fail in high dynamic range (HDR) or low-light scenarios, critical limitations for agile robotics.
The research introduces a robust, data-driven feature tracking method that synergistically combines standard frames with the high-temporal-resolution, asynchronous data from neuromorphic event cameras. The method leverages a novel Frame Attention Module to enforce spatiotemporal consistency across feature tracks, a crucial step often overlooked. This architecture allows the tracker to maintain stable tracks even during the "blind time" between frames, achieving 2-3x longer track durations than current SOTA methods.
Crucially, the framework demonstrates exceptional robustness against severe motion blur and low-light conditions where traditional methods fail. It is also adaptable to an "events-only" configuration, outperforming existing baselines in that domain. This work paves the way for a new generation of high-speed, low-latency autonomous navigation systems capable of operating in the most challenging real-world environments. The source code is open for the community to build upon.
Paper 📜 @ https://arxiv.org/abs/2211.12826
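To make the role of frame-level attention concrete, here is a minimal sketch of the underlying idea: feature tracks within the same frame attend to one another so their displacement updates stay mutually consistent. The dimensions, the linear head, and the overall structure are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a frame-level attention module: tracks in one frame exchange
# information via self-attention before predicting per-track displacement updates.
# Structure and dimensions are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)   # per-track displacement update (dx, dy)

    def forward(self, track_feats: torch.Tensor) -> torch.Tensor:
        """track_feats: (B, N, D) features of the N active tracks in this frame."""
        shared, _ = self.attn(track_feats, track_feats, track_feats)
        return self.head(shared)        # (B, N, 2) refined displacements

# Example: 64 active tracks with 128-d features in a single frame.
module = FrameAttention()
updates = module(torch.randn(1, 64, 128))
print(updates.shape)  # torch.Size([1, 64, 2])
```

Sharing information across tracks is what lets a tracker keep weakly observed features consistent with their neighbours during the blind time between frames.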
Apoorv Khandelwal - Analyzing Modular Approaches for Visual Question Decomposition
Lindsey Li presents Multimodal Understanding with Large Language Models.
Maxim Bonnaerens presents Learned Threshold Token Merging & Pruning for Vision Transformers.
Jing Yu Koh presents Generating Images with Multimodal LMs.
Ahmed Imtiaz Humayun discusses their work on SplineCam
Muhammad Maaz shares their work on Video-ChatGPT
Hila Chefer presents their work on explainable Vision Transformer networks
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Edwin (@sora) presents his work on fine-grained recognition.