Cultivate a supportive community where questions, ideas, and discussions are always welcome.
Share resources and opportunities to help members grow as researchers.
Organize computer vision events, sessions & workshops to inspire and educate.
All activity happens in the #computer-vision channel on Discord.
Stay tuned for announcements, discussion threads, and event invites.
Building an in-stealth startup in AI/ML research and enterprise services, whose mission is to build the next generation of AI: ubiquitous, interactive intelligence that runs on multimodal data such as audio, text, and video with high throughput and low latency 💡 🧮
Past Presentations
Jianwei Yang - Principal Researcher at Microsoft Research
Jianwei's research focuses on the development of multimodal AI agents 🤖, a crucial step toward creating systems that can understand, reason, and interact with the world in human-like ways. Building such agents requires models that not only comprehend multi-sensory observations 👁️👂🧠 but also act adaptively to achieve goals within their environments 🎯
Jianwei explores this grand goal across three key dimensions. First, he investigates how to bridge the gap between core vision understanding and multimodal learning 🌉 through unified frameworks at various granularities. Next, he discusses connecting vision-language models with large language models (LLMs) 🗣️ to create intelligent conversational systems. Finally, Jianwei delves into recent advancements that extend multimodal LLMs into vision-language-action models 🦾, forming the foundation for general-purpose robotics policies.
Together, these endeavors lead to an aspiration of building the next generation of multimodal AI agents capable of seeing, talking, and acting across diverse scenarios in both digital 💻 and physical 🌍 worlds.
Paper 📜, Code & Hugging Face 🤗 Models @ https://microsoft.github.io/Magma/
Leonard Bauersfeld - PhD student at Robotics & Perception Group
FPV drone racing is a sport where pilots navigate high-speed drones via onboard cameras.
In this session, we look at Swift, an AI system trained with deep reinforcement learning that raced against and defeated human world champions.
Madhuri Nagare - Camera Algorithm Engineer at Apple
Learn how the Texture Matching GAN (TMGAN) enhances image fidelity by denoising and sharpening while separating anatomy from texture with a Siamese network, reducing hallucination risks in AI applications.
Paper 📜 @ https://arxiv.org/abs/2312.13422
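The sketch below illustrates the Siamese texture-matching idea described above: a shared encoder embeds the denoised output and a reference patch with the desired texture, and the loss pulls the two embeddings together. The encoder architecture, embedding size, and loss form are illustrative assumptions, not the authors' TMGAN implementation.

```python
# Minimal sketch of a Siamese texture-matching loss in the spirit of TMGAN.
# The encoder and loss below are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextureEncoder(nn.Module):
    """Small shared CNN that maps an image patch to a texture embedding."""
    def __init__(self, channels: int = 1, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # global pooling discards anatomical layout
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.net(x).flatten(1), dim=1)

def texture_matching_loss(encoder: TextureEncoder,
                          denoised: torch.Tensor,
                          reference: torch.Tensor) -> torch.Tensor:
    """Both branches share weights (Siamese): pull the texture embedding of the
    denoised patch toward that of a reference patch with the desired texture."""
    z_out, z_ref = encoder(denoised), encoder(reference)
    return 1.0 - F.cosine_similarity(z_out, z_ref, dim=1).mean()

# Example usage with random single-channel 64x64 patches (batch of 8).
enc = TextureEncoder()
loss = texture_matching_loss(enc, torch.rand(8, 1, 64, 64), torch.rand(8, 1, 64, 64))
```

Because the global pooling collapses spatial layout, the embedding is sensitive to texture statistics rather than anatomy, which is one plausible way to penalize hallucinated structure without constraining content.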
Mike Walmsley - Astrophysicist at the University of Toronto
In a presentation, Mike Walmsley framed astronomy as a fundamental computer vision problem centered on interpreting vast quantities of 2D image data to understand the universe. He identified the core challenge as a data bottleneck, noting that while expert annotation is the gold standard for training models, it is an unscalable process.
Walmsley's research cleverly utilizes the Zooniverse "Galaxy Zoo" project, a massive dataset labeled by citizen scientists, to train supervised foundation models. His investigation into neural scaling laws revealed a crucial insight for this domain: performance is currently more limited by the volume of available labeled data than by model size or architecture, underscoring the importance of data-centric AI.
By leveraging the learned latent space, his models enable powerful applications such as similarity searches and personalized anomaly detection. This system effectively creates a recommendation engine for astronomers to discover novel or scientifically interesting galaxies. As Walmsley pointed out, the future of the field lies in moving towards self-supervised and multimodal approaches, like AstroCLIP, to fully exploit the enormous, unlabeled datasets from next-generation telescopes.
Paper 📜 @ https://arxiv.org/abs/2404.02973
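A minimal sketch of the latent-space similarity search described above: given precomputed galaxy embeddings (for example from a model trained on Galaxy Zoo labels), nearest neighbours under cosine similarity act as a "galaxies like this one" recommendation. The array names and shapes are assumptions for illustration.

```python
# Minimal sketch of latent-space similarity search over galaxy embeddings.
# Embeddings are assumed to be precomputed; names and shapes are illustrative.
import numpy as np

def most_similar(query: np.ndarray, gallery: np.ndarray, k: int = 5) -> np.ndarray:
    """Return indices of the k gallery embeddings closest to the query (cosine)."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                      # cosine similarity to every galaxy
    return np.argsort(-scores)[:k]      # highest similarity first

# Example: 10,000 galaxies embedded in a 256-d latent space.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(10_000, 256))
query = gallery[42]                     # "find galaxies that look like this one"
print(most_similar(query, gallery, k=5))
```

The same machinery supports personalized anomaly detection: galaxies whose best similarity to everything an astronomer has already flagged as ordinary stays low are surfaced as candidates worth a look.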
Aleksander Shtedritski - PhD Student at Oxford University
A groundbreaking framework for unsupervised pose estimation, dubbed SHIC (Shape-Image Correspondences), has been introduced to address a significant bottleneck in the field. Previously, the reliance on data-hungry supervised methods made dense pose estimation for non-human subjects, such as those in the animal kingdom, prohibitively expensive and impractical.
The core insight of the research is the use of powerful latent features from foundation models—specifically a fusion of DINO and Stable Diffusion features called SD-DINO—to establish initial, coarse correspondences. The framework reframes the difficult image-to-3D problem into a more tractable image-to-image matching task by rendering a single 3D template mesh from multiple viewpoints.
These zero-shot matches act as pseudo-ground truth, enabling knowledge distillation into a lightweight, feed-forward model. A key innovation lies in the refinement process, which incorporates strong geometric priors like a cycle consistency loss and an equivariance constraint for object symmetries. This technique refines the noisy initial matches to produce a smooth, accurate pixel-to-vertex mapping.
The results are highly significant. Trained on as few as 200 raw images and a template mesh, SHIC operates without any keypoint supervision and demonstrates performance that surpasses many fully supervised, state-of-the-art methods. This work effectively democratizes dense pose estimation for any deformable object, marking a major leap forward for robotics, animation, and computational biology.
Paper 📜, Demo & Code @ https://www.robots.ox.ac.uk/~vgg/research/shic/
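The snippet below sketches the cycle-consistency idea in isolation: match each image pixel to its nearest template vertex in feature space, match that vertex back to its nearest pixel, and penalize how far the round trip drifts. The features (e.g. SD-DINO) are assumed to be precomputed, and this hard nearest-neighbour version is an illustration of the constraint, not the authors' SHIC implementation, which would use a soft, differentiable assignment during training.

```python
# Minimal sketch of a cycle-consistency check for image-to-template matching.
# Features are assumed precomputed; hard argmax is used only for illustration.
import torch

def cycle_consistency_loss(img_feats: torch.Tensor,   # (P, D) one feature per pixel
                           tpl_feats: torch.Tensor,   # (V, D) one feature per vertex
                           pix_xy: torch.Tensor       # (P, 2) pixel coordinates
                           ) -> torch.Tensor:
    """Pixel -> nearest template vertex -> nearest pixel; penalize round-trip drift."""
    sim = torch.nn.functional.normalize(img_feats, dim=1) @ \
          torch.nn.functional.normalize(tpl_feats, dim=1).T   # (P, V) cosine similarity
    fwd = sim.argmax(dim=1)          # each pixel's best vertex
    back = sim.argmax(dim=0)         # each vertex's best pixel
    roundtrip = back[fwd]            # pixel -> vertex -> pixel
    return (pix_xy - pix_xy[roundtrip]).norm(dim=1).mean()

# Example with random features for a 32x32 image and a 500-vertex template.
P, V, D = 32 * 32, 500, 64
img_feats, tpl_feats = torch.randn(P, D), torch.randn(V, D)
ys, xs = torch.meshgrid(torch.arange(32), torch.arange(32), indexing="ij")
pix_xy = torch.stack([xs.flatten(), ys.flatten()], dim=1).float()
print(cycle_consistency_loss(img_feats, tpl_feats, pix_xy))
```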
Nico Messikommer - M.Sc. in Robotics, Systems and Control from ETH Zurich
An AI-piloted drone, Swift, has outpaced human world champions in a head-to-head race, and pushing agile flight even further motivates a paradigm shift in computer vision: event-based cameras. Standard frame-based cameras are crippled by motion blur and fail in high dynamic range (HDR) or low-light scenarios, critical limitations for agile robotics.
The research introduces a robust, data-driven feature tracking method that synergistically combines standard frames with the high-temporal-resolution, asynchronous data from neuromorphic event cameras. The method leverages a novel Frame Attention Module to enforce spatiotemporal consistency across feature tracks, a crucial step often overlooked. This architecture allows the tracker to maintain stable tracks even during the "blind time" between frames, achieving 2-3x longer track durations than current SOTA methods.
Crucially, the framework demonstrates exceptional robustness against severe motion blur and low-light conditions where traditional methods fail. It is also adaptable to an "events-only" configuration, outperforming existing baselines in that domain. This work paves the way for a new generation of high-speed, low-latency autonomous navigation systems capable of operating in the most challenging real-world environments. The source code is open for the community to build upon.
Paper 📜 @ https://arxiv.org/abs/2211.12826
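To make the role of frame-level attention concrete, here is a minimal sketch of the underlying idea: feature tracks within the same frame attend to one another so their displacement updates stay mutually consistent. The dimensions, the linear head, and the overall structure are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch of a frame-level attention module: tracks in one frame exchange
# information via self-attention before predicting per-track displacement updates.
# Structure and dimensions are illustrative assumptions, not the paper's design.
import torch
import torch.nn as nn

class FrameAttention(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, 2)   # per-track displacement update (dx, dy)

    def forward(self, track_feats: torch.Tensor) -> torch.Tensor:
        """track_feats: (B, N, D) features of the N active tracks in this frame."""
        shared, _ = self.attn(track_feats, track_feats, track_feats)
        return self.head(shared)        # (B, N, 2) refined displacements

# Example: 64 active tracks with 128-d features in a single frame.
module = FrameAttention()
updates = module(torch.randn(1, 64, 128))
print(updates.shape)  # torch.Size([1, 64, 2])
```

Sharing information across tracks is what lets a tracker keep weakly observed features consistent with their neighbours during the blind time between frames.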
Apoorv Khandelwal - Analyzing Modular Approaches for Visual Question Decomposition
Lindsey Li presents Multimodal Understanding with Large Language Models.
Maxim Bonnaerens presents Learned Threshold Token Merging & Pruning for Vision Transformers.
Jing Yu Koh presents Generating Images with Multimodal LMs.
Ahmed Imtiaz Humayun discusses their work on SplineCam
Muhammad Maaz shares their work on Video-ChatGPT
Hila Chefer presents their work on explainable Vision Transformer networks
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Edwin (@sora) presents his work on fine-grained recognition.