I build systems that work in the real world. I am driven by finding out how things work, figuring out why things don't work, and working to push the boundary of what's currently possible. I am fascinated by perception and intelligence; how we see, hear, smell, taste, touch, think, learn, act, and generate an understanding of the world around us. My research interests span computer vision, machine learning, robotics & AI, and I love building interactive applications that bring these fields to life.
Bio
2021-2025
Founding member and VP of AI & ML at TWO AI. Drove and built core technology in conversational AI, multimodal search, multilingual LLMs & avatars, reaching millions of users. [Zappy, Geniya, AutoCameo, Sutra Multilingual]
2017-2021
Joined the Samsung Think Tank Team ('17-'19) working on next-generation devices, especially real-time interactive AR on phones (S8/S9; 5 patents, US & worldwide), then helped start Samsung's STAR Labs ('19-'21) NEON moonshot project and created digital human tech showcased at CES2020; live demo: YouTube.
2014-2017
Joined the Univ. of Oxford (Wolfson College) for a post-doc with the Torr Vision Group. I focused on real-time interactive learning and labelling in 3D scenes [Siggraph ETech, GitHub, BBC News, BMVC], human action detection [ICCV, BMVC, ACML], and instance segmentation [CVPR, YouTube, GitHub, STS++].
2011-2014
Pursued a PhD in Oxford with the Torr Vision Group at Oxford Brookes Univ., under Prof. Fabio Cuzzolin and Prof. Philip H.S. Torr. Worked on human action recognition and detection in large video databases with limited training labels (weak supervision), critical for real-world deployment where labels are expensive. Examined by Prof. Andrew Zisserman and Dr. Tjeerd Olde Scheper. Check out a talk I gave (YouTube) and publications: BMVC, IJCV, PAMI.
2010-2011
Completed an M.Sc. by research at the Univ. of Malta in mobile robotic vision under Prof. Kenneth P. Camilleri. Developed a monocular vision algorithm to help a robot guide itself without hitting obstacles in a previously unknown environment. Check out: GitHub - YouTube - Times of Malta.
2009
Interned with the PERCEPTION team at INRIA Grenoble, under Dr. Miles Hansard and Prof. Radu Horaud. Addressed the real-time calibration update of an active binocular robot with verging cameras. See: Journal, YouTube, RoboHub.
2005-2009
B.Eng. at the Univ. of Malta where I first became hooked on computer vision with a project on "head pose estimation" under Prof. Kenneth P. Camilleri. Check out: GitHub, YouTube.
Building & Research
SutraAvatarV2. A high-quality avatar system that turns text into a talking head from a single image frame. The video shows a selection of T2V generations using the SutraAvatarV2 HuggingFace Space in late 2024. Based on an implicit keypoint method, a custom-designed audio-to-latent speech embedding neural model, and a behavior generator, so a talking head can be produced from just one input image and an audio file. Generates videos ~90x faster (10s of audio takes 20s to generate) than diffusion-based methods (10s of audio, ~30min to generate).
AutoCameo. A service that automates personalized video messages from celebrities to fans at scale. In partnership with Jio Platforms during the 2023 IPL, AutoCameo enabled cricket fans to receive personalized videos from stars like Rohit Sharma and Virat Kohli (~300k users, reached millions of fans). The technology used TWO's HALO-2 engine to generate high-fidelity AI videos with minimal delay. [YouTube]
Multilingual LLMs. (Late 2023) One thing we noticed was how poorly existing LLMs (GPT-3.5, and multilingual models like Aya, Okapi, BLOOM) handled non-English languages. Most LLMs and their tokenizers were trained predominantly on English (see tokenizer comparison). The Sutra Multilingual architecture was designed to address this gap. It decoupled concept learning from language learning, inspired by human cognition (see: LeCun, Gibson), and treated languages as modalities, similar to how multi-modal LLMs project audio/images into the LLM's embedding space. This enabled adding new languages without re-training the base concept model.
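The decoupling idea can be sketched in a few lines. This is a minimal toy illustration, not Sutra's actual code: the "concept model" is stood in for by a fixed random transform, and `LanguageAdapter`, `understand`, and `generate` are hypothetical names for the language-specific projections into and out of the shared concept space.

```python
import numpy as np

rng = np.random.default_rng(0)
D_CONCEPT = 8  # shared concept-space dimensionality (illustrative)

# Frozen "concept model": a fixed transform over concept embeddings.
# In the real architecture this would be a large pretrained model.
W_concept = rng.normal(size=(D_CONCEPT, D_CONCEPT))
concept_model = lambda z: np.tanh(z @ W_concept)

class LanguageAdapter:
    """Projects language-specific token embeddings into (and out of)
    the shared concept space - analogous to how multimodal LLMs
    project image/audio features into the text embedding space."""
    def __init__(self, d_lang):
        self.enc = rng.normal(size=(d_lang, D_CONCEPT))  # language -> concept
        self.dec = rng.normal(size=(D_CONCEPT, d_lang))  # concept -> language
    def understand(self, x):
        return concept_model(x @ self.enc)
    def generate(self, z):
        return z @ self.dec

# Adding a new language means training only its adapter;
# W_concept (the base concept model) stays untouched.
hindi = LanguageAdapter(d_lang=12)
korean = LanguageAdapter(d_lang=10)

x_hi = rng.normal(size=(1, 12))   # fake Hindi token embeddings
z = hindi.understand(x_hi)        # language-agnostic concept vector
y_ko = korean.generate(z)         # decode into another language
print(z.shape, y_ko.shape)        # (1, 8) (1, 10)
```

The key property is visible in the last three lines: understanding and generation pass through a single language-neutral representation, so a new language only needs its own adapter.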
Conversational AI & Search: Zappy. An AI messaging app with personality-driven chatbots: long-term memory, no blind spots, contextual conversations. 20M+ messages sent. [zappy.ai]. Geniya. An AI-powered search engine: user query understanding, search optimization, context management, and agentic design patterns such as plan-act, prompt-chaining, and self-correction. State-of-the-art on the FreshQA dataset (March 2024). [YouTube]
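The agentic patterns named above (plan-act, prompt-chaining, self-correction) compose into a simple loop. A minimal sketch with a stubbed model call; the real Geniya pipeline, prompts, and tools are not shown here, and `llm` is a placeholder, not an actual API.

```python
# Stubbed model call: a real system would query a language model here.
def llm(prompt: str) -> str:
    if prompt.startswith("PLAN"):
        return "1. search the web  2. read top results  3. answer"
    if prompt.startswith("CRITIQUE"):
        return "OK"  # "OK" means the draft passes the self-check
    return "draft answer grounded in search results"

def answer(query: str, max_retries: int = 2) -> str:
    # Plan-act: first ask for a plan, then execute it.
    plan = llm(f"PLAN how to answer: {query}")
    # Prompt-chaining: feed the plan into the answering prompt.
    draft = llm(f"Follow this plan: {plan}\nQ: {query}")
    # Self-correction: critique the draft and revise until it passes.
    for _ in range(max_retries):
        verdict = llm(f"CRITIQUE this answer: {draft}")
        if verdict == "OK":
            break
        draft = llm(f"REVISE using this critique: {verdict}\n{draft}")
    return draft

print(answer("Who won the 2023 IPL?"))
```

Each pattern is one structural element: the plan call, the chained draft call, and the critique/revise loop; swapping the stub for a real model (plus search tools) turns the skeleton into an agent.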
Neon @ CES2020. At STAR Labs (Samsung's moonshot project), I worked on Neon - photorealistic digital humans with real-time behavior synthesis. There was a lot going on behind the scenes: facial expression modeling, photorealistic rendering, speech synchronization, behavior generation. We had extensive media coverage and showed the world how we could be interacting with AI in the future. [CNN, CNBC, Bloomberg, TechRadar, Digital Trends]
Phone/Headset AR. This was my transition from research to production-constrained R&D - solving hard problems with real hardware limitations. Built DVS camera algorithms for AR applications: the full pipeline from frame capture, synchronization, calibration, sensor fusion, visual-inertial SLAM, structure classification and simplification, through to fusion and apps. Optimized to run in real-time (>20fps) on S8/S9 phones. [Patents: 1, 2, 3, 4, 5]
Real-time instance segmentation. The first system to predict instance segmentation masks (not just boxes) in real-time. The state of the art was YOLO, which outputs boxes, but boxes can only provide location, scale, and aspect ratio. (Imagine seeing the world this way: you only 'see' a box, a label, an aspect ratio.) Shape allows higher-order reasoning about object pose, function, and similarity to other objects. One of the most interesting things we found was that teaching object detectors about shape lets them predict reasonable categories/shapes for objects they've never seen before, making sensible predictions where state-of-the-art methods got it completely wrong. [CVPR, YouTube, GitHub, STS++]
SemanticPaint: Interactive Segmentation and Learning of 3D Worlds. Most object detectors can't recognize things in your personal space; they're only trained on standard datasets with fixed categories. SemanticPaint flips this around: it lets you teach the system to recognize objects around you in real-time as you point them out. We combined state-of-the-art techniques with several optimizations to create an interactive system you could actually use in the real world. At a time when most research showed cherry-picked results on curated datasets, building a working system proved which techniques truly worked. [Siggraph ETech, GitHub, BBC News, BMVC]
Human Action Recognition. One of the most interesting aspects of this work was training machine learning models on large video datasets without the bounding boxes typically required during training (weak labels) [BMVC, IJCV]. This is possible in offline scenarios where videos have already been captured. In the online scenario, where frames arrive continuously, we proposed a different method that incrementally builds space-time action tubes, solving the multi-label and multi-instance problems incrementally and jointly. [BMVC, ICCV]
Active Binocular Vision. Usually binocular robots have two parallel cameras, and once calibrated you can't move them! Here we proposed updating the calibration in real-time while the cameras are moving (verging) for 3D reconstruction. Why? You need overlap between the cameras, and correspondences, to calculate depth. But as objects get closer they quickly move outside the overlap zone, making stereo matching impossible. Vergence dynamically maximizes the overlapping region and makes stereo matching easier: when objects are centered in the image, the disparity is smaller, which constrains the search space. [Journal, YouTube, RoboHub]
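The disparity argument above can be made concrete with the standard pinhole stereo relation d = f·B/Z. The numbers below are illustrative toy values, not from the paper:

```python
import math

# Toy stereo rig (illustrative values).
f = 500.0   # focal length in pixels
B = 0.10    # baseline in metres
Z = 0.5     # distance to the fixated object in metres

# Parallel cameras: disparity d = f*B/Z grows as objects approach,
# quickly exceeding the matcher's search range for nearby objects.
d_parallel = f * B / Z
print(f"parallel disparity: {d_parallel:.0f} px")  # 100 px at 0.5 m

# Verged cameras: each camera rotates by the vergence half-angle so
# the optical axes intersect at the object. The fixated point then has
# near-zero residual disparity, and nearby points have small ones,
# which shrinks the correspondence search window.
theta = math.atan((B / 2) / Z)
print(f"vergence half-angle: {math.degrees(theta):.1f} deg")
```

At half a metre the parallel rig already needs a 100-pixel search range, while vergence recentres the object so disparities stay small around fixation.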
Monocular robot guidance. Imagine a robot that knows nothing about the world around it. It gets switched on and has to figure out where it can move with only one camera. This was my first project on online continual learning. The robot could roam around the faculty unaided, without a map, avoiding obstacles. [GitHub - YouTube - Times of Malta]
Smaller side projects
ToyBlocks (Blockchain). Trying to understand something with too many buzzwords!? Implement (parts of) it. This toy example uses lines from Shakespeare as input data and is designed to provide a concrete understanding of the basic ideas that make a blockchain what it is: an unchangeable array of data (an immutable database). [GitHub]
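The "unchangeable array" idea boils down to hash chaining. A minimal sketch of that core mechanism (not the ToyBlocks code itself), using a couple of Shakespeare lines as the data:

```python
import hashlib
import json

def make_block(data: str, prev_hash: str) -> dict:
    """Each block commits to its data AND the previous block's hash,
    so editing any past block changes every hash after it."""
    body = json.dumps({"data": data, "prev": prev_hash}, sort_keys=True)
    return {"data": data, "prev": prev_hash,
            "hash": hashlib.sha256(body.encode()).hexdigest()}

def valid(chain: list) -> bool:
    """A chain is valid iff every block's hash matches its contents
    and every block points at the previous block's hash."""
    for i, b in enumerate(chain):
        body = json.dumps({"data": b["data"], "prev": b["prev"]},
                          sort_keys=True)
        if hashlib.sha256(body.encode()).hexdigest() != b["hash"]:
            return False
        if i > 0 and b["prev"] != chain[i - 1]["hash"]:
            return False
    return True

chain = [make_block("To be, or not to be", "0" * 64)]
chain.append(make_block("that is the question", chain[-1]["hash"]))
print(valid(chain))              # True

chain[0]["data"] = "tampered"    # try to rewrite history...
print(valid(chain))              # False - the chain detects it
```

Immutability here is detection, not prevention: anyone can edit a block, but the broken hash links make the edit immediately visible.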
Watch a Random Forest learn. What happens when a classifier sees its training dataset over time (i.e. it cannot see a uniformly distributed set from the beginning) and needs to update itself later? Or when changing environmental conditions require it to re-learn a class without starting from scratch? [RF: GitHub]
Touch Detection. Getting a 3D surface touch detector (without hand tracking) to work in under 5ms. Uses depth differencing and a random-forest-based filter.
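The depth-differencing step is simple enough to sketch. This is a toy illustration under assumed thresholds (the function name `touch_mask` and the 5-25 mm band are hypothetical, not from the project):

```python
import numpy as np

def touch_mask(depth, background, thresh_mm=(5, 25)):
    """Flag pixels where the live depth frame sits a few millimetres
    in front of a pre-captured background surface: close enough to be
    a touch, far enough above sensor noise. A real system, like the
    one described, would then filter candidates (e.g. with a random
    forest) to reject hovering fingers and noise blobs."""
    diff = background - depth        # positive = in front of the surface
    lo, hi = thresh_mm
    return (diff > lo) & (diff < hi)

# Toy 4x4 depth frames, values in millimetres.
background = np.full((4, 4), 800.0)  # flat tabletop 800 mm from camera
frame = background.copy()
frame[1, 2] = 790.0                  # fingertip 10 mm above the surface
frame[3, 3] = 700.0                  # hand hovering 100 mm above: too far

mask = touch_mask(frame, background)
print(np.argwhere(mask))             # [[1 2]] - only the touch point
```

Because the whole test is one vectorized subtraction and comparison per frame, the differencing stage costs well under a millisecond, leaving the budget for the downstream filter.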
Watch an SVM learn. Watching many SVM classifiers learning and adapting to a changing training set - cool.
Audio Power Meter. In this case, keeping the electronic design simple gave the audio power meter a really nice bounce.
If you've come down this far...