Peter's Blog

Computer Vision for Metaverse Workshop @ ECCV 2022

Computer Vision (CV) research plays an essential role in enabling the future applications of Augmented Reality (AR), Virtual Reality (VR), and Mixed Reality (MR), which are nowadays collectively referred to as the Metaverse. Building the Metaverse requires CV technologies that better understand people, objects, scenes, and the world around us, and that render content in more immersive and realistic ways. This brings new problems to CV research and inspires us to look at existing CV problems from new perspectives. As public interest grows and industry invests more effort in the Metaverse, we think this is a good opportunity to organize a workshop where the computer vision community can get together to showcase its latest research, discuss new directions and problems, and influence the future trajectory of Metaverse research and applications.

Link

D2Go brings Detectron2 to mobile

Detectron2, released by Facebook AI Research (FAIR) in 2019, gives developers an easy path to plugging custom modules into any object detection system. Today, the Mobile Vision team at Facebook Reality Labs (FRL) is expanding on Detectron2 with the introduction of Detectron2Go (D2Go), a new, state-of-the-art extension for training and deploying efficient deep learning object detection models on mobile devices and hardware. D2Go is built on top of Detectron2, PyTorch Mobile, and TorchVision. It’s the first tool of its kind, and it will allow developers to take their machine learning models from training all the way to deployment on mobile.
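
To give a feel for the mobile deployment path D2Go builds on, here is a minimal sketch of the generic PyTorch Mobile export flow. This is not D2Go's own API, and the MobileNetV3 model is only a stand-in for the efficient detection models D2Go actually provides:

```python
import torch
import torchvision
from torch.utils.mobile_optimizer import optimize_for_mobile

# Stand-in model; D2Go ships efficient FBNet-based detection models instead.
model = torchvision.models.mobilenet_v3_small().eval()

# Convert to TorchScript, apply mobile-specific graph optimizations, and save
# an artifact that the PyTorch Mobile lite interpreter can load on device.
scripted = torch.jit.script(model)
optimized = optimize_for_mobile(scripted)
optimized._save_for_lite_interpreter("model.ptl")
```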

Blog Post, GitHub

FBNetV3: Joint Architecture-Recipe Search using Neural Acquisition Function

Neural Architecture Search (NAS) yields state-of-the-art neural networks that outperform their best manually-designed counterparts. However, previous NAS methods search for architectures under one training recipe (i.e., training hyperparameters), ignoring the significance of training recipes and overlooking superior architectures under other training recipes. Thus, they fail to find higher-accuracy architecture-recipe combinations. To address this oversight, we present JointNAS to search both (a) architectures and (b) their corresponding training recipes. To accomplish this, we introduce a neural acquisition function that scores architectures and training recipes jointly. Following pre-training on a proxy dataset, this acquisition function guides both coarse-grained and fine-grained searches to produce FBNetV3. FBNetV3 is a family of state-of-the-art compact ImageNet models, outperforming both automatically and manually-designed architectures. For example, FBNetV3 matches both EfficientNet and ResNeSt accuracy with 1.4x and 5.0x fewer FLOPs, respectively. Furthermore, the JointNAS-searched training recipe yields significant performance gains across different networks and tasks.
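
A rough sketch of the idea behind the neural acquisition function: a small predictor scores (architecture, recipe) pairs jointly, and the highest-scoring pairs become the candidates worth training further. The encodings, dimensions, and network below are illustrative assumptions, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class JointPredictor(nn.Module):
    """Scores an (architecture, recipe) pair; all dimensions are illustrative."""
    def __init__(self, arch_dim=32, recipe_dim=8, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(arch_dim + recipe_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted accuracy of the pair
        )

    def forward(self, arch_enc, recipe_enc):
        return self.mlp(torch.cat([arch_enc, recipe_enc], dim=-1)).squeeze(-1)

# Rank a batch of random candidate pairs by predicted accuracy.
predictor = JointPredictor()
arch_candidates = torch.rand(100, 32)    # e.g., encodings of per-layer choices
recipe_candidates = torch.rand(100, 8)   # e.g., learning rate, EMA decay, mixup strength
scores = predictor(arch_candidates, recipe_candidates)
best = scores.topk(5).indices            # candidates to actually train/evaluate next
```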

Paper

Visual Transformers: Token-based Image Representation and Processing for Computer Vision

Computer vision has achieved great success using standardized image representations -- pixel arrays, and the corresponding deep learning operators -- convolutions. In this work, we challenge this paradigm: we instead (a) represent images as a set of visual tokens and (b) apply visual transformers to find relationships between visual semantic concepts. Given an input image, we dynamically extract a set of visual tokens from the image to obtain a compact representation for high-level semantics. We then use visual transformers to operate over the visual tokens to densely model relationships between them. We find that this paradigm of token-based image representation and processing drastically outperforms its convolutional counterparts on image classification and semantic segmentation. To demonstrate the power of this approach on ImageNet classification, we use ResNet as a convenient baseline and use visual transformers to replace the last stage of convolutions. This reduces the stage's MACs by up to 6.9x, while attaining up to 4.53 points higher top-1 accuracy. For semantic segmentation, we use a visual-transformer-based FPN (VT-FPN) module to replace a convolution-based FPN, using 6.5x fewer MACs while achieving up to 0.35 points higher mIoU on LIP and COCO-stuff.
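
The sketch below illustrates the two ingredients in a much-simplified, hypothetical form: a tokenizer that pools a convolutional feature map into a handful of visual tokens via spatial attention, and a standard transformer encoder that models relationships between those tokens. The paper's actual modules differ in detail:

```python
import torch
import torch.nn as nn

class Tokenizer(nn.Module):
    """Pools a feature map into a small set of visual tokens via spatial attention."""
    def __init__(self, channels=256, num_tokens=16):
        super().__init__()
        self.attn = nn.Conv2d(channels, num_tokens, kernel_size=1)

    def forward(self, feats):                                 # feats: (B, C, H, W)
        attn = self.attn(feats).flatten(2).softmax(dim=-1)    # (B, L, H*W)
        feats = feats.flatten(2)                               # (B, C, H*W)
        return torch.einsum("blp,bcp->blc", attn, feats)       # (B, L, C) tokens

feats = torch.randn(2, 256, 14, 14)                # e.g., output of a late ResNet stage
tokens = Tokenizer()(feats)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True), num_layers=2)
tokens = encoder(tokens)                           # relationships modeled among tokens
```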

Paper

FBNetV2: Differentiable Neural Architecture Search for Spatial and Channel Dimensions

Differentiable Neural Architecture Search (DNAS) has demonstrated great success in designing state-of-the-art, efficient neural networks. However, DARTS-based DNAS's search space is small when compared to other search methods', since all candidate network layers must be explicitly instantiated in memory. To address this bottleneck, we propose a memory- and computationally efficient DNAS variant: DMaskingNAS. This algorithm expands the search space by up to 10^14x over conventional DNAS, supporting searches over spatial and channel dimensions that are otherwise prohibitively expensive: input resolution and number of filters. We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands. Furthermore, we employ effective shape propagation to maximize per-FLOP or per-parameter accuracy. The searched FBNetV2s yield state-of-the-art performance when compared with all previous architectures. With up to 421x less search cost, DMaskingNAS finds models with 0.9% higher accuracy and 15% fewer FLOPs than MobileNetV3-Small, and with similar accuracy but 20% fewer FLOPs than EfficientNet-B0. Furthermore, our FBNetV2 outperforms MobileNetV3 by 2.6% in accuracy, with equivalent model size.
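
To give a sense of the masking idea for channel search, here is a toy sketch: a single max-width convolution is computed once, and a Gumbel-softmax-weighted sum of binary channel masks selects the effective width, so memory and compute stay nearly constant as the number of channel choices grows. The layer below is an illustrative assumption, not the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelMaskedConv(nn.Module):
    """One max-width conv shared by all channel-count choices; a Gumbel-softmax
    over the choices mixes binary masks that gate the output channels."""
    def __init__(self, in_ch, max_out_ch=32, choices=(8, 16, 24, 32)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, max_out_ch, 3, padding=1)
        self.logits = nn.Parameter(torch.zeros(len(choices)))
        masks = torch.zeros(len(choices), max_out_ch)
        for i, c in enumerate(choices):
            masks[i, :c] = 1.0
        self.register_buffer("masks", masks)

    def forward(self, x):
        out = self.conv(x)                                    # computed once, reused
        w = F.gumbel_softmax(self.logits, tau=1.0, hard=False)
        mask = (w[:, None] * self.masks).sum(0)               # soft channel mask
        return out * mask[None, :, None, None]

y = ChannelMaskedConv(16)(torch.randn(2, 16, 8, 8))           # shape: (2, 32, 8, 8)
```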

Paper, Code

Turning any 2D photo into 3D using convolutional neural nets

Our 3D Photos feature on Facebook launched in 2018 as a new, immersive format for sharing pictures with friends and family. However, the feature has relied on the dual-lens “portrait mode” capabilities available only in newer, higher-end smartphones, so it hasn't been available on typical mobile devices, which have only a single, rear-facing camera. To bring this new visual format to more people, we have used state-of-the-art machine learning techniques to produce 3D photos from virtually any standard 2D picture. This system infers the 3D structure of any image, whether it is a new shot just taken on an Android or iOS device with a standard single camera, or a decades-old image recently uploaded to a phone or laptop.
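
The heart of the system is single-image depth estimation. As a rough stand-in (this is Intel's publicly available MiDaS model loaded via torch.hub, not the 3D Photos network), the snippet below shows how a per-pixel depth map can be inferred from one ordinary photo:

```python
import cv2
import torch

# Publicly available monocular depth model, used here only as a stand-in.
model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
model.eval()
transform = torch.hub.load("intel-isl/MiDaS", "transforms").small_transform

img = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)  # any 2D photo
with torch.no_grad():
    depth = model(transform(img))                # (1, H', W') relative inverse depth
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=img.shape[:2], mode="bicubic", align_corners=False
).squeeze()                                      # per-pixel depth used to lift pixels into 3D
```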

Project URL

Hand pose estimation

Researchers and engineers from Facebook Reality Labs and Oculus have developed what is, as of today, the only fully articulated hand-tracking system for VR that relies entirely on monochrome cameras. The system does not use active depth-sensing technology or any additional equipment (such as instrumented gloves). We will deploy this technology as a software update for Oculus Quest, the cable-free, stand-alone VR headset that is now available to consumers.

Project URL

Automatic Neural Architecture Search

Designing accurate and efficient ConvNets for mobile devices is challenging because the design space is combinatorially large. As a result, previous neural architecture search (NAS) methods are computationally expensive. ConvNet architecture optimality depends on factors such as input resolution and target devices, but existing approaches are too expensive for case-by-case redesigns. Also, previous work focuses primarily on reducing FLOPs, but FLOP count does not always reflect actual latency. To address these issues, we propose a differentiable neural architecture search (DNAS) framework that uses gradient-based methods to optimize ConvNet architectures, avoiding the need to enumerate and train individual architectures separately as in previous methods. FBNets, a family of models discovered by DNAS, surpass state-of-the-art models, both manually designed and automatically generated.
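
A toy sketch of the differentiable search idea: every candidate block in a layer is executed, their outputs are mixed with Gumbel-softmax weights over learnable architecture parameters, and those parameters are trained by ordinary gradient descent alongside the network weights. The block choices and sizes below are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SearchableLayer(nn.Module):
    """All candidate blocks run in parallel; a Gumbel-softmax over architecture
    parameters weights their outputs, so the block choice is learned by gradients."""
    def __init__(self, ch=16):
        super().__init__()
        self.candidates = nn.ModuleList([
            nn.Conv2d(ch, ch, 3, padding=1),   # e.g., 3x3 conv
            nn.Conv2d(ch, ch, 5, padding=2),   # e.g., 5x5 conv
            nn.Identity(),                     # skip connection
        ])
        self.alpha = nn.Parameter(torch.zeros(len(self.candidates)))

    def forward(self, x):
        weights = F.gumbel_softmax(self.alpha, tau=1.0)
        return sum(w * op(x) for w, op in zip(weights, self.candidates))

out = SearchableLayer()(torch.randn(2, 16, 32, 32))
# After the search, the block with the largest alpha is kept and the rest discarded.
```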

Project URL

Facebook Portal

Portal and Portal+ are video calling devices for the home that leverage the latest computer vision research to understand how people move and connect. Portal's Smart Camera stays with the action, automatically panning and zooming to keep everyone in view.

Project URL

Instagram's Nametag

Nametag is a customizable identification card that allows people to find your Instagram profile when it’s scanned. Your nametag is uniquely yours and makes it quick and fun to add people and accounts you discover in person.

Project URL

Enabling full body AR with Mask R-CNN2Go

We recently developed a new technology that can accurately detect body poses and segment a person from their background. Our model is still in the research phase at the moment, but it is only a few megabytes and can run on smartphones in real time. Someday, it could enable many new applications, such as creating body masks, using gestures to control games, or de-identifying people.
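
As a rough illustration of the kinds of outputs involved (person detection plus body keypoints), here is a sketch using torchvision's server-class Keypoint R-CNN as a stand-in; it is far larger than the few-megabyte mobile model described above:

```python
import torch
import torchvision

# Stand-in model; pass weights="DEFAULT" (torchvision >= 0.13) for COCO-pretrained weights.
model = torchvision.models.detection.keypointrcnn_resnet50_fpn().eval()

img = torch.rand(3, 480, 640)                     # replace with a real image tensor in [0, 1]
with torch.no_grad():
    pred = model([img])[0]

keep = pred["scores"] > 0.8
people = pred["boxes"][keep]                      # detected person boxes
poses = pred["keypoints"][keep]                   # 17 COCO body keypoints per person
```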

Project URL

Facebook real-time Style Transfer and Caffe2Go

As video becomes an even more popular way for people to communicate, we want to give everyone state-of-the-art creative tools to help them express themselves. We recently began testing a new creative-effect camera in the Facebook app that helps people turn videos into works of art in the moment. That technique is called “style transfer.” It takes the artistic qualities of one image style, like the way Van Gogh paintings look, and applies them to other images and videos. It's a technically difficult trick to pull off, normally requiring the content to be sent off to data centers for processing on big-compute servers — until now. We've developed a new deep learning platform on mobile so it can — for the first time — capture, analyze, and process pixels in real time, putting state-of-the-art technology in the palm of your hand. This is a full-fledged deep learning system called Caffe2Go, and the framework is now embedded into our mobile apps. By condensing the size of the AI model used to process images and videos by 100x, we're able to run various deep neural networks with high efficiency on both iOS and Android. Ultimately, we were able to provide AI inference on some mobile phones at less than 1/20th of a second, essentially 50 ms — a human eye blink takes about 1/3rd of a second, or 300 ms.
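
For the curious, the core idea of style transfer can be sketched in a few lines: match the Gram matrices (feature correlations) of a style image in a pretrained VGG feature space while staying close to the content image. The slow optimization loop below is only illustrative; the on-device system described above uses a small feed-forward network trained ahead of time:

```python
import torch
import torch.nn.functional as F
import torchvision

# Frozen VGG features as the perceptual space (torchvision >= 0.13 weights API).
vgg = torchvision.models.vgg16(weights="DEFAULT").features[:16].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def gram(feats):                                   # (B, C, H, W) -> (B, C, C)
    b, c, h, w = feats.shape
    f = feats.flatten(2)
    return f @ f.transpose(1, 2) / (c * h * w)

content = torch.rand(1, 3, 256, 256)               # placeholders for real images
style = torch.rand(1, 3, 256, 256)
output = content.clone().detach().requires_grad_(True)
opt = torch.optim.Adam([output], lr=0.05)

for _ in range(100):
    loss = F.mse_loss(vgg(output), vgg(content)) \
         + 1e3 * F.mse_loss(gram(vgg(output)), gram(vgg(style)))
    opt.zero_grad()
    loss.backward()
    opt.step()
```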

Project URL

Facebook accessibility for users with visual impairments

Facebook’s artificial intelligence systems now report more offensive photos than humans do. AI could quarantine obscene content before it ever hurts the psyches of real people.

Project URL

Virtual Reality - 360 3D 8K VR video capturing

As a hobby, I captured my first 360° 3D 8K video of the Eszterlanc Hungarian Folk Dance Group at Golden Gate Park in San Francisco.

Computer Vision at Facebook


Personalized TeleVision News (PTVN)

In this project, we seek to develop and demonstrate a platform for personalized television news to replace the traditional one-broadcast-fits-all model. We forecast that next-generation video news consumption will be more personalized, device agnostic, and pooled from many different information sources. The technology for our project represents a major step in this direction, providing each viewer with a personalized newscast with stories that matter most to them. We believe that such a model can provide a vastly superior user experience and provide fine-grained analytics to content providers. While personalized viewing is increasingly popular for text-based news, personalized real-time video news streams are a critically missing technology.

Details - Demo page - Android application

Storytelling with Augmented Reality (STAR)

STAR will experiment with using augmented reality software on mobile devices in combination with location- and viewpoint-aware storytelling. The group hopes to foster more interactive and immersive storytelling by displaying a video stream of virtual content that overlaps with live images of the physical world as viewed on a mobile device.

Tag propagation

In the past few years, sharing photos within social networks has become very popular. To make these huge collections easier to explore, images are usually tagged with representative keywords such as persons, events, objects, and locations. To speed up the time-consuming tag annotation process, tags can be propagated based on the similarity between image content and context. [In Tags We Trust]
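
A toy sketch of nearest-neighbour tag propagation, with made-up features and tags: an untagged image inherits the tags of its most visually similar neighbours, weighted by similarity:

```python
import numpy as np

def propagate_tags(query_feat, db_feats, db_tags, k=3):
    # Cosine similarity between the query image and every tagged image.
    sims = db_feats @ query_feat / (
        np.linalg.norm(db_feats, axis=1) * np.linalg.norm(query_feat) + 1e-8)
    votes = {}
    for idx in np.argsort(-sims)[:k]:               # k most similar tagged images
        for tag in db_tags[idx]:
            votes[tag] = votes.get(tag, 0.0) + sims[idx]
    return sorted(votes, key=votes.get, reverse=True)

db_feats = np.random.rand(5, 128)                   # visual features of tagged images
db_tags = [["beach", "sunset"], ["dog"], ["beach"], ["city"], ["dog", "park"]]
print(propagate_tags(np.random.rand(128), db_feats, db_tags))
```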

Trust Modeling in Social Media Tagging

One important challenge in tagging is to identify the most appropriate tags for a given piece of content and, at the same time, to eliminate noisy or spam tags. Shared content is sometimes assigned inappropriate tags for several reasons. First of all, users are human beings and make mistakes. Moreover, wrong tags may be provided on purpose for advertising, for self-promotion, or to increase the rank of a particular tag in automatic search engines. Consequently, assigning free-form keywords (tags) to multimedia content carries the risk that wrong or irrelevant tags eventually prevent users from enjoying the benefits of annotated content. Studies of the Flickr website have revealed that user-provided tags are often imprecise and that only around 50% of tags are truly related to an image. Besides the tag-content association, spam can take other forms, e.g., spam content or a spam user (spammer).

Swiss Cheese - An advanced image management platform

The Multimedia Signal Processing Group at EPFL (http://mmspg.epfl.ch) has developed Swiss Cheese, an advanced image management platform for online use and mobile devices (http://cheese.epfl.ch). Besides standard features such as image upload, tagging, and keyword-based search, it offers visual-similarity-based search, object-based tagging, and semi-automatic tag propagation. For improved interoperability between different image repositories and applications, the platform supports the export and import of image files with embedded metadata in a JPSearch Part 4 compliant format.

Museum Guide

An indoor mobile museum guide application, developed for the Olympic Museum in Lausanne and based on image recognition. The application provides audiovisual information about the museum's exhibits, with the goal of making the visit more interactive and enjoyable.