Vittorio Ferrari

Senior Staff Research Scientist at Google

Honorary Professor at the University of Edinburgh

Research interests

My research field is computer vision. I currently work on two primary areas: transfer learning and 3D Deep Learning.

In the past I have worked on several other computer vision problems, including human-machine collaborative annotation, action recognition, human pose estimation, learning visual attributes, shape matching, contour-based object class detection, specific object recognition, multi-view wide-baseline stereo, and tracking in video.

On this page you can find some examples of recent projects. See the publication page for many more!

RayTran: 3D reconstruction of multiple objects from videos with ray-traced transformers

RayTran is a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. On the challenging Scan2CAD dataset, we outperform (1) recent state-of-the-art methods (Vid2CAD, ODAM) for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment.

Urban Radiance Fields

The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views in comparison to both traditional methods (e.g. COLMAP) and recent neural representations (e.g. Mip-NeRF).
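The sky supervision extension, for example, boils down to an extra loss term that discourages any volume density along rays whose pixel a 2D segmentation labels as sky. Below is a minimal numpy sketch under assumed conventions (the paper's exact formulation may differ; the function name and shapes are illustrative):

```python
import numpy as np

def sky_loss(sigmas, deltas, sky_mask):
    """Penalize accumulated opacity along rays labelled as sky.

    sigmas:   (num_rays, num_samples) volume densities along each ray
    deltas:   (num_rays, num_samples) distances between adjacent samples
    sky_mask: (num_rays,) boolean, True where a 2D segmentation labels the pixel sky
    """
    # Standard volume-rendering opacity: 1 - exp(-sum(sigma * delta))
    opacity = 1.0 - np.exp(-np.sum(sigmas * deltas, axis=-1))
    # Sky rays should hit no geometry, so their opacity should be driven to zero.
    return np.mean(opacity[sky_mask] ** 2) if sky_mask.any() else 0.0
```

Added to the photometric loss, this term gives the optimizer an explicit reason to keep space along sky rays empty, instead of explaining the sky with floating density.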

Transferability Metrics

Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which tasks a pre-trained source model can easily be adapted. Transferability metrics try to evaluate the fit of a source model for a particular target. We proposed two new transferability metrics:

  1. Gaussian Bhattacharyya Coefficient, where we embed all target images in the feature space defined by the source model, and represent them with per-class Gaussians. Then, we estimate their pairwise class separability using the Bhattacharyya coefficient, yielding a simple and effective measure of how well a source model transfers to a target task.

  2. Ensemble Transferability Metrics for Semantic Segmentation. We propose several new transferability metrics designed to select an ensemble of source models which, after fine-tuning on the target training set, yields the best performance on the target test set.
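For intuition, the Gaussian Bhattacharyya Coefficient can be sketched in a few lines. This is a minimal illustration using diagonal covariances, not the paper's exact implementation; the function names and the small variance floor are my own choices:

```python
import numpy as np

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal Gaussians."""
    var = 0.5 * (var1 + var2)
    term1 = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    term2 = 0.5 * np.sum(np.log(var) - 0.5 * (np.log(var1) + np.log(var2)))
    return term1 + term2

def gbc_score(features, labels):
    """Negated sum of pairwise Bhattacharyya coefficients over class pairs.

    features: (num_images, dim) target images embedded by the source model.
    Higher scores mean better-separated classes, i.e. an easier transfer.
    """
    classes = np.unique(labels)
    stats = {}
    for c in classes:
        f = features[labels == c]
        stats[c] = (f.mean(axis=0), f.var(axis=0) + 1e-6)  # floor for stability
    score = 0.0
    for i, ci in enumerate(classes):
        for cj in classes[i + 1:]:
            db = bhattacharyya_distance(*stats[ci], *stats[cj])
            score -= np.exp(-db)  # Bhattacharyya coefficient = exp(-distance)
    return score
```

Ranking several source models by this score then predicts, without any fine-tuning, which one is likely to transfer best to the target task.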

Moreover, we have conducted a large-scale study by systematically constructing a broad range of 715k experimental setup variations. We discover that even small variations to an experimental setup lead to different conclusions about whether one transferability metric is superior to another. We then propose better evaluation procedures that aggregate across many experiments, leading to more stable conclusions.


  • CVPR 2022 (oral): Transferability Metrics for Selecting Source Model Ensembles

  • CVPR 2022: Transferability Estimation using Bhattacharyya Class Separability

  • ECCV 2022: How stable are Transferability Metrics evaluations?

Shape-conditioned Radiance Fields

We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner that faithfully represent the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. Several experiments demonstrate the effectiveness of our approach on both synthetic and real images.

Transfer Learning across Diverse Appearance Domains and Task Types

Transfer learning enables re-using knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models: pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning it on a target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this work we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured output task types relevant to modern computer vision applications. In total we carry out over 1200 transfer experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyse these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations for practitioners.


  • TPAMI paper (Factors of Influence for Transfer Learning across Diverse Appearance Domains and Task Types)
    [on arXiv since March 2021, new version November 2021]

From points to multi-object 3D Reconstruction

We propose a method to detect and reconstruct multiple 3D objects from a single RGB image. The key idea is to optimize for detection, alignment and shape jointly over all objects in the RGB image, while focusing on realistic and physically plausible reconstructions. To this end, we propose a key-point detector that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes -- all in a single forward pass.
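A center-point detector of this kind reads out detections as local maxima of a predicted heatmap, with per-pixel regressed properties attached. A simplified, hypothetical decoding sketch (not the paper's code; names and thresholds are illustrative):

```python
import numpy as np

def decode_centers(heatmap, properties, k=10, threshold=0.3):
    """Read off detections from a center-point heatmap.

    heatmap:    (H, W) per-pixel object-center scores
    properties: (H, W, D) per-pixel regressed properties (e.g. box and shape params)
    Returns up to k tuples (score, y, x, property_vector) for confident local maxima.
    """
    H, W = heatmap.shape
    dets = []
    for y in range(H):
        for x in range(W):
            s = heatmap[y, x]
            if s < threshold:
                continue
            # Keep only local maxima in a 3x3 neighbourhood.
            patch = heatmap[max(0, y - 1):y + 2, max(0, x - 1):x + 2]
            if s >= patch.max():
                dets.append((float(s), y, x, properties[y, x]))
    dets.sort(key=lambda d: -d[0])
    return dets[:k]
```

Because every object property is regressed at the center pixel, the whole scene is decoded in a single forward pass, with no proposal or NMS stage beyond the local-maximum check.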

Vid2CAD: CAD Model Alignment using Multi-view Constraints from Videos

We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method is able to process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene.

CoReNet: Coherent 3D scene reconstruction

We introduce a method for 3D reconstruction from a single RGB image. We build on common encoder-decoder architectures, and we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume.


Neural Voxel Renderer

We present a neural rendering framework that maps a voxelized scene into a high quality image. Our method realistically renders highly-textured objects, complete with shading, shadows, reflections and specular highlights on the ground. Moreover, our approach allows controllable rendering: the user can edit the object geometry, texture appearance, and the light and camera positions. All these modifications are accurately represented in the output rendering.


Open Images V6

Check out Open Images V6, a very large-scale dataset annotated with image-level labels, object bounding boxes, object segmentation masks, visual relationships, and localized narratives. It contains a total of 16M bounding boxes for 600 object classes, making it the largest existing dataset with object location annotations. Open Images also offers 2.8M object segmentation masks for 350 classes, and 3.3M visual relationship annotations indicating pairs of objects in particular relations (e.g. "woman playing guitar", "beer on table"). In V6 we added Localized Narratives, a new form of multimodal annotations consisting of synchronized voice, textual caption, and mouse traces over the objects being described. Learn more on our blog. Finally, the dataset is annotated with 59.9M image-level labels spanning 19,957 classes.


Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
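The core alignment idea is simple: because voice and pointer share a clock, the trace segment for a word is just the mouse points recorded while that word was spoken. A hypothetical sketch (data layout and names assumed, not the released format):

```python
def ground_words(word_timings, trace):
    """Attach a mouse-trace segment to every spoken word.

    word_timings: list of (word, start_time, end_time) from the voice recording
    trace:        list of (x, y, t) mouse points, synchronized with the audio
    Returns a dict: word index -> list of (x, y) points recorded while saying it.
    """
    grounding = {}
    for i, (word, start, end) in enumerate(word_timings):
        grounding[i] = [(x, y) for (x, y, t) in trace if start <= t <= end]
    return grounding
```

This is what makes the grounding dense: every word gets its own spatial footprint for free, without any per-word manual annotation.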


Localized Narratives for Image Retrieval

We present an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where") to express the characteristics of the desired target image. To this end, we learn an image retrieval model using the Localized Narratives dataset, which is capable of performing early fusion between text descriptions and synchronized mouse traces. Qualitative and quantitative experiments show that our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.


  • ICCV 2021 paper (Telling the What while Pointing to the Where: Multimodal Queries for Image Retrieval)

Towards reusable network components

We take a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. In particular, we split a network into two components, a feature extractor and a target task head, and propose various approaches to accomplish compatibility between them. We systematically analyse these approaches on the task of image classification on standard datasets. We demonstrate that we can produce components which are directly compatible without any fine-tuning or compromising accuracy on the original tasks. We then demonstrate the use of compatible components in three applications: unsupervised domain adaptation, transferring classifiers across feature extractors with different architectures, and increasing the computational efficiency of transfer learning.
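As an illustration of the basic idea, the simplest route to compatibility is to freeze a shared feature extractor and train each task head against it, so any head composes with the extractor without fine-tuning. A toy numpy sketch (my own simplification, not one of the paper's actual approaches):

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_extractor(x, W):
    """Shared frozen backbone: a fixed projection followed by ReLU."""
    return np.maximum(x @ W, 0.0)

def train_head(features, labels, num_classes):
    """Fit a linear head on frozen features (least-squares to one-hot targets)."""
    onehot = np.eye(num_classes)[labels]
    head, *_ = np.linalg.lstsq(features, onehot, rcond=None)
    return head

# Key point: every head is trained against the SAME frozen extractor,
# so the components stay interchangeable with no fine-tuning step.
```

Two heads fit this way for different label sets both plug directly into `feature_extractor`, which is the compatibility property the work aims for (its methods achieve it without sacrificing accuracy, unlike naive freezing).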


Interactive Object Segmentation

We developed a technique for annotating object segmentation masks where a human annotator and a machine segmentation model collaborate. The machine proposes an initial segmentation, and then the system iterates between (1) the human clicking on errors in the current segmentation; and (2) the machine incorporating these clicks to refine the segmentation. This strategy delivers high quality object segmentation masks with only ~10 clicks per instance. By re-annotating part of the COCO dataset, we have shown that we can produce instance masks 3x faster than traditional polygon drawing tools while also providing better quality.
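The annotation loop itself can be sketched abstractly; `model` and `oracle` below are hypothetical stand-ins for the segmentation network and the human annotator:

```python
def annotate_interactively(image, model, oracle, max_clicks=10):
    """Human-machine loop for object mask annotation.

    model:  f(image, clicks) -> mask; refines its prediction given correction clicks
    oracle: f(mask) -> click or None; stands in for the annotator, returning a
            (y, x, is_foreground) click on an error, or None when satisfied
    """
    clicks = []
    mask = model(image, clicks)           # initial machine proposal
    for _ in range(max_clicks):
        click = oracle(mask)              # annotator points at an error
        if click is None:
            break                         # mask is good enough
        clicks.append(click)
        mask = model(image, clicks)       # machine refines using all clicks so far
    return mask, clicks
```

The budget of roughly 10 clicks per instance mentioned above corresponds to `max_clicks`: the loop ends early whenever the annotator has nothing left to correct.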

We used this technique on CrowdCompute to collect 2.7M instance masks over 365 categories in OpenImages (released with Open Images V5, see CVPR'19 paper).

In a follow-up work (ECCV'20) we capitalize on the observation that clicks on errors provide ground-truth labels for individual pixels, and use them to adapt our model on-the-fly to the current image. This allows it to generalize better to changing distributions and domains.


Research group at Google Zurich

My group at Google Research works in two different areas.

The first area is visual learning. We aim at learning visual models from large image datasets where ground-truth location annotations are available for only a small fraction of all images. We develop techniques based on two general ideas. The first is lifelong learning, where the computer continuously learns new models by building on all the knowledge it acquired in the past. This helps bridge the lack of location supervision for the majority of the data, and build an integrated, coherent body of visual knowledge. The second idea is human-machine collaboration, where a human annotator and a machine learning model work together to annotate a dataset. This improves efficiency over annotating the dataset completely manually, and provides the machine with valuable labelled examples of what it does not already know.

The second area is 3D Deep Learning. Our long-term goal is to enable full-scene 3D reconstruction from a single RGB image. We want to address complex scenes with multiple objects partially occluding each other. We plan to exploit learned 3D scene priors to address this underconstrained problem, especially priors on object shapes and their spatial arrangements into characteristic scene patterns, such as chairs around a table. In order to learn effectively we rely heavily on generating photorealistic synthetic data. Along the way we develop various techniques for reconstructing 3D objects from images, fitting CAD models to videos, and for neural rendering.

Check out the publication list!

Research group at University of Edinburgh

Between January 2012 and October 2019 I led the CALVIN research group at the University of Edinburgh. My last PhD student has graduated, so the group is now closed. Check out the CALVIN publication list!

Short bio

Vittorio Ferrari is a Senior Staff Research Scientist at Google, where he leads a research group on computer vision. He received his PhD from ETH Zurich in 2004, then was a post-doc at INRIA Grenoble (2006-2007) and at the University of Oxford (2007-2008). Between 2008 and 2012 he was an Assistant Professor at ETH Zurich, funded by a Swiss National Science Foundation Professorship grant. In 2012-2018 he was faculty at the University of Edinburgh, where he became a Full Professor in 2016 (now an Honorary Professor). His work on large-scale segmentation won the best paper award at the European Conference on Computer Vision 2012. He received the prestigious ERC Starting Grant, also in 2012. He is the author of over 140 technical publications. He was a Program Chair for ECCV 2018 and a General Chair for ECCV 2020. He is an Associate Editor of the International Journal of Computer Vision, and formerly of IEEE Transactions on Pattern Analysis and Machine Intelligence. His current research interests are in 3D Deep Learning, transfer learning, and human-machine collaboration for annotation.