Vittorio Ferrari


Director of Science at Synthesia.io

Short bio

Vittorio Ferrari is the Director of Science at Synthesia, where he leads R&D groups developing cutting-edge generative AI technology. Previously he built and led multiple research groups on computer vision and machine learning at Google (Principal Scientist), the University of Edinburgh (Full Professor), and ETH Zurich (Assistant Professor). He has co-authored over 160 scientific papers and won the best paper award at the European Conference on Computer Vision in 2012 for his work on large-scale segmentation. He received the prestigious ERC Starting Grant, also in 2012. He led the creation of Open Images, one of the most widely adopted computer vision datasets worldwide. While at Google his groups contributed technology to several major products, with launches on e.g. the Pixel phone, Google Photos, and Google Lens. He was a Program Chair for ECCV 2018 and a General Chair for ECCV 2020. He is an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence, and formerly of the International Journal of Computer Vision. His recent research interests are in 3D Deep Learning and Vision+Language models.

On this page you can find some examples of recent projects. See the publication page for many more!

CAD-Estate dataset

CAD-Estate is a large dataset of complex multi-object RGB videos, each annotated with a globally-consistent 3D representation of its objects, as well as with a room layout consisting of structural elements in 3D, such as walls, floors, and ceilings.

We annotate each object with a CAD model from a database, and place it in the 3D coordinate frame of the scene with a 9-DoF pose transformation. CAD-Estate offers 101K instances of 12K unique CAD models. The videos of CAD-Estate feature wide, complex views of real estate properties. They contain numerous objects in each frame, many of which are far from the camera and thus appear small.
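For readers new to the 9-DoF convention, the short numpy sketch below (illustrative only, not part of the CAD-Estate tooling) shows how per-axis scale, a rotation matrix and a translation compose into a single 4x4 transform that places a canonical CAD model in the scene coordinate frame.

import numpy as np

def pose_9dof_to_matrix(scale, rotation, translation):
    """Compose per-axis scale (3,), rotation matrix (3, 3) and
    translation (3,) into a single 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = rotation @ np.diag(scale)   # scale first, then rotate
    T[:3, 3] = translation
    return T

# Example: place a vertex of a canonical CAD model in the scene frame.
scale = np.array([1.2, 0.8, 2.0])            # per-axis scale
rotation = np.eye(3)                          # no rotation, for simplicity
translation = np.array([0.5, 0.0, 3.0])       # position in scene coordinates

T = pose_9dof_to_matrix(scale, rotation, translation)
cad_vertices = np.array([[0.5, 0.5, 0.5, 1.0]])     # homogeneous CAD vertex
scene_vertices = (T @ cad_vertices.T).T              # transformed into the scene
print(scene_vertices[:, :3])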

We provide annotations of generic 3D room layouts for 2246 videos, with the 3D plane equations and spatial extents of all structural elements. The videos contain complex topologies, with multiple rooms connected by open doors, multiple floors connected by stairs, and generic geometry with slanted structural elements.

Resources

Encyclopedic VQA

Encyclopedic-VQA (arxiv) is a large-scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances.

Resources

Connecting Vision and Language with Video Localized Narratives 

We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this protocol is challenging to apply directly to videos. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models.

RayTran: 3D reconstruction of multiple objects from videos with ray-traced transformers 

RayTran is a transformer-based neural network architecture for multi-object 3D reconstruction from RGB videos. It relies on two alternative ways to represent its knowledge: as a global 3D grid of features and an array of view-specific 2D grids. We progressively exchange information between the two with a dedicated bidirectional attention mechanism. We exploit knowledge about the image formation process to significantly sparsify the attention weight matrix, making our architecture feasible on current hardware, both in terms of memory and computation. We attach a DETR-style head on top of the 3D feature grid in order to detect the objects in the scene and to predict their 3D pose and 3D shape. Compared to previous methods, our architecture is single stage, end-to-end trainable, and it can reason holistically about a scene from multiple video frames without needing a brittle tracking step. On the challenging Scan2CAD dataset, we outperform (1) recent state-of-the-art methods (Vid2CAD, ODAM) for 3D object pose estimation from RGB videos; and (2) a strong alternative method combining Multi-view Stereo with RGB-D CAD alignment. 
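To give a rough feeling for how image-formation knowledge can sparsify attention, the hypothetical numpy snippet below marks a voxel/pixel pair as attendable only when the voxel center projects into that pixel; the actual ray-traced attention in RayTran is more involved.

import numpy as np

def attention_mask(voxel_centers, K, R, t, height, width):
    """Boolean mask of shape (num_voxels, height * width): True where a voxel
    projects onto that pixel, so attention elsewhere can be skipped."""
    cam = R @ voxel_centers.T + t[:, None]             # world -> camera
    uvw = K @ cam                                       # camera -> image plane
    u = np.round(uvw[0] / uvw[2]).astype(int)
    v = np.round(uvw[1] / uvw[2]).astype(int)
    in_front = uvw[2] > 0
    in_image = (u >= 0) & (u < width) & (v >= 0) & (v < height) & in_front

    mask = np.zeros((len(voxel_centers), height * width), dtype=bool)
    idx = np.flatnonzero(in_image)
    mask[idx, v[idx] * width + u[idx]] = True           # nearest pixel per voxel center
    return mask

# Toy setup: a 2x2x2 voxel grid in front of an 8x8 pinhole camera.
grid = np.stack(np.meshgrid(*[np.linspace(-0.5, 0.5, 2)] * 3), -1).reshape(-1, 3)
grid[:, 2] += 3.0                                       # push the grid in front of the camera
K = np.array([[8.0, 0, 4.0], [0, 8.0, 4.0], [0, 0, 1.0]])
mask = attention_mask(grid, K, np.eye(3), np.zeros(3), 8, 8)
print(mask.sum(), "allowed voxel-pixel pairs out of", mask.size)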

Urban Radiance Fields

The goal of this work is to perform 3D reconstruction and novel view synthesis from data captured by scanning platforms commonly deployed for world mapping in urban outdoor environments (e.g., Street View). Given a sequence of posed RGB images and lidar sweeps acquired by cameras and scanners moving through an outdoor scene, we produce a model from which 3D surfaces can be extracted and novel RGB images can be synthesized. Our approach extends Neural Radiance Fields, which has been demonstrated to synthesize realistic novel images for small scenes in controlled settings, with new methods for leveraging asynchronously captured lidar data, for addressing exposure variation between captured images, and for leveraging predicted image segmentations to supervise densities on rays pointing at the sky. Each of these three extensions provides significant performance improvements in experiments on Street View data. Our system produces state-of-the-art 3D surface reconstructions and synthesizes higher quality novel views in comparison to both traditional methods (e.g.~COLMAP) and recent neural representations (e.g.~Mip-NeRF).
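As a hedged sketch of the sky-segmentation idea (a simplified PyTorch-style loss, not the exact formulation used in the paper), one can push the volume densities sampled along sky-labelled rays towards zero, so that no geometry is reconstructed in the sky:

import torch

def sky_density_loss(sigmas, is_sky_ray):
    """sigmas: (num_rays, num_samples) predicted volume densities.
    is_sky_ray: (num_rays,) bool from a 2D sky-segmentation model.
    Penalizes any density on rays that the segmentation labels as sky."""
    sky_sigmas = sigmas[is_sky_ray]                # densities on sky rays only
    if sky_sigmas.numel() == 0:
        return sigmas.new_zeros(())
    return sky_sigmas.pow(2).mean()                # encourage zero density

# Toy usage with random predictions.
sigmas = torch.rand(4, 16)                         # 4 rays, 16 samples each
is_sky = torch.tensor([True, False, True, False])
print(sky_density_loss(sigmas, is_sky))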

Transferability Metrics

Transfer learning has become a popular method for leveraging pre-trained models in computer vision. However, without performing computationally expensive fine-tuning, it is difficult to quantify which pre-trained source models are suitable for a specific target task, or, conversely, to which target tasks a given pre-trained source model can easily be adapted. Transferability metrics try to evaluate the fit of a source model for a particular target. We have proposed two new transferability metrics.


Moreover, we have conducted a large-scale study by systematically constructing a broad range of 715k experimental setup variations. We discover that even small variations to an experimental setup lead to different conclusions about the superiority of one transferability metric over another. We then propose better evaluations that aggregate across many experiments, enabling more stable conclusions to be reached.
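To make the idea of a transferability metric concrete, here is a deliberately simple, generic proxy (not one of the metrics proposed in our papers): score a frozen source model by how well a nearest-class-centroid classifier separates its features on the target training set, with no fine-tuning involved.

import numpy as np

def nearest_centroid_score(features, labels):
    """Cheap transferability proxy: accuracy of a nearest-class-centroid
    classifier on frozen source-model features of the target data."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(0) for c in classes])
    dists = np.linalg.norm(features[:, None] - centroids[None], axis=-1)
    preds = classes[dists.argmin(1)]
    return float((preds == labels).mean())

# Toy usage: 2D "features" extracted by a frozen source model, 3 target classes.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
features = rng.normal(size=(300, 2)) + 3.0 * np.eye(3)[labels][:, :2]
print(nearest_centroid_score(features, labels))   # higher = better expected transfer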


Resources

Shape-conditioned Radiance Fields

We present a method for estimating neural scene representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner, and they faithfully represent the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate the effectiveness of our approach in several experiments on both synthetic and real images.
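The inference step is essentially test-time optimization. The PyTorch sketch below uses toy stand-in networks (the real shape decoder and renderer are far richer) to illustrate fitting a shape code, an appearance code, and the network weights to a single test image:

import torch

# Toy stand-ins for the paper's networks: a shape decoder producing a tiny
# flattened "voxel grid", and a renderer mapping scaffold + appearance to an image.
shape_decoder = torch.nn.Linear(64, 8 * 8)
render = torch.nn.Linear(8 * 8 + 64, 8 * 8)

def fit_to_image(image, steps=200, lr=1e-2):
    """Test-time optimization: fit a shape latent, an appearance latent and the
    network weights to one test image by minimizing a photometric loss."""
    shape_code = torch.zeros(1, 64, requires_grad=True)
    appearance_code = torch.zeros(1, 64, requires_grad=True)
    params = [shape_code, appearance_code]
    params += list(shape_decoder.parameters()) + list(render.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        voxels = shape_decoder(shape_code)                        # geometric scaffold
        rendering = render(torch.cat([voxels, appearance_code], -1))
        loss = torch.nn.functional.mse_loss(rendering, image)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return shape_code.detach(), appearance_code.detach()

target = torch.rand(1, 8 * 8)                                     # stand-in test image
shape_code, appearance_code = fit_to_image(target)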

Transfer Learning across Diverse Appearance Domains and Task Types

Transfer learning enables re-using knowledge learned on a source task to help learn a target task. A simple form of transfer learning is common in current state-of-the-art computer vision models: pre-training a model for image classification on the ILSVRC dataset, and then fine-tuning it on the target task. However, previous systematic studies of transfer learning have been limited and the circumstances in which it is expected to work are not fully understood. In this work we carry out an extensive experimental exploration of transfer learning across vastly different image domains (consumer photos, autonomous driving, aerial imagery, underwater, indoor scenes, synthetic, close-ups) and task types (semantic segmentation, object detection, depth estimation, keypoint detection). Importantly, these are all complex, structured-output task types relevant to modern computer vision applications. In total we carry out over 1200 transfer experiments, including many where the source and target come from different image domains, task types, or both. We systematically analyse these experiments to understand the impact of image domain, task type, and dataset size on transfer learning performance. Our study leads to several insights and concrete recommendations for practitioners.
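For completeness, the common ILSVRC-pretrain-then-fine-tune recipe mentioned above looks roughly like this in PyTorch/torchvision (a generic sketch with a hypothetical 20-class target task; the study itself covers far more varied source/target pairs and structured-output tasks):

import torch
import torchvision

# Source task: a backbone pre-trained for ILSVRC image classification.
model = torchvision.models.resnet50(weights="IMAGENET1K_V2")

# Target task: replace the classification head and fine-tune everything.
num_target_classes = 20                      # hypothetical target dataset
model.fc = torch.nn.Linear(model.fc.in_features, num_target_classes)

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()

def fine_tune_step(images, labels):
    """One fine-tuning step on a target-task batch."""
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()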

Resources

From points to multi-object 3D Reconstruction

We propose a method to detect and reconstruct multiple 3D objects from a single RGB image. The key idea is to optimize for detection, alignment and shape jointly over all objects in the RGB image, while focusing on realistic and physically plausible reconstructions. To this end, we propose a keypoint detector that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes -- all in a single forward pass.
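For intuition, a center-point detector in this spirit can be sketched as follows (a hypothetical minimal head, not our released model): one heatmap channel per class localizes object centers, and dense regression channels carrying the remaining properties (e.g. 9-DoF box parameters and a shape code) are read out at each detected center in a single forward pass.

import torch

class CenterPointHead(torch.nn.Module):
    """Toy center-point head: a per-class heatmap plus dense per-pixel regression
    of the remaining object properties (e.g. 9-DoF box parameters, shape code)."""

    def __init__(self, in_channels=64, num_classes=10, num_properties=9 + 32):
        super().__init__()
        self.heatmap = torch.nn.Conv2d(in_channels, num_classes, 1)
        self.properties = torch.nn.Conv2d(in_channels, num_properties, 1)

    def forward(self, features, k=10):
        heat = torch.sigmoid(self.heatmap(features))           # (B, C, H, W)
        props = self.properties(features)                      # (B, P, H, W)
        b, c, h, w = heat.shape
        scores, idx = heat.view(b, -1).topk(k)                 # top-k center candidates
        spatial = idx % (h * w)                                # location within the map
        centers = torch.stack([spatial % w, spatial // w], -1) # (x, y) per detection
        flat = props.view(b, props.shape[1], -1)
        gathered = flat.gather(2, spatial.unsqueeze(1).expand(-1, props.shape[1], -1))
        return scores, centers, gathered.transpose(1, 2)       # properties at each center

features = torch.rand(1, 64, 32, 32)                           # backbone feature map
scores, centers, object_properties = CenterPointHead()(features)
print(object_properties.shape)                                 # torch.Size([1, 10, 41])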

Vid2CAD: CAD Model Alignment using Multi-view Constraints from Videos

We address the task of aligning CAD models to a video sequence of a complex scene containing multiple objects. Our method is able to process arbitrary videos and fully automatically recover the 9 DoF pose for each object appearing in it, thus aligning them in a common 3D coordinate frame. The core idea of our method is to integrate neural network predictions from individual frames with a temporally global, multi-view constraint optimization formulation. This integration process resolves the scale and depth ambiguities in the per-frame predictions, and generally improves the estimate of all pose parameters. By leveraging multi-view constraints, our method also resolves occlusions and handles objects that are out of view in individual frames, thus reconstructing all objects into a single globally consistent CAD representation of the scene.

CoReNet: Coherent 3D scene reconstruction

We introduce a method for 3D reconstruction from a single RGB image. We build on common encoder-decoder architectures, and we propose three extensions: (1) ray-traced skip connections that propagate local 2D information to the output 3D volume in a physically correct manner; (2) a hybrid 3D volume representation that enables building translation equivariant models, while at the same time encoding fine object details without an excessive memory footprint; (3) a reconstruction loss tailored to capture overall object geometry. Furthermore, we adapt our model to address the harder task of reconstructing multiple objects from a single image. We reconstruct all objects jointly in one pass, producing a coherent reconstruction, where all objects live in a single consistent 3D coordinate frame relative to the camera and they do not intersect in 3D space. We also handle occlusions and resolve them by hallucinating the missing object parts in the 3D volume.
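A minimal way to picture a ray-traced skip connection (an illustrative snippet, not the CoReNet implementation): each output voxel is projected with the camera intrinsics into the 2D feature map, and the feature at that pixel is copied into the voxel, so local 2D evidence lands at the physically corresponding 3D location.

import torch

def ray_traced_skip(feat2d, voxel_centers, K):
    """feat2d: (C, H, W) 2D feature map; voxel_centers: (N, 3) points in the
    camera frame; K: (3, 3) intrinsics. Returns per-voxel features (N, C) by
    copying the 2D feature from the pixel each voxel projects to."""
    C, H, W = feat2d.shape
    uvw = voxel_centers @ K.T                          # project voxel centers
    u = (uvw[:, 0] / uvw[:, 2]).round().long().clamp(0, W - 1)
    v = (uvw[:, 1] / uvw[:, 2]).round().long().clamp(0, H - 1)
    return feat2d[:, v, u].T                           # (N, C) gathered features

# Toy usage: a 4x4x4 voxel grid placed 2 to 3 metres in front of the camera.
zs = torch.linspace(2.0, 3.0, 4)
xs = ys = torch.linspace(-0.5, 0.5, 4)
grid = torch.stack(torch.meshgrid(xs, ys, zs, indexing="ij"), -1).reshape(-1, 3)
K = torch.tensor([[16.0, 0, 8.0], [0, 16.0, 8.0], [0, 0, 1.0]])
voxel_feats = ray_traced_skip(torch.rand(32, 16, 16), grid, K)
print(voxel_feats.shape)                               # torch.Size([64, 32])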

Resources

Neural Voxel Renderer

We present a neural rendering framework that maps a voxelized scene into a high quality image. Our method realistically renders highly-textured objects, complete with shading, shadows, reflections and specular highlights on the ground. Moreover, our approach allows controllable rendering: the user can edit the object geometry, texture appearance, and the light and camera positions. All these modifications are accurately represented in the output rendering.

Resources

Open Images Dataset V7

Open Images V7 is a dataset of ~9 million images that have been annotated with image-level labels, object bounding boxes, visual relationships, object instance segmentations, point-level labels, and localized narratives.  


Resources

Localized Narratives

We propose Localized Narratives, a new form of multimodal image annotations connecting vision and language. We ask annotators to describe an image with their voice while simultaneously hovering their mouse over the region they are describing. Since the voice and the mouse pointer are synchronized, we can localize every single word in the description. This dense visual grounding takes the form of a mouse trace segment per word and is unique to our data. We annotated 849k images with Localized Narratives: the whole COCO, Flickr30k, and ADE20K datasets, and 671k images of Open Images, all of which we make publicly available. We provide an extensive analysis of these annotations showing they are diverse, accurate, and efficient to produce. We also demonstrate their utility on the application of controlled image captioning.
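In practice, each annotation pairs every word of the transcription with the mouse-trace segment drawn while that word was spoken. A simplified, hypothetical record could look like this (the released format uses different field names and additional metadata):

# A simplified, hypothetical Localized Narrative record.
narrative = {
    "image_id": "0001",
    "caption": "A brown dog running on the beach",
    "timed_words": [
        # each spoken word with its utterance time span (seconds)
        {"word": "dog", "start_time": 1.2, "end_time": 1.5},
        {"word": "beach", "start_time": 2.8, "end_time": 3.1},
    ],
    "trace": [
        # mouse positions (normalized x, y) with timestamps (seconds)
        {"x": 0.42, "y": 0.55, "t": 1.25},
        {"x": 0.80, "y": 0.90, "t": 2.95},
    ],
}

def trace_segment_for_word(narrative, word):
    """Grounding: the trace points recorded while the given word was spoken."""
    timing = next(w for w in narrative["timed_words"] if w["word"] == word)
    return [p for p in narrative["trace"]
            if timing["start_time"] <= p["t"] <= timing["end_time"]]

print(trace_segment_for_word(narrative, "dog"))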

Resources

Localized Narratives for Image Retrieval

We present an image retrieval setup where the user simultaneously describes an image using both spoken natural language (the "what") and mouse traces over an empty canvas (the "where") to express the characteristics of the desired target image. To this end, we learn an image retrieval model using the Localized Narratives dataset, which is capable of performing early fusion between text descriptions and synchronized mouse traces. Qualitative and quantitative experiments show that our model is capable of taking this spatial guidance into account, and provides more accurate retrieval results compared to text-only equivalent systems.

Resources

Towards reusable network components

We propose to make a first step towards compatible and hence reusable network components. Rather than training networks for different tasks independently, we adapt the training process to produce network components that are compatible across tasks. In particular, we split a network into two components, a feature extractor and a target task head, and propose various approaches to accomplish compatibility between them. We systematically analyse these approaches on the task of image classification on standard datasets. We demonstrate that we can produce components which are directly compatible without any fine-tuning or compromising accuracy on the original tasks. Afterwards, we demonstrate the use of compatible components on three applications: unsupervised domain adaptation, transferring classifiers across feature extractors with different architectures, and increasing the computational efficiency of transfer learning.
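Conceptually, compatibility means a feature extractor trained alongside one task head can be plugged into a head trained elsewhere without fine-tuning. A hypothetical PyTorch sketch of the interface:

import torch

# Two independently trained components that share an interface: a feature
# extractor producing d-dimensional features, and task heads consuming them.
d = 256

extractor_a = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 32 * 32, d))
head_task_1 = torch.nn.Linear(d, 10)      # e.g. trained jointly with extractor_a
head_task_2 = torch.nn.Linear(d, 100)     # e.g. trained with a different extractor

def predict(extractor, head, images):
    """Compatible components: any extractor/head pair with a matching feature
    dimension can be composed directly, without fine-tuning."""
    with torch.no_grad():
        return head(extractor(images))

images = torch.rand(4, 3, 32, 32)
print(predict(extractor_a, head_task_1, images).shape)   # torch.Size([4, 10])
print(predict(extractor_a, head_task_2, images).shape)   # torch.Size([4, 100])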

Resources

Interactive Object Segmentation

We developed a technique for annotating object segmentation masks where a human annotator and a machine segmentation model collaborate. The machine proposes an initial segmentation, and then the system iterates between (1) the human clicking on errors in the current segmentation; and (2) the machine incorporating these clicks to refine the segmentation. This strategy delivers high quality object segmentation masks with only ~10 clicks per instance. By re-annotating part of the COCO dataset, we have shown that we can produce instance masks 3x faster than traditional polygon drawing tools while also providing better quality.
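The annotation loop can be summarized as follows (a schematic sketch with placeholder model and annotator functions, not our production system):

import numpy as np

def annotate_interactively(image, model, annotator, max_clicks=10):
    """Human-machine collaboration loop: the model proposes a mask, the human
    clicks on an error, and the model refines the mask given all clicks so far."""
    clicks = []                                   # list of (y, x, is_foreground)
    mask = model(image, clicks)                   # initial machine proposal
    for _ in range(max_clicks):
        click = annotator(image, mask)            # human points at an error, or stops
        if click is None:
            break
        clicks.append(click)
        mask = model(image, clicks)               # refined with the new click
    return mask

# Toy stand-ins so the loop runs end-to-end.
toy_model = lambda image, clicks: np.zeros(image.shape[:2], dtype=bool)
toy_annotator = lambda image, mask: None          # immediately satisfied
final_mask = annotate_interactively(np.zeros((64, 64, 3)), toy_model, toy_annotator)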

We used this technique on CrowdCompute to collect 2.7M instance masks over 365 categories in Open Images (released with Open Images V5; see the CVPR'19 paper).

In a follow-up work (ECCV'20) we capitalize on the observation that clicks on errors provide ground-truth examples for individual pixels, and use that to adapt our model on-the-fly to the current image. This allows it to generalize better to changing distributions and domains.

Resources

Research group at Google Zurich

While at Google, I built two research groups in the areas below. I also co-led an applied research group focused on launching technology in Google products.

The first area is visual learning. We aim at learning visual models from large image datasets where ground-truth location annotations are available for only a small fraction of all images. We develop techniques based on two general ideas. The first is lifelong learning, where the computer continuously learns new models by building on all the knowledge it acquired in the past. This helps bridge the lack of location supervision for the majority of the data, and build an integrated, coherent body of visual knowledge. The second idea is human-machine collaboration, where a human annotator and a machine learning model work together to annotate a dataset. This improves efficiency over fully manual annotation, and provides the machine with valuable labelled examples of what it does not already know.

The second area is 3D Deep Learning. Our long-term goal is to enable full-scene 3D reconstruction from a single RGB image. We want to address complex scenes with multiple objects partially occluding each other. We plan to exploit learned 3D scene priors to address this underconstrained problem, especially priors on object shapes and their spatial arrangements into characteristic scene patterns, such as chairs around a table. In order to learn effectively we rely heavily on generating photorealistic synthetic data. Along the way we develop various techniques for reconstructing 3D objects from images, fitting CAD models to videos, and neural rendering.

Check out the publication list!

Research group at University of Edinburgh

Between January 2012 and October 2019 I led the CALVIN research group at the University of Edinburgh. My last PhD student has graduated, so that group is now closed. Check out the CALVIN publication list!