Transframer: Arbitrary Frame Prediction with Generative Models

Abstract

We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames, and outputs sequences of sparse, compressed image features. Transframer is state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30-second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction, with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.



A framework for general visual prediction. Given a collection of context images with associated annotations (time-stamps, camera viewpoints, etc.) and a query annotation, the task is to predict a probability distribution over the target image. This framework supports a range of visual prediction tasks, including video modelling, novel view synthesis, and multi-task vision.
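To make the framework concrete, the sketch below spells out the prediction interface it implies: a conditional model over target images given annotated context frames and a query annotation. The class and method names (AnnotatedFrame, FramePredictor, log_prob, sample) are illustrative assumptions, not the authors' released API.

```python
# Minimal sketch of the annotated frame-prediction interface, assuming
# illustrative names; not the authors' actual code.
from dataclasses import dataclass
from typing import Dict, Sequence

import numpy as np


@dataclass
class AnnotatedFrame:
    """An image paired with its annotation (e.g. time-stamp or camera pose)."""
    image: np.ndarray                   # H x W x C pixel array
    annotation: Dict[str, np.ndarray]   # e.g. {"time": ..., "camera": ...}


class FramePredictor:
    """Conditional model p(target image | context frames, query annotation)."""

    def log_prob(self,
                 context: Sequence[AnnotatedFrame],
                 query_annotation: Dict[str, np.ndarray],
                 target_image: np.ndarray) -> float:
        """Evaluates the probability of a candidate target image."""
        raise NotImplementedError

    def sample(self,
               context: Sequence[AnnotatedFrame],
               query_annotation: Dict[str, np.ndarray]) -> np.ndarray:
        """Draws a target image from the predictive distribution."""
        raise NotImplementedError
```

Video modelling, novel view synthesis, and the multi-task image-to-image problems below all fit this interface; only the content of the annotations changes.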

Transframer takes as input context DCT-images (left and middle), as well as a partially observed DCT-image of the target (right) and additional annotations a. The inputs are processed by a multi-frame U-Net encoder, which operates at a number of spatial resolutions. U-Net outputs are passed to a DCTransformer decoder via cross-attention, which autoregressively generates a sequence of DCT tokens corresponding to the unseen portion of the target image (shown in green).
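The generation procedure described in this caption is essentially an encode-then-decode loop: encode all DCT-images with the multi-frame U-Net, then autoregressively sample DCT tokens for the unseen target region. The sketch below lays out that loop with placeholder `unet_encode` and `decoder_next_token_logits` functions; the function names, vocabulary size, and end-token id are assumptions for illustration only.

```python
# Sketch of the autoregressive DCT-token generation loop, with placeholder
# encoder/decoder functions standing in for the multi-frame U-Net and the
# DCTransformer decoder.
import numpy as np

VOCAB_SIZE = 1024   # assumed size of the DCT token vocabulary
END_TOKEN = 0       # assumed end-of-image token id


def unet_encode(context_dct_images, partial_target_dct, annotations):
    """Placeholder for the multi-frame U-Net: returns feature vectors
    gathered from its several spatial resolutions."""
    return np.zeros((256, 512))  # (num_features, feature_dim)


def decoder_next_token_logits(encoder_features, generated_tokens):
    """Placeholder for one DCTransformer decoder step: cross-attends to the
    U-Net features and self-attends over previously generated DCT tokens."""
    return np.random.randn(VOCAB_SIZE)


def generate_target_tokens(context_dct_images, partial_target_dct,
                           annotations, max_tokens=768, rng=np.random):
    """Autoregressively samples the DCT tokens of the unseen target region."""
    features = unet_encode(context_dct_images, partial_target_dct, annotations)
    tokens = []
    for _ in range(max_tokens):
        logits = decoder_next_token_logits(features, tokens)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        token = rng.choice(VOCAB_SIZE, p=probs)
        if token == END_TOKEN:
            break
        tokens.append(token)
    return tokens
```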

Multi-frame U-Net blocks consist of NF-Net convolutional blocks, multi-frame self-attention blocks (which exchange information across input frames), and a Transformer-style residual MLP.
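The following simplified numpy sketch shows how such a block could be composed. It assumes single-head attention applied over the frame axis at each spatial location and uses a pointwise (1x1) convolution in place of a full NF-Net residual block; the real components are more involved than shown here.

```python
# Simplified sketch of one multi-frame U-Net block: conv block, attention
# across frames, then a residual MLP. All weights are passed in `params`.
import numpy as np


def conv_block(x, w):
    """Stand-in for an NF-Net block: per-frame pointwise conv + ReLU."""
    # x: (frames, height, width, channels), w: (channels, channels)
    return np.maximum(x @ w, 0.0)


def multi_frame_attention(x, wq, wk, wv):
    """Single-head self-attention over the frame axis at each (h, w) location."""
    f, h, w, c = x.shape
    q, k, v = x @ wq, x @ wk, x @ wv                    # (f, h, w, c) each
    # Move frames to the attention axis so information mixes across frames.
    q = q.transpose(1, 2, 0, 3)                         # (h, w, f, c)
    k = k.transpose(1, 2, 0, 3)
    v = v.transpose(1, 2, 0, 3)
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(c)   # (h, w, f, f)
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    out = (attn @ v).transpose(2, 0, 1, 3)              # back to (f, h, w, c)
    return x + out                                      # residual connection


def residual_mlp(x, w1, w2):
    """Transformer-style two-layer MLP with a residual connection."""
    return x + np.maximum(x @ w1, 0.0) @ w2


def multi_frame_unet_block(x, params):
    x = conv_block(x, params["conv"])
    x = multi_frame_attention(x, params["wq"], params["wk"], params["wv"])
    return residual_mlp(x, params["mlp1"], params["mlp2"])
```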

RoboNet (128x128, action-conditional)

RoboNet (64x64, action-conditional)

KITTI (64x64)

30 second video generation

We show 30-second video generations prompted from only a single image (the first frame). The lower-middle image prompt was created by Onofre Bouvila on Wikipedia under the CC-BY 2.5 license; the other image prompts were captured by the authors.
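Long videos like these can be produced by rolling the model forward frame-by-frame, feeding its own samples back in as context. The sketch below illustrates that loop; the `sample_next_frame` placeholder, the context-window length, and the assumed frame rate are all illustrative choices, not details taken from the paper.

```python
# Sketch of an autoregressive video rollout from a single prompt image.
import numpy as np


def sample_next_frame(context_frames, context_times, query_time):
    """Placeholder: samples the frame at `query_time` given annotated context."""
    return np.zeros_like(context_frames[-1])


def rollout(first_frame, num_frames=30 * 25, context_size=8):
    """Generates a video frame-by-frame, reusing recent samples as context."""
    frames = [first_frame]
    for t in range(1, num_frames):
        context = frames[-context_size:]                  # most recent frames
        times = list(range(max(0, t - context_size), t))  # their time-stamps
        frames.append(sample_next_frame(context, times, query_time=t))
    return np.stack(frames)
```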

ShapeNet (1 context view, 128x128)

ShapeNet (2 context views, 128x128)

Objectron (1 context view, 192x192)

Objectron (2 context views, 192x192)

Multi-task image-to-image modelling