Transframer: Arbitrary Frame Prediction with Generative Models
Abstract
We present a general-purpose framework for image modelling and vision tasks based on probabilistic frame prediction. Our approach unifies a broad range of tasks, from image segmentation to novel view synthesis and video interpolation. We pair this framework with an architecture we term Transframer, which uses U-Net and Transformer components to condition on annotated context frames and outputs sequences of sparse, compressed image features. Transframer is state-of-the-art on a variety of video generation benchmarks, is competitive with the strongest models on few-shot view synthesis, and can generate coherent 30-second videos from a single image without any explicit geometric information. A single generalist Transframer simultaneously produces promising results on 8 tasks, including semantic segmentation, image classification and optical flow prediction, with no task-specific architectural components, demonstrating that multi-task computer vision can be tackled using probabilistic image models. Our approach can in principle be applied to a wide range of applications that require learning the conditional structure of annotated image-formatted data.
A framework for general visual prediction. Given a collection of context images with associated annotations (time-stamps, camera viewpoints, etc.), and a query annotation, the task is to predict a probability distribution over the target image. This framework supports a range of visual prediction tasks, including video modelling, novel view synthesis, and multi-task vision.
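To make the interface concrete, here is a minimal Python sketch of the prediction problem the framework defines; the names `ContextFrame` and `FramePredictor` are hypothetical and not from the paper, and serve only to spell out the inputs and outputs.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence
import numpy as np

@dataclass
class ContextFrame:
    """An observed image together with its annotation (e.g. time-stamp or camera pose)."""
    image: np.ndarray       # [H, W, 3] image
    annotation: np.ndarray  # e.g. a time-stamp or a flattened camera matrix

class FramePredictor(Protocol):
    """Maps annotated context frames and a query annotation to a distribution over the target image."""
    def log_prob(self, context: Sequence[ContextFrame], query: np.ndarray, target: np.ndarray) -> float:
        """Log-likelihood of a candidate target image under the predicted distribution."""
        ...
    def sample(self, context: Sequence[ContextFrame], query: np.ndarray) -> np.ndarray:
        """Draw one target image consistent with the context and the query annotation."""
        ...
```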
Transframer takes as input context DCT-images (left and middle), as well as a partially observed DCT-image of the target (right) and additional annotations a. The inputs are processed by a multi-frame U-Net encoder, which operates at a number of spatial resolutions. U-Net outputs are passed to a DCTransformer decoder via cross-attention, which autoregressively generates a sequence of DCT tokens corresponding to the unseen portion of the target image (shown in green).
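Below is a highly simplified, hypothetical PyTorch sketch of this data flow. The single-resolution dense encoder, the generic token vocabulary, and all module names and sizes are assumptions made for illustration; the actual model encodes sparse DCT representations at multiple resolutions with a multi-frame U-Net and decodes with a DCTransformer. The structural point it shows is that the decoder cross-attends to the encoder's feature maps while a causal mask keeps token generation autoregressive.

```python
import torch
import torch.nn as nn

class TransframerSketch(nn.Module):
    """Simplified data flow: encode context frames, then autoregressively
    decode target tokens with cross-attention to the encoded features."""

    def __init__(self, d_model=256, vocab_size=1024, n_heads=8, n_layers=4):
        super().__init__()
        # Stand-in for the multi-frame U-Net encoder: per-frame conv features.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, d_model, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Conv2d(d_model, d_model, kernel_size=3, stride=2, padding=1),
        )
        self.token_embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, context_frames, target_tokens):
        # context_frames: [B, F, 3, H, W]; target_tokens: [B, T] token ids.
        b, f, c, h, w = context_frames.shape
        feats = self.encoder(context_frames.flatten(0, 1))        # [B*F, D, h', w']
        memory = feats.flatten(2).transpose(1, 2)                 # [B*F, h'*w', D]
        memory = memory.reshape(b, -1, memory.shape[-1])          # [B, F*h'*w', D]

        tgt = self.token_embed(target_tokens)                     # [B, T, D]
        t = tgt.shape[1]
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal)          # cross-attends to encoder features
        return self.head(out)                                     # [B, T, vocab_size] next-token logits
```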
Multi-frame U-Net blocks consist of NF-Net convolutional blocks, multi-frame self-attention blocks, which exchange information across input frames, and a Transformer-style residual MLP.
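A rough PyTorch-style sketch of one such block follows; the plain residual convolution stands in for the NF-Net block, and the layer sizes and attention layout are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class MultiFrameBlock(nn.Module):
    """Sketch of one multi-frame U-Net block: a residual conv block (stand-in
    for the NF-Net block), self-attention that mixes information across frames
    at each spatial position, and a Transformer-style residual MLP."""

    def __init__(self, channels=128, n_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.GELU(), nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.GELU(), nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, n_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.LayerNorm(channels),
            nn.Linear(channels, 4 * channels), nn.GELU(),
            nn.Linear(4 * channels, channels),
        )

    def forward(self, x):
        # x: [B, F, C, H, W] features for F input frames.
        b, f, c, h, w = x.shape
        x = x + self.conv(x.flatten(0, 1)).reshape(b, f, c, h, w)   # per-frame convolution

        # Attend across frames independently at every spatial location.
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)  # [B*H*W, F, C]
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)
        tokens = tokens + attended
        tokens = tokens + self.mlp(tokens)                          # Transformer-style residual MLP
        return tokens.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)  # back to [B, F, C, H, W]
```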