Learning from One Continuous Video Stream  

CVPR 2024

Joao Carreira1 †, Michael King1 †, Viorica Patraucean1 †, Dilara Gokay1 †, Catalin Ionescu1 †

Yi Yang, Daniel Zoran, Joseph Heyward, Carl Doersch, Yusuf Aytar, Dima Damen† ‡, Andrew Zisserman† ♢

† Google DeepMind, ‡ University of Bristol, ♢ University of Oxford, 1 Core contributor

Corresponding author: joaoluis@google.com

Abstract

We introduce a framework for online learning from a single continuous video stream -- the way people and animals learn, without mini-batches, data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames, and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets, plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks, without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks, found that momentum hurts, and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1, when using the same architecture and without costly replay buffers.


"Natural learning"

Many have imagined AI systems of the future, like the robot in Short Circuit, to have exceptional, super-human learning abilities -- in the original sense of the word "learning" -- from a single experience stream combining vision, audio, proprioception, touch, etc. 

Current AI systems, while impressive, only learn offline. This paper is a step towards creating the ability to learn "naturally", as most animals do.

Framework

Top: We introduce a framework for studying continuous learning from a single video stream. This is a natural yet largely unstudied problem, different from standard independent and identically distributed (IID) learning in video, where batches contain clips from random videos in a random order.

Bottom: We propose to employ pixel-to-pixel models to evaluate our approach across prediction tasks (prediction of future frames, depth, segmentation). We measure both adaptation to the video stream (the model continuously updates its weights, i.e. learns, to improve its predictions) and generalization to out-of-stream clips (the model being adapted on the first stream is evaluated on a different, held-out stream without being allowed to adapt to it). We aim to maximize both adaptation and generalization.
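To make the two measurements concrete, here is a minimal sketch of how such an evaluation protocol could be wired up, assuming hypothetical `in_stream` and `held_out_stream` objects that yield (input clip, target) pairs in temporal order; this is an illustration, not the authors' code.

```python
import copy
import torch

def run_single_stream(model, optimizer, in_stream, held_out_stream,
                      loss_fn, eval_every=1000):
    """Adapt online on one video stream and periodically measure generalization.

    `in_stream` and `held_out_stream` are hypothetical iterables that yield
    (input_clip, target) pairs in temporal order; `held_out_stream` is assumed
    to be re-iterable (e.g. a list of pairs).
    """
    in_stream_losses, out_stream_losses = [], []
    for step, (clip, target) in enumerate(in_stream):
        # Adaptation: score the prediction *before* the weight update, so the
        # in-stream number reflects what the model knew at this point in time.
        loss = loss_fn(model(clip), target)
        in_stream_losses.append(loss.item())

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Generalization: evaluate a frozen copy on a held-out stream,
        # without ever letting the model adapt to that stream.
        if step % eval_every == 0:
            frozen = copy.deepcopy(model).eval()
            with torch.no_grad():
                losses = [loss_fn(frozen(c), t).item()
                          for c, t in held_out_stream]
            out_stream_losses.append(sum(losses) / len(losses))
    return in_stream_losses, out_stream_losses
```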

Results

Pretraining strategies

We introduce a novel family of pretraining methods that generalize future video prediction. All of them follow the same pattern: given one input video clip (we use 4 frames), the model is trained to predict a future clip from the same video, also 4 frames long. We consider three methods below, ordered from easiest to hardest, top to bottom.
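As a concrete illustration of this shared pattern, the sketch below cuts an (input, target) clip pair from a single video for a given temporal displacement; the tensor layout and the convention that the displacement is the gap between the two clips are assumptions made for this example.

```python
import torch

def sample_future_prediction_pair(video, start, displacement, clip_len=4):
    """Cut an (input, target) clip pair out of one video.

    `video` is assumed to be a (num_frames, channels, height, width) tensor,
    and `displacement` is taken here as the gap (in frames) between the end
    of the input clip and the start of the target clip.
    """
    input_clip = video[start : start + clip_len]
    target_start = start + clip_len + displacement
    target_clip = video[target_start : target_start + clip_len]
    return input_clip, target_clip
```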

Pretraining is done on the Kinetics-700-2020 dataset with a ViT-L. In addition, we add a linear classification head just before the decoder (the decoder is a stack of 4 self-attention layers on top of the ViT-L encoder) and employ a standard cross-entropy loss between logits and ground-truth labels, in conjunction with a stop gradient so that this supervision does not influence the backbone weights.
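A minimal sketch of such a stop-gradient linear readout, written PyTorch-style with `.detach()` standing in for the stop gradient; the token pooling and feature layout are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearProbe(nn.Module):
    """Linear classification head fed with detached encoder features,
    so the cross-entropy loss never updates the backbone."""

    def __init__(self, feature_dim, num_classes=700):  # Kinetics-700 classes
        super().__init__()
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, encoder_tokens, labels):
        # encoder_tokens: (batch, num_tokens, feature_dim), an assumed layout.
        features = encoder_tokens.detach().mean(dim=1)  # stop gradient + pool
        logits = self.head(features)
        return F.cross_entropy(logits, labels)
```

Because the features are detached, the cross-entropy loss only trains the linear head, leaving the pixel loss as the sole training signal for the backbone.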

Guided future prediction

Guided future prediction overlays a few patches from the future clip onto the input video clip, narrowing down the range of possible futures.

[Video examples: Input, Prediction, Target]
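One simple way this overlay could be implemented is to copy a small, randomly chosen fraction of patch-sized regions from the target clip into the same positions of the input clip; the patch size and fraction below are illustrative choices, not values from the paper.

```python
import torch

def guide_with_future_patches(input_clip, target_clip, patch=16, fraction=0.05):
    """Overlay a few randomly chosen patches of the future clip onto the
    current clip. Clips are assumed to share shape (frames, C, H, W)."""
    guided = input_clip.clone()
    t, _, h, w = input_clip.shape
    per_frame = (h // patch) * (w // patch)
    n_total = t * per_frame
    n_guided = max(1, int(fraction * n_total))
    for idx in torch.randperm(n_total)[:n_guided].tolist():
        f, rest = divmod(idx, per_frame)
        row, col = divmod(rest, w // patch)
        ys, xs = row * patch, col * patch
        guided[f, :, ys:ys + patch, xs:xs + patch] = \
            target_clip[f, :, ys:ys + patch, xs:xs + patch]
    return guided
```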

Vanilla future prediction

Vanilla future prediction is simply the standard task of predicting the future clip from the current one.

[Video examples: Input, Prediction, Target]

Masked future prediction

Masked future prediction is a variation of Masked AutoEncoding where the model must predict the future clip based on a partial view of the current clip. Note that this is strictly harder than vanilla future prediction, since the model cannot even fully see the present clip.

[Video examples: Input, Prediction, Target]
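A sketch of how the partial view could be produced, zeroing out a random subset of patches in the current clip before it is fed to the model; the mask ratio and the zero filling are assumptions, not the paper's exact choices.

```python
import torch

def mask_current_clip(input_clip, patch=16, mask_ratio=0.75):
    """Zero out a random subset of patches of the current clip; the model
    must still predict the *future* clip from this partial view."""
    masked = input_clip.clone()
    t, _, h, w = input_clip.shape
    per_frame = (h // patch) * (w // patch)
    n_masked = int(mask_ratio * t * per_frame)
    for idx in torch.randperm(t * per_frame)[:n_masked].tolist():
        f, rest = divmod(idx, per_frame)
        row, col = divmod(rest, w // patch)
        masked[f, :, row * patch:(row + 1) * patch,
                     col * patch:(col + 1) * patch] = 0.0
    return masked
```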

Quantitative results

Linear top-1 accuracy on Kinetics during pretraining for the different future prediction tasks (Vanilla, Guided, Masked) and different temporal displacements. The longer the displacement, the better the accuracy. For the longest displacement, Guided Future Prediction emerges as the best method in terms of top-1 accuracy.

BL vs STDL

We compare our best setup, which we call "Baby Learning" (BL), to a standard deep learning setup (STDL) on ScanNetV2 semantic segmentation. Both setups use the same ViT-L model.

STDL

STDL uses AdamW with standard hyperparameters (learning rate 1e-4, momentum β1 = 0.9) and weight updates after each batch, using the same ViT-L model but initialized from the popular ImageNet MAE checkpoint. We implement the IID setting by sampling a sequence of random time steps (and associated target time steps) from random videos of the same base dataset.

[Video examples: Input, Prediction, Target]
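The IID sampling described above can be sketched as follows, assuming a hypothetical in-memory `videos` list; the displacement value is illustrative, and the commented optimizer line simply restates the AdamW settings quoted in the text.

```python
import random
import torch

def sample_iid_batch(videos, batch_size=16, clip_len=4, displacement=8):
    """Assemble an IID batch: random videos, random (input, target) time steps.

    `videos` is a hypothetical list of (num_frames, C, H, W) tensors and the
    displacement of 8 frames is an illustrative choice.
    """
    inputs, targets = [], []
    for _ in range(batch_size):
        video = random.choice(videos)
        max_start = video.shape[0] - (2 * clip_len + displacement)
        start = random.randint(0, max_start)
        inputs.append(video[start : start + clip_len])
        targets.append(video[start + clip_len + displacement :
                             start + 2 * clip_len + displacement])
    return torch.stack(inputs), torch.stack(targets)

# Optimizer settings quoted in the text; beta2 is left at the PyTorch default.
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999))
```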

BL

For BL, we train the same model on a single continuous video stream.

[Video examples: Input, Prediction, Target]
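For contrast with the IID sampler above, here is a sketch of a single-stream update loop in the spirit of BL: clips arrive in temporal order, the optimizer carries no momentum (the abstract reports that momentum hurts), and gradients are accumulated over a few clips before each update as one way to control the pace of weight updates. The specific optimizer and update period are assumptions, not the paper's exact recipe.

```python
import torch

def train_on_stream(model, stream, loss_fn, lr=1e-4, update_every=4):
    """Online learning on one continuous stream: no shuffling, no mini-batches.

    `stream` is assumed to yield consecutive (input_clip, target_clip) pairs
    in their original temporal order; gradients are accumulated over
    `update_every` clips before each weight update.
    """
    # No momentum: the paper reports that momentum hurts in this setting.
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.0)
    optimizer.zero_grad()
    for step, (clip, target) in enumerate(stream, start=1):
        loss = loss_fn(model(clip), target)
        (loss / update_every).backward()   # accumulate scaled gradients
        if step % update_every == 0:       # pace of weight updates
            optimizer.step()
            optimizer.zero_grad()
```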

Quantitative results

Future prediction results for video streams from various datasets for two different temporal displacements (horizontal). We show results for the standard deep learning approach (STDL) on IID data with batch size 16 and 1, and for our approach (BL) when using a continuous video stream (Cont.). Two numbers are reported for every cell, corresponding to in-stream / out-of-stream performance. We highlight which of the two batch size 1 approaches performs best for each number.

Our approach BL matches STDL with batch size 1 out-of-stream while outperforming it in-stream.