CVPR 2025
Tengda Han†, Dilara Gokay†, Joseph Heyward†, Chuhan Zhang†,
Daniel Zoran†, Viorica Patraucean†, João Carreira†, Dima Damen† ‡, Andrew Zisserman† ♢
† Google DeepMind, ‡ University of Bristol, ♢ University of Oxford
Corresponding author: tengda@google.com
We address the challenge of representation learning from a continuous stream of video as input, in a self-supervised manner. This differs from the standard approaches to video learning where videos are chopped and shuffled during training in order to create a non-redundant batch that satisfies the independently and identically distributed (IID) sample assumption expected by conventional training paradigms. When videos are only available as a continuous stream of input, the IID assumption is evidently broken, leading to poor performance. We demonstrate the drop in performance when moving from shuffled to sequential learning on three tasks: the one-video representation learning method DoRA, standard VideoMAE on multi-video datasets, and the task of future video prediction. To address this drop, we propose a geometric modification to standard optimizers, to decorrelate batches by utilising orthogonal gradients during training. The proposed modification can be applied to any optimizer; we demonstrate it with Stochastic Gradient Descent (SGD) and AdamW. Our proposed orthogonal optimizer allows models trained from streaming videos to alleviate the drop in representation learning performance, as evaluated on downstream tasks. In all three scenarios (DoRA, VideoMAE, future prediction), we show that our orthogonal optimizer outperforms the strong AdamW baseline.
Humans learn from a continuous stream of video. For computer vision systems, however, learning from a continuous video stream is challenging.
In this setting, the gradients are highly correlated across training steps (top). This differs from standard deep learning approaches, which learn from shuffled video clips (the IID setting) and therefore see gradients that are largely uncorrelated across training steps (bottom). Performance drops when moving from shuffled to sequential training.
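This correlation is easy to check empirically. The following is a small diagnostic sketch we include for illustration (our own code, not from the paper): flatten all gradients after each backward pass and compare them with the previous step's gradients via cosine similarity. Sequential video input yields consistently high similarity, while shuffled batches yield values near zero.

    import torch
    import torch.nn.functional as F

    def grad_cosine_with_prev(model, prev_flat):
        # Flatten the current gradients of all parameters into one vector.
        flat = torch.cat([p.grad.reshape(-1)
                          for p in model.parameters() if p.grad is not None])
        # Cosine similarity with the previous step's flattened gradients.
        cos = None if prev_flat is None else F.cosine_similarity(flat, prev_flat, dim=0).item()
        return cos, flat.detach().clone()

    # Usage inside a training loop, after loss.backward():
    #   cos, prev_flat = grad_cosine_with_prev(model, prev_flat)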
We aim to enable learning from sequential videos, and we propose a straightforward method: since the gradients are temporally correlated, the model can instead learn from the orthogonal components of the gradients.
A simplified illustration is shown here.
(a) In common IID training, the gradients between consecutive steps are only weakly correlated, due to the IID sampling.
(b) When learning from sequential videos, the gradients between consecutive steps are highly correlated, which harms optimization. We propose to update the model parameters using only the orthogonal component of the current gradient, denoted u_t.
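Concretely, one natural instantiation (stated here for intuition, assuming the previous gradient g_{t-1} as the reference direction for a Gram-Schmidt-style projection) is:

    u_t = g_t - ( <g_t, g_{t-1}> / ||g_{t-1}||^2 ) g_{t-1}

By construction <u_t, g_{t-1}> = 0, so the update retains only the part of the current gradient that carries new information relative to the previous step.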
The proposed modification is applicable to many optimizers. Here we show it on two commonly used optimizers as examples: Orthogonal-SGD and Orthogonal-AdamW. The text in green indicates the additions to the original algorithms.
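As a rough sketch of the SGD variant in PyTorch (our own minimal rendering of the projection above; the state handling and the stabilising eps are assumptions, not the paper's exact algorithm):

    import torch

    def orthogonal_sgd_step(params, state, lr, eps=1e-8):
        # `state` maps each parameter to its gradient from the previous step.
        for p in params:
            if p.grad is None:
                continue
            g, prev = p.grad, state.get(p)
            if prev is not None:
                # Keep only the component of g_t orthogonal to g_{t-1}.
                coef = (g * prev).sum() / ((prev * prev).sum() + eps)
                g = g - coef * prev
            state[p] = p.grad.detach().clone()  # remember raw g_t for the next step
            with torch.no_grad():
                p.add_(g, alpha=-lr)  # plain SGD update on the orthogonal part u_t

Here `state` starts as an empty dict and persists across steps; the first step reduces to plain SGD.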
The key modification can be implemented in a few lines of code. We provide full implementations of Orthogonal-AdamW in JAX (optax) and PyTorch: link.
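For concreteness, here is a compressed PyTorch sketch of Orthogonal-AdamW (an approximation written for this page; the class name, the proj_eps constant, and the choice to decorrelate against the raw previous gradient are our assumptions, so please see the linked implementations for the exact algorithm):

    import torch

    class OrthogonalAdamW(torch.optim.AdamW):
        # Orthogonalize each gradient against the previous step's gradient,
        # then defer to the standard AdamW update.
        def __init__(self, params, proj_eps=1e-8, **adamw_kwargs):
            super().__init__(params, **adamw_kwargs)
            self.proj_eps = proj_eps
            self._prev_grad = {}  # parameter -> gradient from the previous step

        @torch.no_grad()
        def step(self, closure=None):
            for group in self.param_groups:
                for p in group["params"]:
                    if p.grad is None:
                        continue
                    raw = p.grad.detach().clone()
                    prev = self._prev_grad.get(p)
                    if prev is not None:
                        # In place: p.grad becomes u_t, the component of g_t
                        # orthogonal to the previous gradient.
                        coef = (p.grad * prev).sum() / ((prev * prev).sum() + self.proj_eps)
                        p.grad.sub_(coef * prev)
                    self._prev_grad[p] = raw  # assumed: store the raw g_t
            return super().step(closure)

Drop-in usage: optimizer = OrthogonalAdamW(model.parameters(), lr=1e-4, weight_decay=0.05), with the rest of the training loop unchanged; the cost is one extra gradient-sized buffer per parameter.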
We experiment with the proposed Orthogonal-AdamW optimizer on three sequential learning scenarios:
the one-video representation learning method DoRA,
standard VideoMAE on multi-video datasets,
and the task of future video prediction.
In all three scenarios, the orthogonal optimizer outperforms the strong AdamW baseline. Please refer to our paper for more details.
We thank James Martens for technical suggestions on the optimizers, and Jean-Baptiste Alayrac and Matthew Grimes for reviewing the manuscript. We also thank Carl Doersch, Ignacio Rocco, Michael King, Yi Yang and Yusuf Aytar for helpful discussions.
If you find our project helpful to your research, you can cite us with:
@InProceedings{han25orthogonal,
title={Learning from Streaming Video with Orthogonal Gradients},
author={Han, Tengda and Gokay, Dilara and Heyward, Joseph and Zhang, Chuhan and Zoran, Daniel and P{\u{a}}tr{\u{a}}ucean, Viorica and Carreira, Jo{\~a}o and Damen, Dima and Zisserman, Andrew},
booktitle={CVPR},
year={2025}}