Composable Augmentation Encoding for Video Representation Learning

Chen Sun, Arsha Nagrani, Yonglong Tian, and Cordelia Schmid

ICCV 2021 [arXiv] [GitHub]


Self-supervised contrastive methods focus on how to create positive and negative pairs of views, but discard the information about how each pair was generated once the views exist. Why not encode the pair generation method as well? For example, when spatial crops or temporal shifts are used to create pairs, what happens if we also encode the relative spatial coordinates or the time shift as an additional embedding?

Our hypothesis is that, given this additional information, the model can decide for itself which transformations to be invariant to. It can be invariant to shape deformations, or it can learn that over time an orange can be cut into slices, and use that knowledge to further minimize the contrastive loss.
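To make the idea of encoding the pair generation method concrete, here is a minimal sketch of how the time shift and the shift of crop-box coordinates between two views might be turned into a parameter vector. The names `ViewAugmentation` and `relative_encoding`, the use of seconds for time, and the normalised `(x0, y0, x1, y1)` box layout are assumptions made for illustration, not the released implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ViewAugmentation:
    """Hypothetical record of the augmentation parameters sampled for one view."""
    start_time: float  # clip start, in seconds (assumed unit)
    crop_box: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalised to [0, 1]


def relative_encoding(src: ViewAugmentation, dst: ViewAugmentation) -> List[float]:
    """Return [time shift, four box-coordinate shifts] of dst relative to src."""
    time_shift = dst.start_time - src.start_time
    box_shift = [d - s for s, d in zip(src.crop_box, dst.crop_box)]
    return [time_shift] + box_shift


# Two views of the same video: the second starts 3 s later and is cropped
# from a box shifted by a quarter of the frame in both directions.
view_1 = ViewAugmentation(start_time=2.0, crop_box=(0.0, 0.0, 0.5, 0.5))
view_2 = ViewAugmentation(start_time=5.0, crop_box=(0.25, 0.25, 0.75, 0.75))
encoding = relative_encoding(view_1, view_2)  # [3.0, 0.25, 0.25, 0.25, 0.25]
```

A vector like `encoding` is what would then be embedded and handed to the model alongside the visual features, so that the model knows which transformation separates the two views.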

Model Overview:

Positive pairs are constructed from the same instance. For each view, a random set of data augmentations (e.g. temporal shifting, spatial cropping) is sampled and applied. The views are then encoded by a shared visual encoder f. The encoded visual features, together with the parameterised and embedded data augmentations, are passed to a transformer head, which summarises the input sequence and generates projected features for contrastive learning. The head contains multiple layers; only the input layer is shown for simplicity. In this example, the bottom transformer head is tasked with predicting the features given the temporal augmentation (predict features t seconds ahead in time) and the spatial augmentation (shift of box coordinates) relative to the first view. The visual encoder f is transferred to the downstream tasks.
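The data flow described above can be sketched roughly as follows. This is a toy illustration under several assumptions, not the authors' code: augmentation parameters are embedded with a plain linear map, the multi-layer transformer head is replaced by a simple mean-pooling stand-in (`mean_pool`), and the contrastive objective is assumed to be an InfoNCE-style loss with cosine similarity. All function names and the temperature value are hypothetical.

```python
import math


def embed(params, weight):
    """Linear embedding of one augmentation's parameter vector into a d-dim token."""
    return [sum(w * p for w, p in zip(row, params)) for row in weight]


def build_head_input(aug_tokens, visual_tokens):
    """Sequence fed to the head: augmentation tokens followed by visual tokens."""
    return list(aug_tokens) + list(visual_tokens)


def mean_pool(tokens):
    """Stand-in for the transformer head (NOT a transformer): average the tokens."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(queries, keys, temperature=0.1):
    """Mean InfoNCE loss; queries[i] and keys[i] form the positive pair."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, k) / temperature for k in keys]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denominator - logits[i]
    return total / len(queries)


# Hypothetical 2-d example: one temporal-shift token plus one visual token.
time_weight = [[1.0], [2.0]]            # assumed, untrained 1 -> 2 projection
aug_token = embed([3.0], time_weight)   # [3.0, 6.0]
sequence = build_head_input([aug_token], [[1.0, 2.0]])
projected = mean_pool(sequence)         # [2.0, 4.0]

# Positives on the diagonal give a much lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In a real model, `mean_pool` would be replaced by the multi-layer transformer head and the visual tokens would come from the shared encoder f; the sketch only shows how augmentation tokens and visual tokens end up in one sequence whose summary is scored by the contrastive loss.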

Experimental Results:



author = {Chen Sun and Arsha Nagrani and Yonglong Tian and Cordelia Schmid},
title = {Composable Augmentation Encoding for Video Representation Learning},
booktitle = {ICCV},
year = {2021},