Compressed Vision for
Efficient Video Understanding

Olivia Wiles*, Joao Carreira, Iain Barr, Andrew Zisserman, Mateusz Malinowski*

* corresponding authors {oawiles, mateuszm}@deepmind.com

Abstract

Experience and reasoning occur across multiple temporal scales: milliseconds, seconds, hours, or days. The vast majority of computer vision research, however, still focuses on individual images or short videos lasting only a few seconds. This is because handling longer videos requires more scalable approaches, even just to process them.

In this work, we propose a framework enabling research on hour-long videos with the same hardware that can now process second-long videos. We replace standard video compression, e.g. JPEG, with neural compression and show that we can directly feed compressed videos as inputs to regular video networks.

Operating on compressed videos improves efficiency at all pipeline levels -- data transfer, speed and memory -- making it possible to train models faster and on much longer videos. Processing compressed signals, however, has the downside of precluding standard augmentation techniques if done naively. We address this by introducing a small network that applies transformations to the latent codes corresponding to commonly used augmentations in the original video space.

We demonstrate that with our compressed vision pipeline, we can train video models more efficiently on popular benchmarks such as Kinetics600 and COIN. We also perform proof-of-concept experiments with new tasks defined over hour-long videos at standard frame rates. Processing such long videos is impossible without using compressed representations.


Framework

Standard video pipeline

Standard video pipelines store videos with typical compression methods, then decompress them, apply augmentations, and train a downstream model. Even though this pipeline is common, it is inefficient.

Our Compressed Vision pipeline

Compressed Vision uses a neural compressor to achieve better compression, but instead of decompressing videos before processing them, it performs all the computation (augmentations and classification) directly in the compressed space. This saves time and space, allowing for faster training and much longer videos.
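
To make the contrast concrete, the following is a minimal sketch of the two pipelines in Python. The functions compress, decompress, latent_augment, and classify are hypothetical stand-ins introduced only for illustration; in the real framework the compressor, the augmentation network, and the downstream model are learned neural networks.

```python
import numpy as np

# Illustrative stand-ins only; in the real framework these are learned networks.
def compress(video):
    # toy "compression": 4x4 spatial average pooling per frame
    t, h, w, c = video.shape
    return video.reshape(t, h // 4, 4, w // 4, 4, c).mean(axis=(2, 4))

def decompress(codes):
    # toy "decompression": nearest-neighbour upsampling back to pixel space
    return np.repeat(np.repeat(codes, 4, axis=1), 4, axis=2)

def latent_augment(codes, brightness=1.0):
    # toy augmentation applied directly to the latent codes
    return codes * brightness

def classify(x):
    # toy downstream "model": global average followed by an argmax
    return int(x.mean(axis=(0, 1, 2)).argmax())

video = np.random.rand(32, 64, 64, 3)              # a 32-frame RGB clip

# Standard pipeline: store compressed, decode, augment in pixel space, train.
codes = compress(video)                            # what sits on disk
pixels = decompress(codes)                         # expensive decode step
label_std = classify(pixels * 1.2)                 # augment + model on pixels

# Compressed-vision pipeline: stay in the compressed space end to end.
codes = latent_augment(codes, brightness=1.2)      # augment without decoding
label_cv = classify(codes)                         # model runs on the codes
```

The key difference is that the compressed-vision path never calls a decoder: codes are stored, transferred, augmented, and classified as they are.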

Qualitative Results

We compare the quality of different compression methods at high compression rates (about 200x).

JPEG

MPEG

Neural (Ours)

We can compress videos even further (about 700x).

More examples at a high compression rate (about 200x)

Learnt augmentations that operate directly in the compressed space

We can also apply transformations directly to the compressed latents without decompression. Although these transformations are performed in compressed space, we can check that they are correct by decompressing the transformed latents. Here, we show a few such transformations.

Saturation

Crop

Rotation

All the networks performing augmentations in the latent space are conditioned on arguments. In this way, we can obtain a wide range of different transformations controlled by these arguments. Below, we apply the same transformation (changing the brightness) conditioned on different brightness arguments.
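
For intuition, here is a minimal sketch of what such an argument-conditioned augmentation network might look like, written in PyTorch. The class name, shapes, and layer sizes are assumptions made for illustration and do not reproduce the paper's actual augmentation network.

```python
import torch
import torch.nn as nn

class LatentAugment(nn.Module):
    """Toy argument-conditioned augmentation acting on latent video codes."""

    def __init__(self, code_channels=64):
        super().__init__()
        # The conditioning argument (e.g. a brightness factor) is broadcast
        # over the latent grid and concatenated as an extra channel.
        self.net = nn.Sequential(
            nn.Conv3d(code_channels + 1, 128, kernel_size=1),
            nn.ReLU(),
            nn.Conv3d(128, code_channels, kernel_size=1),
        )

    def forward(self, codes, arg):
        # codes: (B, C, T, H, W) latent codes; arg: (B,) one scalar per clip.
        cond = arg.view(-1, 1, 1, 1, 1).expand(-1, 1, *codes.shape[2:])
        return self.net(torch.cat([codes, cond], dim=1))

codes = torch.randn(2, 64, 8, 16, 16)            # two compressed clips
brightness = torch.tensor([0.8, 1.4])            # different arguments per clip
augmented = LatentAugment()(codes, brightness)   # transformed latent codes
```

Feeding the same module different brightness arguments yields differently transformed codes, which is how a single network can cover a continuous range of augmentation strengths.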

Quantitative Results

Downstream accuracy. We pre-trained a neural compressor on two different datasets, Kinetics600 (K600) and Walking Tours, and trained a downstream network on K600. We evaluated both sets of neural codes on the K600 validation split to measure the final performance. Both neural compressors achieve good performance.

Competitive performance on a per-frame video understanding task. We also evaluated our pipeline on the COIN dataset, which is a per-frame classification task. Here, too, we observe strong performance when using the more efficient compressed space.

Conclusions

Competitive performance on a whole-video understanding task. Using compressed representations (CR, compression rates of 30 and 475) is very competitive with the baseline (CR~1, RGB input) for training downstream tasks. As the whole framework is modular, each component can easily be replaced: the compression method, the augmentation network, or the downstream network.

Excellent transfer. It does not matter whether we pre-train the compressor on Kinetics600 or Walking Tours: both models work well on the downstream tasks.

Efficiency gains

With a compression rate of 30x (CR~30), we can store 30 times more video content, whether new and different videos or much longer ones. We get similar gains when transferring data over a bandwidth-limited network. Because we avoid decompression, not only can we store more on disk and transfer more between network nodes, but we can also place more data on a device (such as a GPU or TPU). This enables long-video training.

There is one more nuance here. Decompression with neural networks is slow. For instance, JPEG decompression of a 32-frame video at a compression rate of about 30x takes around 0.02s, whereas neural decompression takes closer to one second. Therefore, even though a neural compressor achieves superior quality at higher compression rates, it is better to avoid decompression for improved performance. This is possible within our framework, as both augmentations and the training of a downstream network can be performed directly in the compressed space.

Results on Long Video Understanding

Finally, we show that our pipeline can be used to process videos that are 30 minutes long or even longer.

Walking Tours Dataset. As there are no publicly available datasets of long videos, we have internally collected our own dataset. It is an egocentric dataset of tourists walking in different places. It contains about 18k videos, with 1815 videos held out for evaluation. Videos are continuous (without cuts) and long (40 minutes on average). In total, the dataset amounts to 500 days of accumulated video time, much more than the accumulated video time of Kinetics600. Short clips from the dataset can be seen on the left.

Past-Future Task. We evaluated the efficiency of our framework on the Walking Tours dataset with the following task. The network observes 5-second-long video clips and stores their compressed representations in memory. Each time a query clip arrives, the network needs to determine whether that clip has been observed so far.
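
A highly simplified sketch of this setup is shown below. The real model is trained end to end; the encode_clip stand-in and the cosine-similarity lookup over stored codes are assumptions used only to illustrate the memory-and-query structure of the task.

```python
import numpy as np

def encode_clip(clip):
    # Hypothetical stand-in for the neural compressor: a flattened latent code.
    return clip.mean(axis=(1, 2)).ravel()

memory = []                                        # compressed codes of past clips

def observe(clip):
    memory.append(encode_clip(clip))

def seen_before(query_clip, threshold=0.9):
    # Compare the query code against every stored code with cosine similarity.
    q = encode_clip(query_clip)
    sims = [np.dot(q, m) / (np.linalg.norm(q) * np.linalg.norm(m) + 1e-8)
            for m in memory]
    return max(sims, default=0.0) > threshold

# A stream of 5-second clips at 25 fps (toy resolution): 125 frames each.
for _ in range(10):
    observe(np.random.rand(125, 32, 32, 3))

print(seen_before(np.random.rand(125, 32, 32, 3)))
```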

The network solves this task on 30-minute-long videos. Without compressed vision, this would not be possible.