[Google Illuminate Overview Podcast (3min)]
Sandwiched compression augments a standards-based codec with pre-processor and post-processor neural networks. Image/video content passes through the pre-processor, is encoded and then decoded by the standard codec, and is finally post-processed. The neural processors are U-Nets.
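As a minimal sketch of this data flow, the snippet below stands in identity functions for the learned U-Nets and uses Pillow's JPEG codec as the standard codec. The function names are illustrative, not the repository's API:

```python
import io
import numpy as np
from PIL import Image

def preprocess(image: np.ndarray) -> np.ndarray:
    """Stand-in for the learned pre-processor U-Net (identity here)."""
    return image

def postprocess(bottleneck: np.ndarray) -> np.ndarray:
    """Stand-in for the learned post-processor U-Net (identity here)."""
    return bottleneck

def standard_codec_roundtrip(image: np.ndarray, quality: int = 50) -> np.ndarray:
    """Encode/decode with a standard codec (JPEG via Pillow)."""
    buf = io.BytesIO()
    Image.fromarray(image).save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(buf))

def sandwich(image: np.ndarray) -> np.ndarray:
    bottleneck = preprocess(image)                   # neural pre-processing
    decoded = standard_codec_roundtrip(bottleneck)   # standard codec transport
    return postprocess(decoded)                      # neural post-processing

reconstruction = sandwich(np.zeros((64, 64, 3), dtype=np.uint8))
```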
The primary goal is to adapt the codec to data and use-cases that are outside of the codec’s design targets. Examples include:
Transporting high-resolution images/video over codecs that can only transport low-resolution.
Transporting high-bit-depth (10-, 12-bit) data over codecs that can only transport 8-bit (see the hand-crafted folding sketch after this list).
Catering to applications where the data will be evaluated under a sophisticated metric different from the codec’s native metric (typically PSNR):
LPIPS, VMAF, SSIM
Transporting texture maps that will be used to render graphical models with view/lighting dependent metrics imposed.
Compressing data that will be used in downstream computations: ARCore accomplishing SLAM using features calculated on compressed images from a wearable, calculating depth from stereo pairs, …
…
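To make the bit-depth item above concrete, here is a hand-crafted, non-learned analogue: folding a 10-bit channel into two 8-bit planes and unfolding it after transport. Note that this fixed split is fragile under lossy coding (the low plane is noise-like); the learned networks instead discover foldings whose bottlenecks survive compression. The function names are illustrative only:

```python
import numpy as np

def fold_10bit_to_8bit(x10: np.ndarray) -> np.ndarray:
    """Split a 10-bit image into two 8-bit planes (high 8 bits, low 2 bits)."""
    high = (x10 >> 2).astype(np.uint8)         # most significant 8 bits
    low = (x10 & 0x3).astype(np.uint8) << 6    # 2 LSBs, scaled toward full range
    return np.stack([high, low], axis=-1)

def unfold_8bit_to_10bit(planes: np.ndarray) -> np.ndarray:
    """Reassemble the 10-bit image from the two 8-bit planes."""
    high = planes[..., 0].astype(np.uint16) << 2
    low = planes[..., 1].astype(np.uint16) >> 6
    return high | low

x = np.random.randint(0, 1024, size=(4, 4), dtype=np.uint16)
assert np.array_equal(unfold_8bit_to_10bit(fold_10bit_to_8bit(x)), x)
```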
Pre- and post-processors are trained jointly and typically implement a message-passing strategy between them. A key advantage of the work is that it keeps the underlying compression and networking infrastructure intact.
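Joint training requires gradients to flow through the non-differentiable codec. A common device, and a simplification of the codec proxies discussed in the papers, is a differentiable stand-in such as straight-through rounding. A toy PyTorch sketch, with single conv layers standing in for the U-Nets (all names hypothetical):

```python
import torch

class QuantizationProxy(torch.nn.Module):
    """Differentiable stand-in for the codec: straight-through rounding."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass rounds; backward pass treats rounding as identity.
        return x + (torch.round(x) - x).detach()

# Tiny conv layers standing in for the pre/post U-Nets.
pre = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
post = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)
proxy = QuantizationProxy()

opt = torch.optim.Adam(list(pre.parameters()) + list(post.parameters()), lr=1e-4)

for _ in range(10):                       # toy training loop
    x = torch.rand(8, 3, 64, 64)          # stand-in image batch
    bottleneck = pre(x) * 255.0           # pre-processor output in codec range
    decoded = proxy(bottleneck) / 255.0   # differentiable "codec" round trip
    out = post(decoded)
    loss = torch.mean((out - x) ** 2)     # distortion; real training adds a rate term
    opt.zero_grad()
    loss.backward()
    opt.step()
```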
In image/video compression scenarios, a nice property of this work is that the networks must generate images/video, i.e., visual data, which the standard codecs transport. We can hence inspect these generated images/video (termed bottlenecks) to get an idea of what the networks are trying to accomplish.
Below are some results using JPEG for image compression and HEVC for video compression. (Results similar to the JPEG case are possible with HEIC/HEVC-INTRA.) Please see the papers below and the talk for further results, rate-distortion plots, and a discussion of network simplification.
HEVC video results are over ten-frame clips (IPP...). Bits-per-pixel numbers reflect the INTRA frame and mp4 overhead being amortized over only ten frames.
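For reference, the amortization is simply total bits over total pixels, so the one-time INTRA and container costs dominate over only ten frames. An illustrative calculation (made-up numbers, not measurements from the eval set):

```python
def bits_per_pixel(total_bits: int, width: int, height: int, frames: int) -> float:
    """bpp = total bits / total pixels across the clip."""
    return total_bits / (width * height * frames)

# A costly INTRA frame plus mp4 overhead amortized over only 10 frames
# inflates bpp far more than it would over a 300-frame sequence.
intra_bits, inter_bits, mp4_bits = 80_000, 9 * 8_000, 4_000
print(bits_per_pixel(intra_bits + inter_bits + mp4_bits, 352, 288, 10))
```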
Source code at: https://github.com/google/sandwiched_compression
Color over grayscale image codec: While not a practically interesting application, trying to transport color (3 channels) over grayscale JPEG (a single channel) is a good way to see qualitatively how the networks operate. Note the patterns that appear in the bottleneck that (i) survive compression and (ii) get “demodulated” back into color.
Highres (HR) over lowres (LR) image codec: This is an interesting application where the networks fold high-frequency information into, and unfold it out of, what the LR codec can transport. The sandwich obtains substantial improvements at the same rate, even over a post-processor-only U-Net.
Note the details on buildings, etc., that post-processing alone cannot recover, since there are no hints of high frequency in its input. One can think of the sandwich pre-processor as providing these hints to the sandwich post-processor in a rate-distortion-efficient manner.
Here is a picture to better see the bottleneck the pre-processor sends through the codec and what the post-processor recovers: a very aliased-looking bottleneck is decoded into a picture with pixel-accurate detail.
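A hand-crafted analogue of this folding is space-to-depth: losslessly packing an HR image into an LR multi-channel array, as sketched below. The learned pre-processor instead must produce a single LR image, so it encodes the high-frequency hints as the alias-like patterns seen in the bottleneck:

```python
import numpy as np

def space_to_depth(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold an (H, W) image into an (H/r, W/r, r*r) array, losslessly."""
    h, w = x.shape
    return (x.reshape(h // r, r, w // r, r)
             .transpose(0, 2, 1, 3)
             .reshape(h // r, w // r, r * r))

def depth_to_space(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Unfold back to the original (H, W) image."""
    h, w, _ = x.shape
    return (x.reshape(h, w, r, r)
             .transpose(0, 2, 1, 3)
             .reshape(h * r, w * r))

x = np.arange(16).reshape(4, 4)
assert np.array_equal(depth_to_space(space_to_depth(x)), x)
```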
Performance and Complexity Sweep: Note the dramatic ~+6 dB improvement (at the same rate) the system achieves in the HR/LR use case over the quite respectable U-Net post-processing (shown as post-only). A thick but shallow U-Net ([32] & [32, 32], ~100K parameters) captures most of these improvements, evaluated over 500 pictures. This corresponds to less than 1% of the parameters of the full U-Net.
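For concreteness, the sketch below shows a one-level, 32-channel U-Net in the spirit of the “[32] & [32, 32]” notation (read here as per-stage widths; this is an illustration, not the repository’s exact architecture):

```python
import torch

class ShallowUNet(torch.nn.Module):
    """One-level U-Net: encode at width 32, decode at widths [32, 32]."""
    def __init__(self, channels: int = 3, width: int = 32):
        super().__init__()
        self.enc = torch.nn.Conv2d(channels, width, 3, stride=2, padding=1)
        self.dec1 = torch.nn.ConvTranspose2d(width, width, 2, stride=2)
        self.dec2 = torch.nn.Conv2d(width + channels, width, 3, padding=1)
        self.out = torch.nn.Conv2d(width, channels, 3, padding=1)

    def forward(self, x):
        e = torch.relu(self.enc(x))
        d = torch.relu(self.dec1(e))
        d = torch.relu(self.dec2(torch.cat([d, x], dim=1)))  # skip connection
        return self.out(d)

net = ShallowUNet()
print(sum(p.numel() for p in net.parameters()))  # ~16K parameters for this sketch
```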
Color over grayscale video codec: Similar to the image/JPEG case above (cat over blue background), the sandwich manages to transport color over grayscale HEVC. On a large eval set the sandwich obtains +8 dB gains over grayscale HEVC transport.
The most important thing to observe in this ten-frame clip is that the modulation/messaging patterns the pre-processor injects are temporally coherent: they move smoothly with the scene objects they are attached to. Despite the spatial-only processing applied by the networks, the results are temporally coherent.
Highres (HR) over lowres (LR) video codec: Similar to the image/JPEG case above, the networks fold high-frequency information into and out of what the LR codec can transport, here over ten-frame HEVC clips.
As in the color case, note that the patterns the pre-processor injects remain temporally coherent, moving smoothly with the scene objects they are attached to despite the spatial-only processing of the networks.
Concentrate first on the compressed bottleneck clips above left: this is what the pre-processor has transported through LR-HEVC, now ready to be post-processed. They are shown upsampled (×2 in both dimensions) using nearest-neighbor sampling for clarity. Note how blurry and aliased they look. As the sandwich output shows, however, much of the patterned and aliased content actually carries useful information, which the post-processor decodes into a detailed clip. Compare the sandwich output to the HEVC-LR output in column 3, concentrating on the detailed areas in particular. Original clips are on the right.
Here is rate vs. distortion over the eval set.
Video codec with LPIPS: In the examples below we evaluate quality using LPIPS (averaged over the 10 frames of each clip). At the same LPIPS quality, how much does the sandwich improve the bit rate? Over a broad eval set the sandwich obtains 30% rate reductions with respect to HEVC at the same LPIPS quality. It is very hard to see quality differences between the sandwich and HEVC, and at higher quality levels the sandwich is hard to distinguish from the original.
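For reference, the clip-level metric as described (frame-wise LPIPS averaged over 10 frames) can be computed with the open-source lpips package, which expects frames scaled to [-1, 1]:

```python
import lpips
import torch

loss_fn = lpips.LPIPS(net="alex")  # AlexNet-based LPIPS

def clip_lpips(ref: torch.Tensor, rec: torch.Tensor) -> float:
    """Mean LPIPS over a (frames, 3, H, W) clip; inputs scaled to [-1, 1]."""
    with torch.no_grad():
        return loss_fn(ref, rec).mean().item()

ref = torch.rand(10, 3, 128, 128) * 2 - 1                     # stand-in original clip
rec = torch.clamp(ref + 0.05 * torch.randn_like(ref), -1, 1)  # stand-in reconstruction
print(clip_lpips(ref, rec))
```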
The three sets below show that LPIPS improvements depend on the frequency content of the picture, with the largest improvements on clips with textures and high frequencies. Blurry clips see the smallest improvements.
Here is rate vs. LPIPS quality over the eval set.
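Rate savings at equal quality, like the 30% figure above, are commonly summarized from such curves with the Bjøntegaard delta rate. A minimal sketch using the classic cubic fit of log-rate versus quality (illustrative data points, not the eval set; quality is “higher is better”, e.g., PSNR or negated LPIPS):

```python
import numpy as np

def bd_rate(rates_a, quality_a, rates_b, quality_b) -> float:
    """Average % bitrate change of B vs. A at equal quality (negative = savings)."""
    pa = np.polyfit(quality_a, np.log(rates_a), 3)  # cubic fit: log-rate vs. quality
    pb = np.polyfit(quality_b, np.log(rates_b), 3)
    lo = max(min(quality_a), min(quality_b))        # overlapping quality interval
    hi = min(max(quality_a), max(quality_b))
    ia = np.polyval(np.polyint(pa), [lo, hi])
    ib = np.polyval(np.polyint(pb), [lo, hi])
    avg_diff = ((ib[1] - ib[0]) - (ia[1] - ia[0])) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100

# Illustrative points only (quality in dB-like units, rates in bpp):
print(bd_rate([0.1, 0.2, 0.4, 0.8], [30, 33, 36, 39],
              [0.08, 0.15, 0.3, 0.6], [30, 33, 36, 39]))
```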
For details on these results, results at lower quality levels, LPIPS-hacking concerns, and more, please see the talk.
Papers:
Main reference:
O. G. Guleryuz, P. Chou, B. Isik, H. Hoppe, D. Tang, R. Du, J. Taylor, P. Davidson, S. Fanello, “Sandwiched Compression: Repurposing Standard Codecs with Neural Network Wrappers” in review, {pdf}.
Earlier conference articles:
B. Isik, O. G. Guleryuz, D. Tang, J. Taylor, P. Chou, “Sandwiched Video Compression: Efficiently Extending the Reach of Standard Codecs with Neural Wrappers,” Proc. IEEE Int’l Conf. on Image Proc. (ICIP2023), Kuala Lumpur, Malaysia, Oct. 2023, {pdf}.
O. G. Guleryuz, P. Chou, H. Hoppe, D. Tang, R. Du, P. Davidson, S. Fanello, “Sandwiched Image Compression: Increasing the resolution and dynamic range of standard codecs,” 2022 Picture Coding Symposium (PCS), San Jose, CA, Dec. 2022, {pdf}.
O. G. Guleryuz, P. Chou, H. Hoppe, D. Tang, R. Du, P. Davidson, S. Fanello, “Sandwiched Image Compression: Wrapping Neural Networks Around a Standard Codec,” Proc. IEEE Int’l Conf. on Image Proc. (ICIP2021), Anchorage, AK, Sept. 2021, {pdf}.