Final Report

Nghia Dam
Ben Kung

December 12, 2022


AV1 vs. VVC: the quest for finding the best video codec for video streaming in bandwidth-constrained, classroom environments


Abstract

Since the outbreak of the Covid-19 pandemic at the beginning of 2020, the world has adopted video streaming for classroom, enterprise and business applications at breakneck speed. A vast range of video streaming applications was introduced as a result of this trend, including Zoom, Google Meet and Microsoft Teams. At the heart of these applications are video codecs: a good choice of codec brings performance and a seamless streaming experience to users. In this project, we examine two relatively new video codecs, VVC and AV1, to determine the better candidate for bandwidth-constrained, classroom-based environments. While VVC has shown strengths such as higher encoding speed, AV1 remains the better candidate for resource-limited, low-bandwidth environments.


1. Introduction

In the modern age, video streaming has become commonplace: it accounted for over 80% of internet traffic in 2022 and is projected to keep rising. To reduce the high bandwidth demand, video codecs compress videos, making them smaller to transfer over the internet. However, as new innovations such as 4K streaming and VR continue to arise, ever larger amounts of data must be transferred. To combat the issue, companies have created stronger video codecs, which encode videos with fewer bits and at higher quality. Two such modern codecs are AV1 and VVC (Versatile Video Coding).


1.1 History of AV1 and VVC

The Alliance for Open Media was announced back in 2015. The group consisted of highly notable members, including Google, Huawei, Apple, Amazon, Netflix and more. The goal of the organization was to create a royalty-free video codec that would rival HEVC, developed by MPEG. Their motivation was MPEG increasing the royalty pricing for using its codec while holding a monopoly on the market. They therefore started developing AV1, which was finalized in 2018. The codec was built upon the foundation of previously developed codecs, including Google's VP9, Cisco's Thor, and Mozilla's Daala.


The Moving Picture Experts Group, or MPEG, has led the video codec industry since its creation back in 1988. Its codec AVC (Advanced Video Coding), or H.264, is the most commonly used video codec at present. The group continues to expand upon its work with the creation of VVC in 2020. VVC is designed to be a versatile codec that performs best for today's needs while containing the components necessary to support the technology of the future, including 360-degree video, 8K, and VR. VVC encapsulates many of the functionalities provided by HEVC (H.265, High-Efficiency Video Coding) and builds upon them.


1.2 Technical Comparison of AV1 and VVC

AV1 and VVC are the most recent video codecs created by the two main players in the industry. Both are very new and share many similarities alongside notable differences. The comparisons below draw on in-depth overviews of each codec [6][7].


1.2.1 Block Partition

AV1 and VVC are very similar in how they handle block partitioning. Both expanded upon their predecessors by going from a 64x64 block up to a 128x128 superblock, and both added the ability to create non-square, rectangular blocks. However, VVC can also partition the chroma and luma planes independently, allowing for further control.
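Both coding trees can be pictured as a recursive split of a superblock, with the encoder deciding per block whether further division pays off. The minimal Python sketch below (quadtree splits only; both codecs also allow the rectangular splits mentioned above) illustrates the idea; the should_split callback is a stand-in for the encoder's rate-distortion decision.

    def partition(x, y, size, min_size, should_split):
        """Recursively split a square block into four quadrants."""
        if size <= min_size or not should_split(x, y, size):
            return [(x, y, size)]  # leaf: coded as a single block
        half = size // 2
        leaves = []
        for dy in (0, half):
            for dx in (0, half):
                leaves += partition(x + dx, y + dy, half, min_size, should_split)
        return leaves

    # Split only the top-left quadrant of a 128x128 superblock down to 32x32.
    blocks = partition(0, 0, 128, 32, lambda x, y, s: x == 0 and y == 0)
    print(blocks)  # three 64x64 leaves plus four 32x32 leaves in the corner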

1.2.2 Intra-Frame Prediction

In terms of intra-frame prediction, VVC completely outshines AV1. Both have the basic DC and True Motion methods for creating predictions, and both can predict chroma values from luma values, a tool called Chroma from Luma in AV1 and the Cross-Component Linear Model in VVC. Palette mode was also introduced into both codecs, mapping colours to a finite set of distinct colours, which is useful for encoding digital media. In terms of directional intra prediction, however, AV1 has only 8 base directions, each of which can be offset by up to three steps of 3 degrees in either direction, creating 56 possible directional prediction modes. VVC, by contrast, has 65 possible prediction angles for square blocks. Additionally, rectangular blocks use 14 unique wide angles, and with the two possible orientations of a rectangle this creates 28 further angles. In total, VVC has 93 possible prediction angles that adapt to the shape of the block, far surpassing AV1's 56 fixed angles. On top of this, VVC has new prediction methods, the first of which is matrix-based intra prediction (MIP). MIP selects from an array of predefined transformation matrices, trained offline with a low-complexity neural network, and applies the best fit; these matrices can also help with upscaling and downscaling a frame. VVC additionally has extra filters for interpolating luma values, a set of 4-tap filters in which each predicted sample combines four reference values around a given position.
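The arithmetic behind AV1's 56 directional modes is easy to verify. A small sketch, assuming the eight nominal angle values given in the AV1 overview [7]:

    # AV1's eight nominal prediction angles, in degrees.
    BASE_ANGLES = [45, 67, 90, 113, 135, 157, 180, 203]

    # Each nominal angle may be offset by -3..+3 steps of 3 degrees.
    modes = [base + 3 * step
             for base in BASE_ANGLES
             for step in range(-3, 4)]

    assert len(modes) == 56  # 8 base angles x 7 offsets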

1.2.3 Inter-Frame Prediction

In terms of inter-frame prediction, AV1 and VVC differ considerably in their latest innovations. AV1 can pull from a large pool of 7 reference frames, 4 in the past and 3 in the future. Predictions can be mono- or bi-directional, meaning one or two reference frames are used to compute the prediction, and the two references do not need to straddle the current frame. AV1 then gathers spatial and temporal candidates using different prediction methods and combines them into a pool; the final result can pull from any of the 4 best candidates or take linear combinations of them. It also supports more complex ways of merging candidates, such as compound wedge prediction, and its predictions support affine transformations, meaning blocks can be scaled, rotated or sheared. VVC likewise supports mono- and bi-directional predictions and can create spatial and temporal candidates; however, for bi-directional predictions, one frame must come from the past while the other comes from the future. Although bi-directional prediction is simpler in VVC and its reference frame pool is smaller, VVC offers many more advanced combination tools. It includes wedge prediction and affine transformations but also supports history-based motion vectors, which use previously created motion vectors to predict the motion of a block. Two kinds of trajectory-based prediction exist, called Symmetric Motion Vector Difference and Bi-Directional Optical Flow. Finally, VVC has inter-prediction support for 360-degree video, which will be useful for VR when it becomes more mainstream.
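At its core, the compound prediction both codecs share is a blend of two motion-compensated blocks. A toy sketch follows; real encoders fetch the blocks with sub-pixel motion compensation and choose the weights by rate-distortion search.

    import numpy as np

    def bi_predict(ref_a, ref_b, w=0.5):
        """Blend two motion-compensated reference blocks into one prediction."""
        return w * ref_a + (1.0 - w) * ref_b

    past = np.full((16, 16), 100.0)    # block fetched from a past frame
    future = np.full((16, 16), 120.0)  # block fetched from a future frame
    print(bi_predict(past, future)[0, 0])  # 110.0: the simple average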

1.2.4 Transformation and entropy encoder

Transformation and entropy coding are covered in the same section since the two codecs are very similar here. In terms of transformation, both support non-square transforms and allow transform blocks ranging from 64x64 down to 2x2. With respect to entropy coding, both use adaptive arithmetic coding to encode syntax elements, with symbol probabilities stored and updated as probability models. The main difference is that VVC codes binary symbols, while AV1 codes multi-symbol alphabets of up to 16 symbols ("hexadecimal"), which is equivalent from the machine's point of view but was designed in part to avoid the MPEG patents. Coding larger alphabets also allows AV1 to encode an entire syntax element in a single operation rather than one bit at a time.
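To make the multi-symbol idea concrete, here is a toy adaptive probability update in the spirit of AV1's coder. It is a simplification of the real spec arithmetic: a 15-bit fixed-point cumulative distribution over a small alphabet, nudged toward each coded symbol.

    def update_cdf(cdf, symbol, rate=5):
        """Nudge a cumulative distribution toward an observed symbol.
        cdf[i] approximates P(symbol <= i), scaled to 32768."""
        for i in range(len(cdf) - 1):            # cdf[-1] stays pinned at 32768
            target = 32768 if i >= symbol else 0
            cdf[i] += (target - cdf[i]) >> rate  # move 1/32 of the way
        return cdf

    cdf = [8192, 16384, 24576, 32768]  # four symbols, initially uniform
    update_cdf(cdf, 2)                 # coding symbol 2 raises its probability
    print(cdf)                         # [7936, 15872, 24832, 32768]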

1.2.5 Post Processing

AV1 and VVC went in different directions with their post-processing focus. AV1 put effort into creating many new post-processing tools, whereas VVC chose to further improve one of the most fundamental components of post-processing: deblocking.

AV1 kept its basic deblocking filter from VP9 and added one new feature, the constrained directional enhancement filter, which applies a 5x5 area filter to further improve the accuracy of deblocking. VVC, on the other hand, added several new features around the deblocking filter. The first is Luma Mapping with Chroma Scaling, which remaps luma values in the loop while scaling their intensity based on the chroma values. The second is long deblocking filters, which allow the filter to consider values in a range of 16x16 for luma and 8x8 for chroma. VVC can also adapt the strength of the filter based on the luma values and can use spatial filters over a 7x7 area to smooth over artifacts.

Both codecs added an in-loop restoration feature that restores sharp edges which may have been lost to compression. Both support a Wiener filter, which takes neighbouring samples into consideration to restore edges, reduce noise, and lower the error of the output. AV1, however, has an additional feature called the self-guided projection filter, which takes a linear combination of two cheap-to-compute restoration filters.

Two new features were added to AV1 which aren't supported in VVC. The first is frame super-resolution. This feature allows AV1 to scale the bitstream based on the current bandwidth: frames can be encoded at lower resolutions and then super-resolved to higher quality, giving AV1 the ability to adapt to its environment without outside applications. The second feature is film grain synthesis. Film grain is caused by small particles on a film strip creating noise in the image; nowadays it is often added to lend a sense of age to a shot or for artistic effect. However, since film grain is random, it is hard to predict and can increase the encoded file size significantly. To combat the problem, AV1 digitally removes all film grain from a shot and stores metadata describing its properties, then adds the grain effect back in during the decoding stage.
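The film grain pipeline described above can be sketched in a few lines. This is a toy stand-in that assumes a simple box-blur denoiser and Gaussian grain; AV1's actual grain model is autoregressive with more parameters.

    import numpy as np
    from scipy.ndimage import uniform_filter

    def strip_grain(frame):
        """Encoder side: denoise, keep only grain statistics as metadata."""
        base = uniform_filter(frame, size=3)  # stand-in denoiser
        sigma = float((frame - base).std())   # tiny metadata vs. coding the grain
        return base, sigma

    def regrain(base, sigma, rng):
        """Decoder side: synthesize statistically similar grain."""
        return base + rng.normal(0.0, sigma, base.shape)

    rng = np.random.default_rng(0)
    grainy = 128.0 + rng.normal(0.0, 4.0, (64, 64))  # synthetic grainy frame
    base, sigma = strip_grain(grainy)
    restored = regrain(base, sigma, rng)  # grain is regenerated, not transmitted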


1.3 Motivation for research

Since the outbreak of the Covid-19 pandemic at the beginning of 2020, distance learning has skyrocketed into the next major trend in education. The isolation the pandemic forced created a massive push to migrate work and study online. As such, multiple platforms have been adopted to facilitate remote working and studying, including Zoom, Microsoft Teams and Google Hangouts. All of these video streaming services rely on particular video codecs for encoding their streams: both Zoom and Microsoft Teams use H.264 (AVC) [1][2], while Google Hangouts uses VP9 [3]. To maintain a seamless streaming experience for users, it is pivotal that the chosen video codec be efficient and high-performing. In various regions of the world, especially in developing countries where fast Internet access is not universal, these factors become even more relevant: latency and bandwidth can be extremely limited, so even a slight improvement in coding efficiency and performance can make a difference.


AV1 came out in 2018 and has been a center of attention ever since, with major technology companies putting in research to make use of it. One of these initiatives is the adoption of AV1 by Google Duo [4], which reported a 30% improvement on low-bandwidth Wi-Fi and cellular networks; users responded positively, with group conversations seeing an eight-fold increase in just a month. Two years later, VVC, or H.266, came out as a successor to HEVC, aiming for improved compression performance as well as better support for a broader range of applications [5]. These objectives also make VVC seem like an ideal candidate for bandwidth-constrained networks. Yet the literature contains few works that directly compare VVC and AV1. In this project, we compare AV1 and VVC with the goal of determining, or at least suggesting, the most suitable candidate for low-bandwidth environments.


1.4 Related works

Some research in the literature has highlighted the advantages of either AV1 or VVC. Among the most prominent is the work of Janusz Klink, A Method of Codec Comparison and Selection for Good Quality Video Transmission Over Limited-Bandwidth Networks [10]. In it, AV1, AVC (H.264) and HEVC (H.265) are compared to determine the best candidate for limited-bandwidth networks. The FFmpeg library was used to prepare and encode video samples in all codecs; the chosen resolution was 480p, and coding bitrates ranged from 500 kbps to 2 Mbps. One notable observation in Klink's work is that different encoding libraries output videos at bitrates slightly different from the targeted ones, so no two encoded samples may share exactly the same bitrate. Klink therefore proposed spline interpolation, a numerical approximation method, to "fit" the missing values. To maintain objectivity, both PSNR and SSIM metrics were used to measure the quality of each codec, and subjective surveys were conducted to measure user responses. Overall, AV1 dominated in both achieved video quality and encoding efficiency, indicating that it is the best choice of the three for low-bandwidth networks.
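To illustrate the interpolation step, here is a minimal sketch with hypothetical measurements, using SciPy's cubic spline. Each codec's (bitrate, PSNR) points are fitted so that both codecs can be read off at any common bitrate.

    import numpy as np
    from scipy.interpolate import CubicSpline

    # Hypothetical (achieved bitrate in kbps, PSNR in dB) points for one codec.
    bitrate = np.array([480.0, 760.0, 1010.0, 1490.0, 2050.0])
    psnr = np.array([33.1, 35.4, 36.8, 38.5, 39.9])

    curve = CubicSpline(bitrate, psnr)
    print(float(curve(1000.0)))  # PSNR estimate at an exact 1000 kbps target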


2. Experiment

The experiment’s purpose is to compare AV1 and VVC in a low-bandwidth school environment.


2.1 Test data

After deriving the idea for the experiment, phase one was to create a sample set of data that would accurately depict a school setting. To achieve this, actual live lectures given over the internet were used as test data; ours came from courses Ben has taken over the past year. Based on those lectures, the data was narrowed down to three styles of lecturing:

  • Static slides with no face cam

  • Writing on slides with no face cam

  • Static slides with a face cam

Although cases of professors writing on slides with a face cam certainly exist, the potential lecture pool did not contain any examples. After the cases were identified, examples of each were found on Echo360 via Brightspace, and 3-second sections of each lecture were recorded using OBS, as shown in Figure 1.


Figure 1: Screenshots from each test case sample

These recordings could then be downscaled to lower resolutions and framerates; a sketch of this step follows the list below. For each case, we had three versions of each sample:

  • 360p at 24 fps

  • 480p at 30 fps

  • 720p at 30 fps
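As referenced above, the variants can be produced with FFmpeg's standard scale and fps filters. A hedged sketch, with placeholder file names:

    import subprocess

    def make_variant(src, dst, height, fps):
        """Downscale a recording to a given height and frame rate."""
        subprocess.run([
            "ffmpeg", "-y", "-i", src,
            "-vf", f"scale=-2:{height},fps={fps}",  # -2 keeps the width even
            dst,
        ], check=True)

    for height, fps in [(360, 24), (480, 30), (720, 30)]:
        make_variant("lecture_sample.mp4", f"sample_{height}p.mp4", height, fps)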

2.2 Encoders

Phase two of the experiment was to get two working implementations of the encoders. The VVC encoder, vvenc, is freely available online and has an easy-to-follow installation [8]. However, it requires some supporting tooling, as it only accepts a .yuv file as input and outputs a .266 file that cannot be played back by conventional means. To create the .yuv files, FFmpeg was used to convert the test files to the correct format. To the group's knowledge, there is no straightforward way to convert the .266 file to a common format like .mp4, so the statistics output by the encoder were all that was available for comparison. These include PSNR values, computation time, frame generation rate, and bitrate.
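A sketch of the two-step VVC pipeline just described. The FFmpeg rawvideo conversion is standard; the vvencapp flags are our best reading of a typical invocation and should be checked against the vvenc documentation [8]; file names and the 480p geometry are placeholders.

    import subprocess

    # Step 1: decode the screen recording to raw 4:2:0 YUV frames.
    subprocess.run([
        "ffmpeg", "-y", "-i", "sample_480p.mp4",
        "-pix_fmt", "yuv420p", "sample_480p.yuv",
    ], check=True)

    # Step 2: feed the raw frames to the vvenc command-line encoder
    # (flag spellings assumed; consult vvencapp's help output).
    subprocess.run([
        "vvencapp", "-i", "sample_480p.yuv",
        "-s", "854x480", "-r", "30",
        "--preset", "medium",
        "-o", "sample_480p.266",
    ], check=True)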

AV1's reference implementation, libaom, is open source and free to use [9]. However, its documentation is hard to follow, so FFmpeg's version of libaom was used instead. This is much more user-friendly, as its input and output file types are common ones such as .mp4 or .mkv. Libaom outputs the bitrate, frame generation rate, and computation time, but not PSNR or any other comparison metric; an online tool was therefore used to evaluate the PSNR of the output file against the original.
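The libaom path can likewise be scripted through FFmpeg. The encode below uses the libaom-av1 wrapper's cpu-used and crf options; the PSNR comparison uses FFmpeg's psnr filter, shown here as one possible stand-in for the online tool mentioned above. File names and parameter values are placeholders.

    import subprocess

    # Encode with libaom via FFmpeg; cpu-used trades speed against quality.
    subprocess.run([
        "ffmpeg", "-y", "-i", "sample_480p.mp4",
        "-c:v", "libaom-av1", "-cpu-used", "5", "-crf", "40",
        "out_av1.mkv",
    ], check=True)

    # Compare the encode against the original; the filter prints per-plane
    # PSNR values to stderr.
    subprocess.run([
        "ffmpeg", "-i", "out_av1.mkv", "-i", "sample_480p.mp4",
        "-lavfi", "psnr", "-f", "null", "-",
    ], check=True)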

2.3 Test cases

Phase three of the experiment determined exactly what would be tested and how. The team chose two types of test cases: first, how real-time encoding constraints affect quality, and second, how output quality is affected when constrained to very low bitrates.

The real-time test cases evaluate how effectively the computers of today and the near future can encode video. For present-day encoding, the encoder had to generate frames at 30 fps, whereas for the near future it had to generate at 5 fps; given more optimized code and stronger or more specialized hardware, that 5 fps could plausibly become 30 fps. The outputs meeting these constraints were then compared using PSNR values.

The low-bitrate test case evaluates how stream quality is affected if the bitrate suddenly drops. The target was set at one third of the original bitrate; the choice of ⅓ was arbitrary, since choosing a fair bitrate for every case proved too challenging. The outputs of the two encoders were then compared against each other using PSNR values. A more rigorous approach would use a spline interpolation model as in [10], but the group ran out of time to implement this feature.

2.4 Running the experiment

Phase four involved finding a method to compare the two encoders fairly; as [7] notes, comparing codecs under different settings can lead to different results. Each codec exposes a speed setting that trades computation time against bitrate: VVC's presets range over five stages from slower to faster, and AV1's cpu_used parameter ranges from 0 to 8. As more computation time is allowed, the output takes longer to encode but comes out at a lower bitrate with similar or higher quality. After some preliminary testing, the following speed mapping was created:

  1. Slow -> cpu_used = 1

  2. Medium -> cpu_used = 2

  3. Fast -> cpu_used = 5

The first mapping was used for the low-bitrate test, as frame creation time does not matter there. The second mapping was used for the near-future real-time test, as preliminary testing showed frames could be generated at around 5 fps. The third mapping was used for the present-day real-time test, as it could usually generate 30+ fps. To allow each goal to be met, the quantization parameter was varied over its 0-63 range. Figure 2 shows the commands entered into the encoders for each test case.


Figure 2: Reference command lines used for each test case
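Since Figure 2 shows only the reference command lines, here is a hedged sketch of the search wrapped around them: step the quantizer until the encoder meets the frame-generation target. run_encode is a hypothetical helper that launches one Figure 2 command at the given QP and reports the achieved fps.

    def find_qp(run_encode, target_fps):
        """Return the lowest QP whose encode meets the fps target."""
        for qp in range(0, 64):        # low QP = higher quality, slower encode
            if run_encode(qp) >= target_fps:
                return qp
        return None                    # goal unreachable even at QP 63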

After running all test cases, the data was compiled into a spreadsheet to draw some conclusions from our findings.


3. Results

The following section discusses the results obtained for each test case. We chose to report only the PSNR-Y values, because the PSNR-U and PSNR-V values are incorrect for the transition sample. That sample contains many frames without colour, so the U and V components were uniformly zero; the VVC encoder reported the PSNR-U and PSNR-V values as maximum quality instead of as undefined, producing values well over 700. The exact data obtained can be viewed in Appendix A.
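The blow-up is a direct consequence of the PSNR definition, PSNR = 10 * log10(MAX^2 / MSE): an all-zero chroma plane reproduced exactly gives MSE = 0 and an unbounded PSNR. A minimal sketch with an explicit guard:

    import numpy as np

    def psnr(ref, test, max_val=255.0):
        """Peak signal-to-noise ratio, guarding the identical-planes case."""
        mse = np.mean((ref.astype(float) - test.astype(float)) ** 2)
        if mse == 0.0:
            return float("inf")  # identical planes: "no error", not ~700 dB
        return 10.0 * np.log10(max_val ** 2 / mse)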


3.1 Min 30 fps

The goal of this test was to achieve a frame creation rate of 30 fps. However, even at the lowest quantization value, AV1 failed to reach this goal in the 720p case, so its 720p figure would be lower still had the goal actually been met. Figure 3 shows that VVC outperformed AV1 in all three test cases in terms of PSNR-Y.


Figure 3: Graphs of the Min 30 FRC tests

3.2 Min 5 fps

This test measured how effectively the current encoders could sustain 5 fps as a benchmark for the near future. Once again, AV1 failed to reach the goal in the 720p case, so its numbers should be read as optimistic there. The results are more mixed than in the last test: AV1 achieved higher PSNR-Y levels for the face cam and transition samples, while VVC was slightly higher for the writing sample.


Figure 4: Graphs of the Min 5 FRC tests

3.3 Low-bandwidth testing

The results from low-bandwidth testing are the most important, as this scenario was the main motivation for the project. This test compared the PSNR values of the encoders at a fixed target bitrate. As Figure 5 shows, AV1 beat VVC in every test, producing higher-quality video at the same bitrate.


Figure 5: Graphs of the low bitrate tests

4. Discussion

Overall, AV1 seems to have an edge in most metrics and scenarios. However, there are some drawbacks to the experiments that need to be reviewed. First, the actual fps achieved depends heavily on the computing resources of the system doing the encoding, so different systems might produce drastically different results; in particular, a workstation with more resources is more likely to hit the targeted fps.

In this experiment, the AV1 samples and tests were run on a system with less computing power than the one used for the VVC tests, which made AV1's results look less promising than VVC's. This was verified by re-running a selected number of tests on the exact system that ran the VVC trials; those tests showed AV1 with a performance advantage over VVC.

Another possible drawback is that while Klink's work relied on professional video quality assessment software from the Elecard Company, our work had to rely on the more limited metrics provided by the FFmpeg library to derive PSNR values, owing to budget constraints. We were also unable to build a complete spline interpolation model for exact-point comparison, because of time constraints and the complexity of the mathematical model.


5. Conclusion

In conclusion, unlike AVC and HEVC, VVC did indeed show an edge in performance under computing and network environments with better resources, and it performed better in terms of encoding speed, a key element of a seamless video streaming experience. On the other hand, AV1 consistently outperformed VVC in the low-bandwidth tests. As the main objective of this project was to determine the most suitable candidate for low-bandwidth networks for remote education, AV1 remains the likely better candidate given its clear advantages in the low-bandwidth simulated tests.


References

[1] V. Sachdeva, “Zoom - video conf tool at scale,” Medium, 18-May-2020. [Online]. Available: https://medium.com/@vsachdeva/zoom-video-conf-tool-at-scale-e86289c290b8. [Accessed: 09-Dec-2022].

[2] Surbhigupta, “Real-time media calls and online meetings with Microsoft Teams - teams,” Real-time media calls and online meetings with Microsoft Teams - Teams | Microsoft Learn. [Online]. Available: https://learn.microsoft.com/en-us/microsoftteams/platform/bots/calls-and-meetings/real-time-media-concepts. [Accessed: 09-Dec-2022].

[3] C. Nguyen, “WebRTC - the technology that Powers Google Meet/Hangout, Facebook Messenger and discord,” Medium, 30-May-2020. [Online]. Available: https://medium.com/swlh/webrtc-the-technology-that-powers-google-meet-hangout-facebook-messenger-and-discord-cb926973d786. [Accessed: 09-Dec-2022].

[4] “Google duo seeing 8X surge in group calls, adding built-in screenshots and AV1 codec support,” Google. [Online]. Available: https://9to5google.com/2020/04/21/google-duo-new-features/. [Accessed: 09-Dec-2022].

[5] B. Bross, “Developments in international video coding ... - IEEE xplore,” ieeexplore.ieee.org, 01-Sep-2021. [Online]. Available: https://ieeexplore.ieee.org/document/9328514. [Accessed: 09-Dec-2022].

[6] B. Bross, “Overview of the Versatile Video Coding (VVC) Standard and its Applications,” IEEE Xplore, 01-Oct-2021. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9503377. [Accessed: 09-Dec-2022].

[7] Y. Chen, “An overview of coding tools in AV1: The first video codec from the Alliance for Open Media: APSIPA Transactions on Signal and Information Processing,” Cambridge Core, 24-Feb-2020. [Online]. Available: https://www.cambridge.org/core/journals/apsipa-transactions-on-signal-and-information-processing/article/an-overview-of-coding-tools-in-av1-the-first-video-codec-from-the-alliance-for-open-media/5972E00494363BE37E3439FAE382DB10. [Accessed: 09-Dec-2022].

[8] Fraunhoferhhi, “Fraunhoferhhi/VVENC: Fraunhofer versatile video encoder (VVENC),” GitHub, 2020. [Online]. Available: https://github.com/fraunhoferhhi/vvenc. [Accessed: 09-Dec-2022].

[9] Alliance for Open Media, “aom,” Google Git, 2018. [Online]. Available: https://aomedia.googlesource.com/aom/. [Accessed: 09-Dec-2022].

[10] J. Klink, “A method of codec comparison and selection for good quality video transmission over limited-bandwidth networks,” Sensors, vol. 21, no. 13, p. 4589, 2021.


Authors

Nghia is a 4th-year Computer Science student at UVic. He learned a lot about mathematical modelling and approximation methods for matching PSNR and SSIM values from one of the papers used for this project. He loves everything about optimization.

Ben Kung is a 4th-year Computer Science student at UVic. His focus for the project was investigating the technical details of the encoders and designing a fair experiment to compare them.

Appendix A: Link to original data results

https://docs.google.com/spreadsheets/d/1VOIr91O69KpH2sattwcLQ0rHm84TyVMc1GLSgAngMNw/edit?usp=sharing


Appendix B: Contributions to the project

Nghia: Ran AV1 samples and collected relevant data. Calculated the PSNR values for the samples. Analyzed the results and discussed flaws. Researched related works to base the project around.

Ben: Ran VVC samples and collected relevant data. Researched the technical aspects of each encoder. Designed and created the ideas and components needed for the experiment. Experimented with how to fairly compare the two encoders. Managed and updated the website.