Optimizing Immersive Video Coding Configurations Using Deep Learning: A Case Study on TMIV
Abstract: Immersive video streaming technologies improve the Virtual Reality (VR) user experience by giving users more intuitive ways to move in simulated worlds, e.g., with the 6 Degree-of-Freedom (6DoF) interaction mode. A naive method to achieve 6DoF is to deploy cameras at all the positions and orientations that users' movements may require, which unfortunately is expensive, tedious, and inefficient. A better solution for realizing 6DoF interactions is to synthesize target views on-the-fly from a limited number of source views. While such view synthesis is enabled by the recent Test Model for Immersive Video (TMIV) codec, TMIV relies on manually composed configurations, which cannot navigate the tradeoff among video quality, decoding time, and bandwidth consumption. In this article, we study this limitation of TMIV and solve its configuration optimization problem by searching a huge configuration space for the optimal configuration. We first identify the critical parameters in the TMIV configurations. We then introduce two Neural Network (NN)-based algorithms that approach the problem from two heterogeneous angles: (i) a Convolutional Neural Network (CNN) algorithm that solves a regression problem and (ii) a Deep Reinforcement Learning (DRL) algorithm that solves a decision-making problem. We conduct both objective and subjective experiments to evaluate the CNN and DRL algorithms on two diverse datasets: an equirectangular and a perspective projection dataset. The objective evaluations reveal that both algorithms significantly outperform the default configurations. In particular, with the equirectangular (perspective) projection dataset, the proposed algorithms only require 95% (23%) of the decoding time, stream 79% (23%) of the views, and improve the utility by 6% (73%) on average. The subjective evaluations confirm that the proposed algorithms consume fewer resources while achieving Quality of Experience (QoE) comparable to the default and the optimal TMIV configurations.
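To make the quality/decoding-time/bandwidth tradeoff concrete, below is a minimal Python sketch of a utility-driven configuration search. The metric names, weights, and candidate configurations are illustrative assumptions, not TMIV parameters or the paper's actual formulation; the paper's NN-based algorithms aim to find good configurations without measuring every candidate.

```python
# Illustrative sketch only (hypothetical names and weights): score a configuration
# by a utility that rewards quality and penalizes decoding time and bandwidth,
# then pick the best candidate by exhaustive search.
from dataclasses import dataclass

@dataclass
class ConfigResult:
    quality: float      # normalized quality of the synthesized views, in [0, 1]
    decode_time: float  # decoding time, normalized by the default configuration
    bandwidth: float    # streamed-view bitrate, normalized by the default

def utility(r: ConfigResult, w_q: float = 2.0, w_t: float = 0.5, w_b: float = 0.5) -> float:
    """Higher is better: reward quality, penalize decoding time and bandwidth."""
    return w_q * r.quality - w_t * r.decode_time - w_b * r.bandwidth

def best_configuration(results: dict) -> str:
    return max(results, key=lambda name: utility(results[name]))

candidates = {
    "default":     ConfigResult(quality=0.82, decode_time=1.00, bandwidth=1.00),
    "fewer_views": ConfigResult(quality=0.80, decode_time=0.95, bandwidth=0.79),
    "aggressive":  ConfigResult(quality=0.60, decode_time=0.60, bandwidth=0.50),
}
print(best_configuration(candidates))  # -> "fewer_views" under these illustrative weights
```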
Look at Me! Correcting Eye Gaze in Live Video Communication [pdf]
Abstract: Although live video communication is widely used, it is generally less engaging than face-to-face communication because of limitations on social, emotional, and haptic feedback. Missing eye contact is one such problem, caused by the physical deviation between the screen and the camera on a device. Manipulating video frames to correct eye gaze is a solution to this problem. In this paper, we introduce a system that rotates the eyeball of a local participant before the video frame is sent to the remote side. It adopts a warping-based convolutional neural network to relocate pixels in eye regions. To improve visual quality, we minimize the L2 distance between the ground truths and the warped eyes. We also present several newly designed loss functions to help network training; they are designed to preserve the shape of eye structures and to minimize color changes around the periphery of eye regions. To evaluate the presented network and loss functions, we objectively and subjectively compared the results generated by our system and by the state-of-the-art method, DeepWarp, on two datasets. The experimental results demonstrated the effectiveness of our system. In addition, we showed that our system can perform eye gaze correction in real time on a consumer-level laptop. Given the quality and efficiency of the system, gaze correction by postprocessing through this system is a feasible solution to the problem of missing eye contact in video communication.
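The abstract describes an L2 reconstruction loss combined with shape- and color-preserving terms. The PyTorch sketch below illustrates one plausible way to combine such terms; the tensor shapes, the flow-gradient shape penalty, the boundary mask, and the weights are assumptions of this illustration, not the paper's exact loss functions.

```python
# Rough sketch, not the paper's exact losses: combine an L2 reconstruction term
# with (i) a penalty on spatial gradients of the warping flow, standing in for
# "preserve the shape of eye structures", and (ii) a boundary term standing in
# for "minimize color changes around the periphery of eye regions".
import torch
import torch.nn.functional as F

def gaze_correction_loss(warped_eye, gt_eye, input_eye, flow, boundary_mask,
                         w_shape=0.1, w_color=0.1):
    # warped_eye, gt_eye, input_eye: (N, 3, H, W); flow: (N, 2, H, W)
    # boundary_mask: (N, 1, H, W), equal to 1 on the periphery of the eye region
    l2 = F.mse_loss(warped_eye, gt_eye)
    shape = (flow[:, :, 1:, :] - flow[:, :, :-1, :]).abs().mean() + \
            (flow[:, :, :, 1:] - flow[:, :, :, :-1]).abs().mean()
    color = (boundary_mask * (warped_eye - input_eye)).abs().mean()
    return l2 + w_shape * shape + w_color * color
```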
Realizing the Real-time Gaze Redirection System with Convolutional Neural Network [pdf]
Abstract: Retaining eye contact between remote users is a critical issue in video conferencing systems because of the parallax caused by the physical distance between a screen and a camera. To address this issue, we present a real-time gaze redirection system called Flx-gaze that post-processes each video frame before sending it to the remote end. Specifically, we relocate and relight the pixels representing eyes by using a convolutional neural network (CNN). To prevent visual artifacts during manipulation, we minimize not only the L2 loss function but also four novel loss functions when training the network. Two of them retain the rigidity of eyeballs and eyelids; the other two prevent color discontinuity on the eye peripheries. By leveraging both CPU and GPU resources, our implementation achieves real-time performance (i.e., 31 frames per second). Experimental results show that the gazes redirected by our system are of high quality under this strict time constraint. We also conducted an objective evaluation of our system by measuring the peak signal-to-noise ratio (PSNR) between the real and the synthesized images.
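For reference, the PSNR used in the objective evaluation follows its standard definition; the Python sketch below assumes 8-bit images (peak value 255).

```python
# Standard PSNR between a real and a synthesized image, assuming 8-bit pixels.
import numpy as np

def psnr(real: np.ndarray, synthesized: np.ndarray, peak: float = 255.0) -> float:
    mse = np.mean((real.astype(np.float64) - synthesized.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```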
Screencast in the Wild: Performance and Limitations [pdf, poster]
Abstract: Displays without associated computing devices are increasingly popular, and the binding between computing devices and displays is no longer one-to-one but more dynamic and adaptive. Screencast technologies enable such dynamic binding over ad hoc one-hop networks or Wi-Fi access points. In this paper, we design and conduct the first detailed measurement study on the performance of state-of-the-art screencast technologies. By varying the user demands and network conditions, we find that Splashtop and Miracast outperform other screencast technologies under typical setups. Our experiments also show that the screencast technologies either: (i) do not dynamically adjust bitrate or (ii) employ a suboptimal adaptation strategy. Developers of future screencast technologies are therefore advised to pay more attention to the bitrate adaptation strategy, e.g., by leveraging a cross-layer optimization paradigm.
Screencast Dissected: Performance Measurements and Design Considerations [pdf, slides]
Abstract: Dynamic and adaptive binding between computing devices and displays is increasingly popular, and screencast technologies enable such binding over wireless networks. In this paper, we design and conduct the first detailed measurement study on the performance of the state-of-the-art screencast technologies. Several commercial and one open-source screencast technologies are considered in our detailed analysis, which leads to several insights: (i) there is no single winning screencast technology, indicating room to further enhance the screencast technologies, (ii) hardware video encoders significantly reduce the CPU usage at the expense of slightly higher GPU usage and end-to-end delay, and should be adopted in future screencast technologies, (iii) comprehensive error resilience tools are needed as wireless communication is vulnerable to packet loss, (iv) emerging video codecs designed for screen contents lead to better Quality of Experience (QoE) of screencast, and (v) rate adaptation mechanisms are critical to avoiding degraded QoE due to network dynamics. Furthermore, our measurement methodology and open-source screencast platform allow researchers and developers to quantitatively evaluate other design considerations, which will lead to optimized screencast technologies.
Toward an Adaptive Screencast Platform: Measurement and Optimization [pdf]
Abstract: The binding between computing devices and displays is becoming dynamic and adaptive, and screencast technologies enable such binding over wireless networks. In this article, we design and conduct the first detailed measurement study on the performance of the state-of-the-art screencast technologies. Several commercial and one open-source screencast technologies are considered in our detailed analysis, which leads to several insights: (i) there is no single winning screencast technology, indicating room to further enhance the screencast technologies, (ii) hardware video encoders significantly reduce the CPU usage at the expense of slightly higher GPU usage and end-to-end delay, and should be adopted in future screencast technologies, (iii) comprehensive error resilience tools are needed as wireless communication is vulnerable to packet loss, (iv) emerging video codecs designed for screen contents lead to better Quality of Experience (QoE) of screencast, and (v) rate adaptation mechanisms are critical to avoiding degraded QoE due to network dynamics. As a case study, we propose a non-intrusive yet accurate available bandwidth estimation mechanism. Real experiments demonstrate the practicality and efficiency of our proposed solution. Our measurement methodology, open-source screencast platform, and case study allow researchers and developers to quantitatively evaluate other design considerations, which will lead to optimized screencast technologies.
Is Foveated Rendering Perceivable in Virtual Reality? Exploring the Efficiency and Consistency of Quality Assessment Methods [pdf, poster]
Abstract: Foveated rendering leverages the human visual system to increase video quality under limited computing resources for Virtual Reality (VR). More specifically, it increases the frame rate and the video quality of the foveal vision by lowering the resolution of the peripheral vision. Optimizing foveated rendering systems is, however, not an easy task, because numerous parameters need to be carefully chosen, such as the number of layers, the eccentricity degrees, and the resolution of the peripheral region. Furthermore, there is no standard and efficient way to evaluate the Quality of Experience (QoE) of foveated rendering systems. In this paper, we propose a framework to compare the performance of different subjective assessment methods on foveated rendering systems. We consider two performance metrics, efficiency and consistency, based on the perceptual ratio, i.e., the probability that the foveated rendering is perceivable by users. A regression model is proposed to capture the relationship between the human-perceived quality and the foveated rendering parameters. Our comprehensive study and analysis reveal several insights: 1) there is no absolutely superior subjective assessment method, 2) subjects need more observations to confirm that foveated rendering is imperceptible than to confirm that it is perceptible, 3) subjects barely notice foveated rendering with an eccentricity degree of 7.5° or more and a peripheral region resolution of 540p or higher, and 4) QoE levels are highly dependent on individuals and scenes. Our findings are crucial for optimizing foveated rendering systems for future VR applications.
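As a rough illustration of relating the perceptual ratio to foveated rendering parameters, the sketch below fits a logistic regression on per-trial responses. The sample data and the logistic form are assumptions of this illustration, not the paper's regression model or measurements.

```python
# Illustrative sketch only: estimate the probability that foveated rendering is
# perceived, as a function of eccentricity degree and peripheral resolution.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: [eccentricity_deg, peripheral_resolution]; label 1 = perceived (toy data).
X = np.array([[5.0, 360], [5.0, 540], [7.5, 360], [7.5, 540], [10.0, 720]])
y = np.array([1, 1, 1, 0, 0])

model = LogisticRegression().fit(X, y)

# Predicted perceptual ratio for a candidate configuration.
print(model.predict_proba([[7.5, 540]])[0, 1])
```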