Milestone #2

Project Plan

The primary focus of GANtalk is to provide proof-of-concept data for a potential alternative to the traditional video codecs used in video calling. To accomplish this goal, we decided on a simple plan that breaks the project down into three main phases: Design, Test Builds, and Analysis. As shown in the "Task Breakdown" section below, each phase includes multiple work items that build on top of each other. Overall, the team anticipates that the majority of the working time will be dedicated to the Test Builds phase due to the steep learning curve and the complexity of its tasks. After the necessary GAN is built out, we intend to run a series of analyses on both the model and a control WebRTC video chat and cross-compare performance metrics under a variety of conditions.

Task Breakdown

  • Design

    • Model Selection

      • Compile a list of open source GANs that show potential

      • Run test cases on each GAN to determine which one is best suited for the project mission

  • Test Builds

    • Image-to-Image

      • Adjust the GAN to synthesize an output image from a given input image. The output image should closely reproduce the input image (see the inference sketch following this list).

    • Image-to-Video

      • Adjust the GAN to synthesize video frames given an input image and varying facial coordinates.

    • Video-to-Video

      • Adjust the GAN to synthesize video from a given real-time video stream taken from a computer webcam.

  • Analysis

    • Control Build

      • Build a simple peer-to-peer WebRTC video chat to act as a control for statistical comparison between video sent using H.264 and video sent using the GAN.

    • Data Acquisition

      • Measure network statistics for both the model and the control, including throughput, latency, and data rate (see the measurement sketch following this list).

    • Statistical Comparison

      • Present a comparison of the GAN method's results against the WebRTC control using various charts and graphs.
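
For the Image-to-Image test build, the sketch below illustrates the intended inference step: an input frame goes into the generator and a synthesized frame comes out. The `generator` callable and its preprocessing conventions are assumptions, since the concrete model has not been selected yet.

```python
import numpy as np

def synthesize(generator, input_image):
    """One image-to-image pass: feed an input frame to the generator and
    return the synthesized frame.

    `generator` is a placeholder for whatever pretrained model we settle
    on (e.g., a StyleGAN2-based network); its exact call signature will
    depend on that choice.
    """
    # Most models expect a batch dimension and pixel values scaled to [-1, 1].
    batch = np.expand_dims(input_image.astype(np.float32) / 127.5 - 1.0, axis=0)
    output = generator(batch)  # hypothetical forward pass
    # Undo the scaling and drop the batch dimension for display or comparison.
    return np.clip((output[0] + 1.0) * 127.5, 0, 255).astype(np.uint8)
```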
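
For the Data Acquisition step, the sketch below shows one way the raw transfer logs could be reduced to the throughput and latency figures we plan to compare. The record format and the example values are illustrative assumptions, not measurements.

```python
from statistics import mean

def summarize_transfer(records):
    """Reduce per-frame transfer records to throughput and latency figures.

    `records` is a hypothetical list of (bytes_sent, t_sent, t_received)
    tuples logged by the GAN pipeline or the WebRTC control; the real
    logging hooks will depend on the final implementations.
    """
    total_bytes = sum(size for size, _, _ in records)
    duration = max(t_recv for _, _, t_recv in records) - min(t_sent for _, t_sent, _ in records)
    throughput_bps = total_bytes * 8 / duration  # bits per second
    latency_ms = mean((t_recv - t_sent) * 1000 for _, t_sent, t_recv in records)
    return {"throughput_bps": throughput_bps, "mean_latency_ms": latency_ms}

# Illustrative example: three ~500-byte packets sent over roughly one second.
print(summarize_transfer([(512, 0.00, 0.04), (505, 0.33, 0.37), (498, 0.66, 0.71)]))
```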

Software and Hardware Specifications

  • NVIDIA GPU for training models: The CUDA library makes training significantly faster through parallel computation and is the industry standard, but it is only compatible with NVIDIA GPUs. Competing frameworks exist but do not match CUDA's performance, and many of the pre-trained models we are considering require CUDA either to run or to fine-tune.

  • Python is a definite requirement. Once we have settled on a specific model to fine-tune, we will know which version of Python is necessary, along with which version of TensorFlow and associated libraries we will use to create and run our model. The current implementation of StyleGAN2 we are using requires Python 3.6 and TensorFlow 1.x.
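
As a quick sanity check, a minimal script along the following lines can confirm that the environment matches these requirements; the calls assume the TensorFlow 1.x API noted above.

```python
import sys
import tensorflow as tf

# Interpreter and TensorFlow versions expected by the StyleGAN2 implementation
# we are using (Python 3.6 and TensorFlow 1.x).
print("Python version:    ", sys.version.split()[0])
print("TensorFlow version:", tf.__version__)

# Confirm that TensorFlow was built against CUDA and can actually see a GPU.
# tf.test.is_gpu_available() is the TensorFlow 1.x call; TensorFlow 2.x
# replaces it with tf.config.list_physical_devices("GPU").
print("Built with CUDA:   ", tf.test.is_built_with_cuda())
print("GPU available:     ", tf.test.is_gpu_available())
```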

Design Concepts

Many people have friends or relatives living far away, and most cannot afford to visit them very often. Texting or calling is a cheap way to stay in touch, but it lacks the personal connection of meeting someone face-to-face. Video calls are therefore very important: they offer a cost-effective way to talk to someone far away while still providing a personal connection. The primary bottleneck of this solution, however, is the available bandwidth. Video calls are notorious for subpar quality and frequent interruptions whenever the connection is anything less than stellar.

Our product is a software application that uses computer vision with a GAN to break down each frame of the video into a small set of data points; the receiving computer applies those data points to the previous image to synthesize the next frame. Since the video is sent as a series of data points instead of a series of full images, the amount of data that must be transmitted is drastically reduced. This should address both the quality and interruption problems in low-bandwidth environments.
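
To make the bandwidth argument concrete, the rough comparison below contrasts an uncompressed frame with a hypothetical keypoint payload. The frame resolution and the 68-landmark feature set are assumptions for illustration; the actual feature set will depend on the model we select.

```python
# Back-of-the-envelope comparison of per-frame payload sizes, assuming a
# 640x480 RGB frame and a feature set of 68 two-dimensional facial landmarks
# stored as 32-bit floats (both numbers are illustrative, not final).
frame_bytes = 640 * 480 * 3        # raw pixels: 921,600 bytes
landmark_bytes = 68 * 2 * 4        # 68 (x, y) points: 544 bytes

print(f"Raw frame:       {frame_bytes:,} bytes")
print(f"Landmark packet: {landmark_bytes:,} bytes")
print(f"Reduction:       ~{frame_bytes / landmark_bytes:,.0f}x")
```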

System Diagram and Process Flowchart

Test Plan

Testing the success of our model will be done in multiple phases. The first phase covers image-to-image translation accuracy: the main role of the model is to synthesize an image after facial movement, given a feature set and the original image, so accuracy will be tested by taking a video and computing the loss between each synthesized image and the corresponding ground-truth frame from that video. The second phase will test how quickly our model can synthesize images, since the speed of the model directly determines the frames per second (FPS) of the output video. The final test will require real-time video-to-video operation, where we will measure the video frame rate as a proxy for quality.
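
The sketch below outlines how the first two phases could be scored in practice: the mean per-frame loss against the ground-truth video frames, and the effective synthesis frame rate. The generator interface, the feature format, and the use of an L1 loss are assumptions that will be revisited once the model is finalized.

```python
import time
import numpy as np

def evaluate_clip(generator, feature_sets, ground_truth_frames):
    """Score one test clip: mean per-frame reconstruction loss and synthesis FPS.

    The generator interface, the feature format, and the choice of an L1 loss
    are placeholders until we finalize the model.
    """
    losses = []
    start = time.perf_counter()
    for features, truth in zip(feature_sets, ground_truth_frames):
        synthesized = generator(features)  # hypothetical model call
        losses.append(np.abs(synthesized.astype(np.float32)
                             - truth.astype(np.float32)).mean())
    elapsed = time.perf_counter() - start
    return {"mean_l1_loss": float(np.mean(losses)),
            "fps": len(ground_truth_frames) / elapsed}
```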