Milestone #3

3.1 Implementation

GANtalk successfully implemented the two main pillars of the project:

  1. A secure, end-to-end encrypted video conferencing application with AI-based facial feature extraction

  2. A GAN model that takes the sectioned facial data points as input and produces an AI-generated image that predicts the user's face

The video conferencing application provides full end-to-end encryption by using WebRTC, which allows secure peer-to-peer communication. This is combined with a facial landmark extraction model that was built using TensorFlow. This model detects the key points on the user's face and generates a matrix of x, y, and z coordinates for each extracted point. There are 468 points in total, grouped by facial region; for example, the points that outline the user's mouth and eyes are each connected into closed, roughly elliptical shapes. The purpose of these (later filled) shapes is to provide accurate input for the GAN portion of the application.
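
As a concrete illustration, the sketch below shows how per-frame landmark extraction might look. It assumes MediaPipe's Face Mesh solution, which also produces 468 (x, y, z) landmarks, in place of the team's own TensorFlow model, so the library calls here are an assumption rather than GANtalk's actual code.

import cv2
import mediapipe as mp

# Face Mesh returns 468 landmarks per detected face, comparable to the
# landmark set described above.
face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=False,  # video mode: track landmarks between frames
    max_num_faces=1)

def extract_landmarks(frame_bgr):
    """Return a 468 x 3 list of (x, y, z) coordinates, or None if no face is found."""
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    results = face_mesh.process(rgb)
    if not results.multi_face_landmarks:
        return None
    points = results.multi_face_landmarks[0].landmark
    # x and y are normalized image coordinates; z is relative depth.
    return [(p.x, p.y, p.z) for p in points]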

The other part of the project is the implementation of the GAN model. The GAN was also built using TensorFlow and is a modification of the Pix2Pix model. GANtalk trained this model on stock images of people from the waist up so the model's output would be more representative of people on a video call. Pix2Pix training works on side-by-side image pairs: the unedited photo of a person on the left and, on the right, a duplicate in which each important region of the person is painted a specific color for the model to scan. The model then scans through each pair and learns which color represents which feature on a human.

For example, the value RGB(255, 0, 0) was used to color in the plain skin on a person's face (not including features like the eyes or nose). During training, the model scans the painted image's RGB(255, 0, 0) area (along with every other color's area) and maps that area onto the unedited picture on the left, learning the pixel patterns that make up the RGB(255, 0, 0) region. Thus, when a painted image is later fed through the model for prediction, the RGB(255, 0, 0) area will be predicted as plain facial skin. This process happens for every color-feature combination as the model trains on each image in the training data.
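
A minimal sketch of this side-by-side data format is shown below, following the standard TensorFlow Pix2Pix input pipeline; the file layout and normalization are assumptions about the team's setup rather than their exact code.

import tensorflow as tf

def load_pair(image_file):
    # Each training file is one image: unedited photo on the left half,
    # color-painted map on the right half.
    image = tf.io.read_file(image_file)
    image = tf.io.decode_jpeg(image)
    image = tf.cast(image, tf.float32)

    w = tf.shape(image)[1] // 2
    real_image = image[:, :w, :]     # left half: target photo
    painted_image = image[:, w:, :]  # right half: color-mapped input

    # Scale pixel values to [-1, 1], the range the Pix2Pix generator expects.
    real_image = (real_image / 127.5) - 1
    painted_image = (painted_image / 127.5) - 1
    return painted_image, real_image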

These are the facial features on which GANtalk chose to train the model as well as their corresponding color codes:


Face: (255, 0, 0) #FF0000

Hair: (0, 0, 255) #0000FF

Right Ear: (0, 255, 0) #00FF00

Left Ear: (184, 61, 186) #B83DBA

Right Eye: (255, 202, 24) #FFCA18

Left Eye: (140, 255, 251) #8CFFFB

Nose: (255, 174, 200) #FFAEC8

Upper Lip: (6, 148, 1) #069401

Lower Lip: (101, 1, 148) #650194

Left Eyebrow: (255, 132, 0) #FF8400

Right Eyebrow: (0, 149, 255) #0095FF

Neck: (32, 4, 145) #200491

Torso: (52, 230, 162) #34E6A2

Background: (255, 242, 0) #FFF200
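
For reference, the same table can be written as a small Python palette, roughly as the training and middleware scripts might encode it (an illustrative transcription, not the team's actual source):

# Feature name -> RGB color, copied from the table above.
FEATURE_COLORS = {
    "Face":          (255, 0, 0),
    "Hair":          (0, 0, 255),
    "Right Ear":     (0, 255, 0),
    "Left Ear":      (184, 61, 186),
    "Right Eye":     (255, 202, 24),
    "Left Eye":      (140, 255, 251),
    "Nose":          (255, 174, 200),
    "Upper Lip":     (6, 148, 1),
    "Lower Lip":     (101, 1, 148),
    "Left Eyebrow":  (255, 132, 0),
    "Right Eyebrow": (0, 149, 255),
    "Neck":          (32, 4, 145),
    "Torso":         (52, 230, 162),
    "Background":    (255, 242, 0),
}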


For training and tuning, a set of 40+ images was colored in manually using Microsoft Paint 3D:

For the app's implementation, however, the shapes generated by the AI facial feature extraction in the video chat app are filled in by a middleware program. This program takes the shapes from the facial point extraction and fills each one with its corresponding color. This step, perhaps surprisingly, uses no AI, since the same points always map to the same distinct facial feature. The middleware can therefore fill in the same color for each group of points for every input image that comes from the video chat app; the only difference is the distortion of the shapes, since each frame collected from the app is different.
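
A minimal sketch of this fill step is shown below. It assumes each landmark group arrives as a polygon in pixel coordinates and uses OpenCV's fillPoly with the palette listed earlier; the function and argument names are hypothetical.

import numpy as np
import cv2

def paint_feature_map(height, width, feature_polygons, palette):
    """feature_polygons: feature name -> Nx2 array of pixel points.
    palette: feature name -> (R, G, B), e.g. the FEATURE_COLORS table above."""
    canvas = np.zeros((height, width, 3), np.uint8)
    canvas[:] = palette["Background"]          # start from the background color
    # Larger regions first so smaller features drawn later sit on top of them.
    draw_order = ["Torso", "Neck", "Hair", "Face", "Right Ear", "Left Ear",
                  "Right Eye", "Left Eye", "Nose", "Upper Lip", "Lower Lip",
                  "Left Eyebrow", "Right Eyebrow"]
    for name in draw_order:
        if name in feature_polygons:           # absent features are simply skipped
            pts = feature_polygons[name].astype(np.int32)
            cv2.fillPoly(canvas, [pts], palette[name])
    return canvas                              # RGB color map fed to the GAN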

As an example of the shape distortion, people with a larger forehead will have a larger RGB(255, 0, 0) area than people with a smaller forehead. This also applies to facial features that are not visible: if the user has an eye closed and the feature extraction AI cannot recognize that part as an eye, no feature points will be generated for that area, and it will instead be labeled as normal facial skin, RGB(255, 0, 0). This phenomenon can be seen in the images that do not show certain features.

These images show examples of ears not being mapped to a color because the ears are not visible in the original image. The model will only generate a feature if it is present in the colored input image. This case was especially common with images of women with long hair, such as the above picture on the right. Since the woman's ears cannot be seen in the image, the facial feature extraction AI does not map any points to the ears, and thus the GAN is never told to predict ears in the reconstruction image.

The aforementioned reconstruction image is the output of the GAN model. The GAN takes in the lone colored image and produces the face based on each color region that it has learned to construct.
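
A rough sketch of this reconstruction step is shown below, assuming the standard Pix2Pix generator interface; the generator variable and preprocessing are assumptions rather than GANtalk's exact code.

import tensorflow as tf

def reconstruct_face(generator, painted_image):
    """painted_image: HxWx3 float tensor already scaled to [-1, 1]."""
    # Pix2Pix is conventionally run with training=True at inference time so
    # dropout and normalization layers behave as they did during training.
    prediction = generator(painted_image[tf.newaxis, ...], training=True)
    # Map the output back from [-1, 1] to [0, 1] for display or saving.
    return (prediction[0] + 1.0) / 2.0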

The above image sequences are examples of an input color mapping, the photo used to make that color mapping, and the output generated from the color mapping. The quality of these images largely depends on how many epochs the model was trained for. For those unfamiliar with machine learning, an epoch is one full pass through the training data. The above images were predicted after training for about 470 epochs. GANtalk was careful not to overfit the model on the training data. Although the training set seems small (~50 images), it produced an adequate result for the amount of time and computational power (RAM and GPU cost) spent training the model. The Pix2Pix model is also noted to produce realistic results with 30-50 images and 500-1000 epochs of training.
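
For a rough sense of scale, assuming the usual Pix2Pix batch size of 1, 470 epochs over ~50 training pairs corresponds to roughly 470 × 50 ≈ 23,500 generator and discriminator updates.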

3.2 Test

The team conducted a series of comprehensive tests of the WebRTC peer-to-peer communication and facial landmark extraction portions. Likewise, in order to measure the efficiency of the model, the team devised an algorithm to measure throughput and the overall data sent over the duration of a call. Screenshots of the application interface along with the data-gathering system can be seen in the figures below.
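
As an illustration of the kind of calculation involved (a generic sketch, not the team's actual measurement code), average throughput and total data can be derived from periodic samples of the cumulative bytes sent on a connection, such as those reported by WebRTC's statistics API:

def throughput_report(samples):
    """samples: list of (timestamp_seconds, cumulative_bytes_sent) tuples
    collected periodically during a call."""
    if len(samples) < 2:
        return 0.0, 0
    t_start, bytes_start = samples[0]
    t_end, bytes_end = samples[-1]
    total_bytes = bytes_end - bytes_start
    avg_bits_per_second = (total_bytes * 8) / (t_end - t_start)
    return avg_bits_per_second, total_bytes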

GANtalk's GAN model successfully generates faces based on the colorized input from the video chat app. These images, however, are not the highest-quality representation of the input user. The team had actually predicted this, since GANs are known to produce unstable results.

One can see the model's accuracy progression below after certain numbers of epochs on a couple of testing images:

Row 1: ~10 Epochs

Row 2: ~250 Epochs

Row 3: ~500 Epochs

Row 4: ~650 Epochs

One option to fix this lack of accuracy was to use another deep learning model known as an autoencoder. After careful planning and analysis, the team decided not to use an autoencoder, since none of the engineers had experience with autoencoders and it would not be worth spending time researching, developing, training, and integrating a completely new model with the possibility of getting similar results as before.

3.3 Teamwork

GANtalk's engineering team worked well together and produced a working product by the final deadline. Each member put in his fair share of work:

Andrew: Peer-to-Peer Video Chat App Development, Facial Feature Extraction Model Development

Corey: Data Engineering, GAN Model Development, Tuning, and Testing

Matthew: Data Engineering, GAN Model Development, Tuning, and Testing

Frank: Data Engineering, GAN Model Development, Tuning, and Testing

The team frequently (1-3 times per week) conducted stand-up meetings to discuss progress and assign tasks. Members would then work on their assigned tasks and report results at the next meeting. Roughly equal amounts of work were done individually and as a whole team.