Pakistan’s Hall of Fame

Project Abstract


The concept is to create an AR-based Android application, using image processing and machine learning techniques, that makes still images of Pakistani heroes who were part of the Independence Movement of Pakistan look like they are talking, with generated audio and lip movements synced to that audio.

Introduction


Augmented reality (AR) takes objects of the real world and 'augments' them, i.e., it overlays computer-generated content on the real world to create an enhanced sense of reality. AR is essentially a reality-based interactive environment that takes information from the world, such as audio, video, and effects, and processes it to enhance the user's experience of the world. AR has proven very effective in the learning and education sector, where teachers use it to explain course content, conduct quizzes, and so on. The idea is to develop an AR-based Android application, using image processing and machine learning techniques, that brings images to life, enhancing the learning process and providing a better learning experience regarding the Independence Movement of Pakistan. Personalities telling their own history and life's work, along with their sacrifices and achievements, makes the application attractive overall and offers a fun way to learn the history.

The main inspiration behind our project was the mainstream use of different applications by our youth. Research has found that most people spend a very significant part of the day interacting with different applications, which they find enjoyable. Building on that idea, we set out to create something that would help the youth not merely spend those hours in applications but also spend them learning something new. Our project helps people learn about Pakistan's history, culture, and roots, and gain knowledge about the people behind the formation of our country, especially when the famous personalities are themselves telling their tales.

The application begins by detecting a face in a still image and recognizing it using a neural network. Once recognized, the image is fed to a Generative Adversarial Network (GAN) as one of its two inputs. The other input is the audio file stored in the database against the ID of the recognized face. Once the audio is fed in as well, the GAN creates multiple copies of the image, based on the length of the audio, with slight variations of facial landmarks, specifically in the lip region, to make it look like the mouth is moving. In the end, it outputs a video in which the images play frame by frame, smoothly synced to the given audio file.
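As a rough illustration of the recognition-and-lookup stage, the sketch below uses the open-source face_recognition package as a stand-in (the write-up does not name the library we used); the dictionaries, paths, and the GAN stage it feeds into are placeholders.

```python
# Minimal sketch of the recognition-and-lookup stage, assuming the open-source
# face_recognition package (a stand-in; not named in the write-up).
import face_recognition

known_encodings = {}   # person_id -> 128-d face encoding, computed offline for each personality
audio_by_id = {}       # person_id -> path of the audio clip stored against that ID

def recognize(image_path):
    """Return the ID of the matched personality and the stored audio path,
    which would then be passed to the GAN stage (not shown here)."""
    image = face_recognition.load_image_file(image_path)
    encodings = face_recognition.face_encodings(image)
    if not encodings:
        return None                                    # no face detected
    query = encodings[0]
    for person_id, known in known_encodings.items():
        if face_recognition.compare_faces([known], query)[0]:
            return person_id, audio_by_id[person_id]   # audio stored against the ID
    return None                                        # face not recognized
```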

GAN

Generative adversarial networks (GANs) are an approach to generative modeling using deep learning methods such as convolutional neural networks. Generative modeling is an unsupervised learning task in machine learning that involves automatically discovering and learning the regularities or patterns in input data, in such a way that the model can be used to generate new examples that plausibly could have been drawn from the original dataset.
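The following is a minimal PyTorch sketch of the two competing networks in a GAN, included only to illustrate the generator-versus-discriminator idea; the layer sizes are illustrative and far simpler than the audio-driven video GAN used in the project.

```python
# Toy GAN structure: a generator that maps noise to a fake sample, and a
# discriminator that estimates the probability a sample is real.
import torch
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, noise_dim=100, out_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(noise_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim), nn.Tanh(),   # fake sample scaled to [-1, 1]
        )

    def forward(self, z):
        return self.net(z)

class Discriminator(nn.Module):
    def __init__(self, in_dim=784):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.LeakyReLU(0.2),
            nn.Linear(256, 1), nn.Sigmoid(),      # probability the sample is real
        )

    def forward(self, x):
        return self.net(x)

# The generator tries to fool the discriminator while the discriminator learns
# to tell generated samples from real ones; the two are trained adversarially.
z = torch.randn(8, 100)
fake = Generator()(z)
realness = Discriminator()(fake)
```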

Results

WhatsApp Video 2019-11-08 at 4.26.07 AM.mp4

Facial Recognition

The recognition step also displays the probability (confidence) of the recognized face.

output.mp4

Video Output

This is one of the initial results, with limited facial movement and noticeable unrealistic movement in the surrounding regions.

results.mov

Improved Output

This result shows landmark movement in the cheek region, which gives a more realistic appearance.

Challenges We Faced 


We faced multiple problems over the course of the project; these, along with the solutions we explored, are discussed below:

GANs are a fairly new technology, and creating and working with such convolutional neural networks requires a great deal of understanding.

The GAN we used required at least 16 GB of RAM and a GTX 1050 (or better) GPU. Since we could not run it locally, we used Google Colab, which turned out to be a better option.

According to its paper, the GAN we used requires over 500 hours of training. This excessive training demanded more than just time; the compute resources it needed were a huge hurdle.

The biggest issue we faced was porting the application to a mobile device, as the model is too big to run on such platforms. There are two possible solutions: either convert the model into a lighter one that can run on a mobile device, or serve the model remotely and respond to requests from the mobile application. Both choices have their pros and cons, as discussed below:

1. Convert the Model

The GAN model is large, which makes it hard to deploy directly on Android/iOS. To deploy the model on a mobile device, we would need to convert our PyTorch model into TensorFlow and then into a TensorFlow Lite model so that it is flexible and light enough to run on a phone, which in turn also reduces the quality of the resulting video. This is the cleaner solution: the conversion is very time-consuming, but it makes the model more efficient response-wise. The following steps can be taken to port the model to mobile (Android/iOS).

i. Choose model

The GAN model we used is PyTorch-based and has a pretrained model available. To use such a model on mobile, it needs to be re-implemented and retrained using the TensorFlow library.

ii. Train the model

Once the model has been re-implemented in TensorFlow, it needs to be retrained there: set the weights and other required parameters and train the model.

iii. .tf to .tflite

TensorFlow Lite accepts multiple input formats, such as frozen graphs and SavedModels. A minimal conversion sketch is shown below; to understand the concepts thoroughly, see the TensorFlow website.
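For example, once the generator has been retrained in TensorFlow and exported as a SavedModel, a conversion along the following lines produces a .tflite file (the directory and file names are illustrative):

```python
# Sketch of the SavedModel -> TensorFlow Lite conversion step.
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_generator/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantization to shrink the model
tflite_model = converter.convert()

with open("generator.tflite", "wb") as f:
    f.write(tflite_model)
```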

  2. Serve The Model

The other option is to serve the model remotely, but this makes responses slow. To serve the model, the following steps can be used:

i. Establish a Server

First, establish a server. We set up a server on one of our PCs running Ubuntu, using tutorials freely available on the internet.

ii. Communication 

Once the model is placed on the server, create an Android application and communicate with the server using PHP or another suitable framework.

We chose the server option and were able to get the expected results. However, if output video generation takes too long, server timeouts can become a major issue. One solution is to use WebSockets or pushers: WebSockets are persistent connections between client and server that help with timeout scenarios, while pushers help with real-time data communication. A minimal server sketch is shown below. Learn more about WebSockets and Pushers.
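As a rough sketch of the request/response version of this setup, the example below uses Python with Flask rather than PHP; the generate_video() wrapper around the GAN inference code is hypothetical. For long-running generation, a WebSocket or pusher-based channel would replace this simple request/response cycle.

```python
# Minimal sketch of serving the model behind an HTTP endpoint.
from flask import Flask, request, send_file

app = Flask(__name__)

def generate_video(image_file, person_id):
    """Hypothetical wrapper around the GAN inference code (not shown here);
    it would look up the stored audio for person_id and return a video path."""
    raise NotImplementedError

@app.route("/animate", methods=["POST"])
def animate():
    image = request.files["image"]             # still image uploaded by the Android app
    person_id = request.form["person_id"]      # ID of the recognized personality
    video_path = generate_video(image, person_id)
    return send_file(video_path, mimetype="video/mp4")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```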

Similar Work

The related work from which we drew our own idea is 19 Crimes, an application developed by a wine company for marketing purposes. The pictures on the wine bottles can be scanned through the application, and the people in those pictures tell their crimes through augmented reality. It is an iOS application; an Android-based app that serves the same purpose is Living Wine Labels. These wine applications are restricted to working only with the fixed picture on each particular wine bottle.

Crazy Talk is an AR software application that creates 2D avatars with body motion. However, it works from a still image rather than a real-time object.

Blabberize is another AR application which creates 2D avatars and has the same limitation as Crazy Talk.

SpeakPic uses a text-to-speech technique and creates lip movement, but again, it requires a supplied picture.

HippoMagic creates a model of an entire book page's scene rather than a talking model of the particular figures in it.

Poster

Final Presentation


Achievements 

Pakistan's Hall of Fame 

We secured 3rd position among 101 final year projects submitted at COMSATS University Islamabad, Lahore Campus.

DICE IET INNOVATION EVENT 2020

The project stood 1st among 350+ projects from across Pakistan and received a winning prize of one hundred thousand rupees (PKR 100,000).

Supervisor

Dr. Usama Ijaz Bajwa

Associate Head of Department / Assistant Professor, Computer Science. Research areas: Image Processing and Computer Vision (Biometrics, Medical Image Analysis, Video Analytics).

Team Members

Ata Ullah Butt

Rija Tariq Lodhi

Email: rijatariqlodhi@gmail.com

LinkedIn: Rija Tariq Lodhi

Hasham Alam