Objectives
Understand basic machine learning concepts and apply them to real-world problems.
Learn how to use Linux for personal projects and remote applications on AWS / Azure.
Gain familiarity with Python and the TensorFlow/Keras module for scientific computing purposes.
Understand the theory behind deep learning and how the networks function.
Gain a deeper understanding of how CNNs work and why their architecture makes sense from a neural viewpoint.
Learn how to build (convolutional) neural networks in Keras.
Gain familiarity with how web servers work and use the Flask module to create a web server for remote querying.
Requirements
Laptop (with SSH capabilities)
Raspberry Pi
Keyboard
Mouse
External Monitor + HDMI Cable
Micro USB Power Adapter
Raspberry Pi Camera
Memory Card Reader (for flashing images)
Repository Link
The face recognition project is aimed at students who have an interest in neural networks but have not had a chance to implement them for a specific use case. Students will implement a face recognition IoT module and gain experience working with Python, Keras, and OpenCV to further solidify their machine learning background. Although the project focuses on an implementation of a popular CNN (Convolutional Neural Network), the intuition behind many other networks stems from the same ideas. We hope that you will find the following information useful in your future career.
Sincerely,
FR Team
There are three major phases to this project. The first phase is intended to provide enough background experience in working with Python and the various modules commonly used for machine learning tasks. The second then takes this knowledge and builds upon it by incorporating cloud computing for the purpose of setting up the face recognition system. Finally, the last phase takes all of the ideas from before and tasks students with building a face recognition system from the ground up that allows for remote queries.
More specifically, in the first phase, you will focus on learning the core concepts of machine learning and why neural networks work. You will then learn about convolutional neural networks and why they are functionally different from a vanilla neural network. After this, you will gain experience with Python and a couple of libraries by writing a few scripts that incorporate PiCamera usage. This will form the baseline knowledge expected for the other projects. While the Python introduction is not comprehensive, you are strongly encouraged to practice writing scripts on your own time (or the later modules might become a bit difficult).
The second and third phases are heavily related to each other, so they will be explained together. In the second phase, you will collect images of yourself using the code written earlier and incorporate them into a dataset for use with the CNN. You will then split this dataset using a script you design in order to generate three different subsets of the dataset. Finally, you will create a simple CNN using the Keras module so that you become familiar with the programming paradigm associated with Keras. The third phase extends this by tasking you with implementing a larger, much more popular network, VGG16, for use with your created subsets. You will "train" this network and then incorporate it into a system that allows for remote face querying. While the remote portion is largely provided, you are encouraged to look over the code as the Flask module used is simple and intuitive to understand.
If you are planning to use Azure instead of AWS for whatever reason, be sure to use the Azure environment setup instead of the following default AWS setup. There are quite a few differences, but overall the setup is equivalent to a more verbose AWS instance setup. Take care when choosing an OS for the virtual machine in Azure, as it gives you more control over the system configuration than an AMI does.
In this part, we set up both the RPi and (optionally) the student's personal computer in order to create the environment necessary for the project to succeed. In an effort to make sure that this page does not become cluttered with installation information, take a look here for what is required on the Raspberry Pi and a method of installation that works on Linux and Windows (local setup on a student's PC is not strictly required).
Note that installation on a Raspberry Pi (RPi) can take some time as the software needed can be a bit large and data transfer rates on a micro SD card are quite limited. This means that you should feel free to progress in the project while setting up the RPi. This will ensure that the least amount of time is wasted while waiting for software to be downloaded and copied.
As a final step, make sure that you duplicate the repository using your own Github account either by manually cloning and adjusting the remote or by using the following Github page in order to import the Face Recognition repository. Simply paste the original repository link into the URL and name the project whatever you like. Making a private repository would be preferred in this scenario as this project is meant to be a starting point only.
This next part consists of teaching students how to use either the Azure or AWS EC2 interface in order to provision an instance that can run TensorFlow + Keras on Python 3. Follow any of the hyperlinks provided in the last sentence depending on the cloud provider of your choice and print the current version of Keras on the terminal screen. This can be done with the following lines:
python3                        # start the Python 3 interpreter on the instance
from tensorflow import keras   # Keras ships with TensorFlow
print(keras.__version__)       # should print 2.0 or higher
Once you are sure that the version of Keras is indeed 2.0 or higher, you may begin with the rest of the Face Recognition project. Note that the rest of the project will be divided into several stages, the first of which involves a thorough introduction to Python and the theory behind neural networks (and their relationship to CNNs). Since you need access to your own Github repository at all points, you will need to diligently manage your repository as you continue with this project.
From this point on, good luck!
Please read all of the required material from the Intro to Python sub-page here and then come back to complete the following assignment.
As proof that you have finished this section, implement a Python script that prints the current time once per second for ten seconds and then terminates with a "Goodbye, World". Additionally, this program should only produce output when called through the terminal. That is, if you were to open up the Python interpreter and import your script, you should receive no output.
For this exercise, I suggest looking at the datetime module. You are free to print the time any way you want (including Unix time if desired). As for the extra constraint of running only in the terminal, I suggest looking at this Stack Overflow page on the particular construct to use. Feel free to ask any questions regarding either the API or the construct.
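A minimal sketch of one possible solution follows. It assumes the standard datetime and time modules and uses the __main__ guard mentioned above; the function name and time format are arbitrary choices.

import time
from datetime import datetime

def print_time_for(seconds=10):
    # print the current time once per second for the given number of seconds
    for _ in range(seconds):
        print(datetime.now())
        time.sleep(1)
    print("Goodbye, World")

# the __main__ guard ensures nothing runs when the script is merely imported
if __name__ == "__main__":
    print_time_for(10)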
One important part of this project requires the shuffling of files into random directories. The purpose of this is to ensure that our network has a training, validation, and testing set that all cover sufficient variability to properly describe performance on a potential unseen dataset. For this part, write a Python script that takes the original images found in the /images directory and randomly distributes them into training, validation, and testing datasets with the corresponding percentages. Note that the input dataset is already organized by classes. Since all the new datasets will still be organized in the same way, the new script should simply copy the old files into the three new datasets generated under the /Data/Train, /Data/Validation, and /Data/Test directories. For convenience, name this script 'dataSplitter.py' and place it wherever you like (notice that the base is in the /Project1 directory). The output file structure required can be visualized in the image that follows.
In order to assist you in this task, there are a few functions which could be of use when creating this script. They are as follows (most of these come from built-in libraries that can easily be found in the official Python documentation!):
glob.glob
The glob module provides Linux globbing features as a module. This involves the use of wildcards for filename matching.
os.path
os provides manipulation tools for path queries that include path concatenation, path splitting, and file path checking.
shutil.copyfile
shutil provides general tools for manipulating files directly such as moving and copying. shutil.copyfile copies a file from one path to another.
os.[listdir/makedirs]
The os package provides many general tools for file manipulation and creation. The two most useful in this scenario are the makedirs and listdir functions, which create directories and list all files under a directory, respectively.
While the order of the data being read doesn't matter since it is shuffled every epoch during model training, there should be some randomization (per directory) when it comes to the choice of images placed in each dataset. In other words, the randomization should take place per directory as opposed to randomly shuffling all images and then sending them to the proper directory. The reason for this is that this particular dataset is ordered by the degree of light which each face experiences. As a result, a covariate shift could occur and reduce the performance of the trained model. We would like to ensure that this is not the case, so make sure that the way the images are distributed is random within each class!
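The following is a minimal sketch of one way dataSplitter.py could be structured, assuming a 70/15/15 split and the directory names given above; the split percentages are placeholders for whatever your project specifies.

import glob
import os
import random
import shutil

SRC = "images"
SPLITS = [("Data/Train", 0.70), ("Data/Validation", 0.15), ("Data/Test", 0.15)]

for class_dir in os.listdir(SRC):                       # one sub-directory per class
    files = glob.glob(os.path.join(SRC, class_dir, "*"))
    random.shuffle(files)                               # randomize within each class
    start = 0
    for i, (split_name, fraction) in enumerate(SPLITS):
        # the last split takes whatever remains so rounding never drops a file
        end = len(files) if i == len(SPLITS) - 1 else start + int(fraction * len(files))
        dest = os.path.join(split_name, class_dir)
        os.makedirs(dest, exist_ok=True)
        for path in files[start:end]:
            shutil.copyfile(path, os.path.join(dest, os.path.basename(path)))
        start = end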
After finishing, use the provided /datasetCheck.py script in order to verify that your directory has properly split the files between the three datasets. The script has a help page that can be easily seen by running the script with no arguments. Once the checker has verified that the files have been properly copied into the dataset directories, feel free to continue. Don't worry about your own images yet; you will rerun the script once you have incorporated your own set of images into the dataset.
In this section, we explore the OpenCV module and use it to manipulate an image and interface with the camera to localize faces. For those curious, we will be using a Haar cascade composed of pre-trained weak classifiers generated from Haar features. Note that although the purpose of this project is to recognize faces, we do not ask the network to localize them first. Because the network cannot localize faces on its own, we use the Haar cascade as a sort of region proposal component for the network.
OpenCV is an open-source computer vision Python module. It contains some very basic and some not so basic algorithms for image manipulation, object localization/detection, and optimization. It also happens to include support for camera usage; however, since the RPi camera functions slightly differently from normal cameras, it is not supported by the OpenCV libraries natively. In order to get around this, we will be using another module that works very similarly to the native library.
We first seek to get accustomed to the API and the nature of its return values by working on some basic image manipulation tasks. For the first assignment, use the template of Geisel Library in the project repository as a starting point. First, display the image of Geisel and reshape it so that it is 50% of its original size. Next, draw a bounding box over some part of the building (it does not need to be precise). Finally, save the output image as "geiselBox.jpg". You do not need to put too much effort into placing the box over the entirety of Geisel or whatever part you would like boxed. The purpose of this section is simply to introduce you to some basic functions and how positioning works in OpenCV. Make sure your output looks similar to what follows:
The starter code can be found in /project1/process_geisel.py. The instructions contained within the file are the same as those listed here. It is placed there mostly for your convenience. (That will be the case for most of these future starter files.)
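For reference, a minimal sketch of the required steps is shown below; the template file name and box coordinates are assumptions, so match them to the actual starter file and whatever region you choose.

import cv2

img = cv2.imread("geisel.jpg")                         # load the template image
h, w = img.shape[:2]
small = cv2.resize(img, (w // 2, h // 2))              # 50% of the original size
cv2.rectangle(small, (50, 50), (250, 200), (0, 0, 255), 2)  # rough box from (x1, y1) to (x2, y2)
cv2.imshow("Geisel", small)                            # display the result
cv2.waitKey(0)
cv2.destroyAllWindows()
cv2.imwrite("geiselBox.jpg", small)                    # save the output image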
We will be using the PiCamera module in order to interface with the camera on the Raspberry Pi. By default, any version of Raspbian comes with the PiCamera Python module. If you cannot import the module properly, consult with the tutor/TA as something may be wrong with either the flashed image or the installation. Feel free to consult the PiCamera API if at any point you have any questions.
For this exercise, please implement a Python script that displays the output of the PiCamera along with a rectangle that encompasses most of the image. Note that most of the code is already provided in the project repository as /project2/process_realtime.py. This task is mostly to ensure that the camera is functioning properly and for students to become accustomed to the API. If no output is seen and the camera is correctly connected as in the following image, call a TA as it might be the case that either the camera and/or the RPi connector has been damaged. The resolution for the camera is set by default to (640, 480), but feel free to change it to any size for initial testing, as the face detection algorithm's runtime is dependent on image size. The only constraint enforced on the camera resolution is that its aspect ratio remains 4:3.
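A minimal sketch of the camera loop is shown below, assuming the picamera module and its PiRGBArray helper; the rectangle coordinates and the q-to-quit key are arbitrary choices.

import cv2
from picamera import PiCamera
from picamera.array import PiRGBArray

camera = PiCamera()
camera.resolution = (640, 480)
raw = PiRGBArray(camera, size=camera.resolution)

for frame in camera.capture_continuous(raw, format="bgr", use_video_port=True):
    image = frame.array
    h, w = image.shape[:2]
    cv2.rectangle(image, (20, 20), (w - 20, h - 20), (0, 255, 0), 2)  # covers most of the frame
    cv2.imshow("PiCamera", image)
    raw.truncate(0)                        # clear the buffer for the next frame
    if cv2.waitKey(1) & 0xFF == ord("q"):  # press q to quit
        break

cv2.destroyAllWindows()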
With a very minimal introduction to the OpenCV API and knowledge of how to process an RPi camera stream, the next task is to localize faces within an image. To do this, we will be following this guide to understand how the Haar cascade classifier is initialized and used in order to detect faces. Feel free to read up on the theory of how the algorithm works, but it is not required that you understand how the weights were generated. It will be used as a starting point for the discussion about neural networks in a future section, but it will not be covered in depth. For now, focus on the implementation of the face detection algorithm.
For this next assignment, use the camera stream to detect any faces and, if possible, draw a rectangle over the entire proposed face region. In general, the process for determining whether a face exists would be as follows:
Set up the classifier and load the weights for the face classifier
While the camera feed generates images:
Run the cascade classifier on the image to generate proposed face regions.
For every proposed face region:
Draw a rectangle in the part of the image that corresponds to the face.
Display the image.
Please modify your process_realtime.py file in order to follow this algorithm (call the new file face_detector.py). If at any point a face is upright in a frame, you should ideally see a rectangle around the face. If you still haven't been checked off for a previous part of the assignment, then feel free to make a copy and modify the copy instead.
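A minimal face_detector.py sketch following the algorithm above is shown here. It assumes OpenCV's bundled pre-trained frontal-face cascade (haarcascade_frontalface_default.xml) and the camera loop from the previous section.

import cv2
from picamera import PiCamera
from picamera.array import PiRGBArray

# load the pre-trained frontal-face Haar cascade that ships with OpenCV
face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

camera = PiCamera()
camera.resolution = (640, 480)
raw = PiRGBArray(camera, size=camera.resolution)

for frame in camera.capture_continuous(raw, format="bgr", use_video_port=True):
    image = frame.array
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)       # the cascade expects grayscale
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    for (x, y, w, h) in faces:                           # one rectangle per proposed face region
        cv2.rectangle(image, (x, y), (x + w, y + h), (255, 0, 0), 2)
    cv2.imshow("Faces", image)
    raw.truncate(0)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break

cv2.destroyAllWindows()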
With the above completed, it is easy to extend the system to now save pictures of your face. Recall that the purpose of this project is to create a remotely query-able recognition system for people who have been seen before. That is, we are guaranteeing that the person seen has been seen before and is thus present in the training set for the network (the problem of identifying new people as they come in is a completely different problem in machine learning!).
Identifying ourselves, then, requires a sufficiently large dataset of our own faces to train the network with. For this section, make a copy of the face_detector.py file and modify the code in order to save cropped grayscale images of your face in different poses. This will naturally require a minor delay between snapshots in order to ensure the images are different enough to capture a sufficient degree of intraclass variation for your and your partner's faces. [On the topic of sufficiency, about 100-200 images is usually enough to capture this variability with a moderate snapshot delay.] Furthermore, since the faces in the original dataset are squares of 200x200 pixels and larger, you are also expected to have images at least this large for your entire dataset.
Finally, one of the most important things to recognize is that the network itself was trained on square images. Since the warping of training images may actually harm the generalization procedure, it is required that students modify the aspect ratio of the saved image to be 1:1 (that is, your image must be a square). You may use any method you like in order to enforce this requirement, but remember that the face should cover a significant portion of the image in order to be better interpreted by the neural network due to the nature of the provided dataset. An example of a properly formatted image given an input frame (in this case, the Lenna image) can be seen below [notice that the image is extracted with equivalent height and width for the region of interest]:
All of these images should be saved inside two new directories within the /images/ directory: /images/17 and /images/18. These will represent the face classes for you and your partner. The images within the directories can be saved however the group wants, but an incremental numbering scheme will obviously be easiest to implement for the saving. Note that this means that you will need to keep track of how many images you have currently saved for the session! Take care with how you manage your variables; otherwise you will need to include global variable declarations in your functions.
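The following is a minimal sketch of a square-crop helper under the assumptions above; the save directory, the 1.3x margin, and the file naming are placeholder choices, and the snapshot delay (for example, a time.sleep between saves) is left to you.

import os
import cv2

SAVE_DIR = "images/17"                 # change to your own class directory
os.makedirs(SAVE_DIR, exist_ok=True)

def save_square_face(gray, x, y, w, h, count):
    # crop a square region centered on the detected face, with a small margin
    side = int(1.3 * max(w, h))
    cx, cy = x + w // 2, y + h // 2
    x0, y0 = cx - side // 2, cy - side // 2
    if x0 < 0 or y0 < 0 or x0 + side > gray.shape[1] or y0 + side > gray.shape[0]:
        return count                   # skip crops that would fall outside the frame
    if side < 200:
        return count                   # enforce the 200x200 minimum size
    crop = gray[y0:y0 + side, x0:x0 + side]
    cv2.imwrite(os.path.join(SAVE_DIR, str(count) + ".jpg"), crop)
    return count + 1                   # caller tracks how many images have been saved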
In the following section, we will cover the generation of the dataset for use in the face recognition project. In order to ease the process of data collection (and prevent some nasty bugs from showing up as a result of an ill-formed dataset), the repository provides a starter dataset to which students can append their personal images. Note that some intuition will be given for dataset generation, and students will be expected to look at the possible datasets and be able to explain why the given dataset works for this project.
Note that the machine learning methods covered in this project (Haar cascade classifier / neural networks) are supervised machine learning techniques. Supervised learning techniques are models that use both the input and measured output in order to train a model. To be more precise, these models are discriminative learning techniques since they try to model the probability of the output directly from the input. In probabilistic terms, this amounts to estimating the conditional distribution of the outputs with respect to the input, P(Y|X;Θ), through some unspecified learned feature function, Φ(X,Y;Θ). It is for this reason that the dataset chosen is extremely important. Since this function's parameters, Θ, are estimated directly with respect to a given set of inputs and outputs, the inputs and outputs must be representative enough to properly form a satisfactory estimate of the conditional distribution through Φ (which depends on Θ). Sadly, in most practical applications, the outputs are often acquired through measurements that incorporate noise. Furthermore, we often have no idea of the form of Φ, so choosing a particular function type along with an unknown parameter set is even more difficult. To narrow things down, we will be considering how datasets affect the multi-class classification problem directly. All this means is that we enforce that the cardinality of the set of possible classes Y be greater than one, or |Y|>1, and that Y be a set of discrete labels each representing a mutually exclusive class (that is, there can be no overlap between classes).
So, with the above, we have motivated the discussion of the choice of datasets for our supervised classification task. But what are some things that are intrinsic to the dataset that are required in order to successfully train a multi-class classification model? There should be three things considered for most datasets assuming the data collected cannot be completely described by the measurements (or images, in this case):
Intra-Class Variation
Inter-Class Variation
Dataset Biases
In order to get a better understanding of the larger picture, each of these terms will be discussed along with their implications. In order to have a sufficient amount of intra-class variability, the classes themselves must have a sufficient amount of variation to describe the classes individually. For example, in our problem of classifying faces, intra-class variation would be represented through varying light exposure, face position, face pose, negative space details, and so on. Essentially, anything that makes one picture of your face different from another picture of your face is an attribute that would contribute to intra-class variation. Without a sufficient level of intra-class variability, something known as covariate shift (the training dataset distribution is not the same as the testing or real-world distribution) could occur when evaluating, and the testing performance of the model will be low. Consider an example of this in Fig. 4 for a single class in a 2D space, where the given ellipse represents the true domain space:
In order to have a sufficient amount of inter-class variability, the classes should have a sufficient amount of variation to describe each class with respect to one another. In other words, the features that would help a typical person distinguish one face from another would contribute to this inter-class variation. For our particular task, this should ideally only be dependent on facial structure. With an insufficient amount of inter-class variability, it is easy to see that classes will become indistinguishable from each other and the model will perform very poorly. Note that in practice this often overlaps with some attributes that would normally contribute to intra-class variability since our datasets are not infinitely large. Consider an example of this in the below Fig. 5 for a two-class classification problem using the class above with varying levels of inter-class variation.
Note that while we discuss the concept of having a "sufficient" amount of variation, statistically we mean something different. In fact, when considering inter- and intra-class variation, while we want the general variation between images in a class to be "visually sufficient", in practice, the images are still of the same class and thus they should all be relatively contained (that is, the general variance of images within the class is not too large). Furthermore, notice that while neural networks are not sensitive to a shift in priors (that is, all classes are weighted equally), an overall class imbalance could lead to misleading results when interpreting a model. This should become obvious with a two-class problem with a severe imbalance. The overall accuracy may be deceptively high, but the class-wise performance will be lacking (as is usually evident through some other metric like the BER or F score). These problems are often solved through some sort of weighting of the loss function so as to emphasize particular samples, but that will not be covered here. Just know that heavy class imbalances are a practical problem that still occurs in current applications of neural networks, and it is the job of the ML engineer to take these issues into account before and during model testing.
Finally, one thing that must always be considered is dataset/model bias. These are biases that are intrinsic to the classes or perhaps intrinsic to the model due to the overall bias in a dataset. These biases often form due to some form of insufficient inter-class variance, but this need not be the case. Consider the following toy problem for an example of the issue. Suppose that as the entry-level engineer for a company, you are tasked with classifying incoming images as doctors or nurses. Suppose that the dataset chosen is old, and, as such, doctors are more often male and nurses are more often female. Once the model has been trained and deployed, whenever a female doctor is passed for classification, the system is more likely to respond with an incorrect classification. This is a form of label bias and thus cannot be corrected without an adjustment of the initial dataset or modification of the model. There are many forms of bias, but data-dependent bias is perhaps the most difficult to correct due to its (more often than not) subtle side-effects. Note that there are some forms of bias that can be mitigated without the need for further data collection. This is often done by normalizing the data as a preprocessing step, making sure to incorporate augmented data in order to make the dataset more representative, and/or modifying model hyperparameters to reduce the effects of bias. Just remember that not taking model and dataset biases into account can lead to a covariate shift relative to the original dataset and possibly produce results that are harmful to the end-user.
While an initial dataset is provided to you (the source of which can be found here), you should be able to understand the differences between each candidate dataset and, based on the discussion above, how training on each may affect the resulting model. At this step, no specific knowledge of neural networks is required, but a general understanding of machine learning and how models are generated from datasets will be helpful. For this step, find your own dataset of faces and describe how you would preprocess future images for the purpose of dataset expansion. Explain why you made certain choices, and, if you are unsure of an algorithm that accomplishes a task required for preprocessing, then explain it as a black box to the best of your abilities (explain the inputs and outputs in as detailed a manner as possible).
In case you are unsure where to start, take a look at this compiled list of useful datasets for the purpose of performing face recognition tasks. You may notice that there are many constraints that are applied to the datasets collected (or they may be extremely loose with their requirements!). Every dataset is unique so I suggest choosing one and investigating the way it was created in order to make sure your preprocessing logic makes sense. Keep in mind that this isn't the only set of datasets you can use. There are many more lists compiled by researchers that can be considered.
Once you have reasoned out your choice of dataset and preprocessing technique, call the current tutor/TA to sign off for this step. You should also proceed to generate your final training, validation, and testing datasets using the 'dataSplitter.py' script you made.
Although dataset collection can be made difficult given the context above, there are times when there simply is not enough data that can be collected in order to form a proper dataset. For example, consider a classification network that must be able to distinguish between two different rare marine species. It is also known that these species are symmetric in nature and are more often than not oriented in a specific position along the ocean floor. The only problem is that between both species, you only have approximately fifty example images all taken from a stationary camera. In a situation like this, there seems to be a large problem as new data would be prohibitively difficult to collect, and it is very unlikely that coordination with others would yield a larger number of images. In situations like these, however, data augmentation can significantly improve model performance. Data augmentation is the process of adding "artificial" data to a dataset by modifying available data through class-preserving transformations.
The most important thing to understand is what is meant by a "class-preserving transformation". In general, transformations can take on many forms for images (the domain which we are interested in), but the issue is determining which of these transformations will preserve the class attributes. It is generally possible to augment image-based datasets via rotation, scaling, translation, flipping, and cropping. Which of these (among others) are chosen will depend on the dataset. For example, a dataset that consists of classifying upright human faces will benefit from selective cropping, scaling, and translation, as these augmented images would be admissible for a dataset of this type. However, transformations such as rotations would not make sense as all faces must be upright, and flipping may be inappropriate due to the fact that faces are not symmetric and a mirrored face may not reflect the same person. However, for a dataset that seeks to classify various symmetric plankton, most of these transformations would help as these organisms can be found in almost any orientation! Learning to choose which data augmentation works best for your network is a critical part of designing a neural network for any task.
If you so desire, feel free to generate an augmented version of the dataset you are planning to use for your network. Augmentation can either be performed during training (online) or before training. Generally, augmented datasets are generated before training if sufficient space is available, but online augmentation can be even more expressive given particular datasets and preserve hard disk space when considering extremely large datasets [at the expense of extra computations and possibly unrepresentative data]. TensorFlow supports online data augmentation through its data loader class, but it requires a bit more knowledge on how generators work in conjunction with TF's batching operations. For this project, generating the augmented dataset before training is sufficient.
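As an illustration, here is a minimal offline-augmentation sketch using Keras' ImageDataGenerator, assuming the /Data/Train layout from earlier; the transformation ranges, output directory, and number of batches written are placeholder choices.

import os
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    width_shift_range=0.1,     # small translations
    height_shift_range=0.1,
    zoom_range=0.1)            # mild scaling

out_dir = "Data/TrainAugmented"
os.makedirs(out_dir, exist_ok=True)

# flow_from_directory reads class sub-directories; save_to_dir writes augmented copies to disk
# (note that save_to_dir flattens all classes into one folder, so in practice you would run
# this per class directory in order to keep the labels)
generator = augmenter.flow_from_directory(
    "Data/Train",
    target_size=(224, 224),
    color_mode="grayscale",
    batch_size=32,
    save_to_dir=out_dir,
    save_format="jpg")

for _ in range(10):            # each call writes one augmented batch to disk
    next(generator)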
The following section covers a significant part of the project. In this section, students will first read up on the general theory behind neural networks. An oral examination will follow after. Students will then build two different neural networks using the Keras functional API. The first network will be built from scratch using the lower level functional API in order to ensure students understand how larger networks are built. The second network built will work off of a pre-built VGG16 network provided by Keras. Training will be performed on this network for the purpose of face recognition.
Please navigate to the following introduction from the CS231n class before advancing. While the assignment can be completed without prior knowledge of how neural networks work, the intuition behind the process will remain vague and unknown without some background in the subject. You should read over the following parts in order to get a good idea of how neural nets work:
Module 1
Optimization: Stochastic Gradient Descent
Backpropagation, Intuitions
Neural Networks Part 1, 2, 3
Module 2
CNN: Architectures, Convolution / Pooling Layers
Transfer Learning & Fine-tuning CNNs
You are free to read the other sections, but they are a bit less important in terms of their relevance to CNNs. Note that although Module 1 covers mostly MLP networks, the ideas easily extend to the CNN scenario, as the only difference between the two networks is the way that neurons are connected to one another and how the weights are organized with respect to these neural connections. Note that you may want to be a bit more careful when reading the first part of Module 1 as it will be the basis on which the rest of the sections are explained. The second and third parts of the Neural Networks series will be very difficult to follow without some basic knowledge of gradient descent and how optimization works.
Once you are done, call the TA to begin the small oral examination. There will not be difficult questions about very obscure concepts or applications. Instead, it will focus on some of the basic layers of CNNs and the general neural network creation and training process (backprop/gradient descent/weight organization).
We will start with an overview of LeCun's LeNet, which was the first "deep" neural network designed for the purpose of recognizing digits from the well-known MNIST dataset. The architecture of the network can be seen as follows:
From the introduction to neural nets, it should be fairly obvious which layers comprise this network. The first assignment is to create the LeNet CNN from scratch simply by looking at the output feature maps of each layer. In a practical application, the layers will usually be given in a table format (or block-table format if the network is extremely large), simply because a figure like the one above isn't extensible to larger networks; however, working from the figure is good practice for understanding how each layer affects the input feature space.
Recall the following attributes about the input and output dimensions of the feature spaces given particular layers (here we assume there are no batches, for convenience!):
Use the above information in order to build the model in Keras. Consult the Keras API documentation for any questions that you may have regarding functional arguments. If you have more general questions regarding the implementation, feel free to ask a TA for help.
Once you are done with your Keras implementation of LeNet, show a TA the successful compilation of the model and a model summary with dimensions matching output feature layer representations in the image above.
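For reference, the following is a minimal sketch of a LeNet-style model in the Keras functional API; the kernel sizes, feature-map counts, activations, and 32x32 input shape are assumptions based on the classic LeNet-5 description, so adjust them to match the figure above.

from tensorflow.keras import layers, models

inputs = layers.Input(shape=(32, 32, 1))                      # 32x32 grayscale input
x = layers.Conv2D(6, (5, 5), activation="tanh")(inputs)       # C1: 6 feature maps, 28x28
x = layers.AveragePooling2D((2, 2))(x)                        # S2: subsample to 14x14
x = layers.Conv2D(16, (5, 5), activation="tanh")(x)           # C3: 16 feature maps, 10x10
x = layers.AveragePooling2D((2, 2))(x)                        # S4: subsample to 5x5
x = layers.Flatten()(x)
x = layers.Dense(120, activation="tanh")(x)                   # C5
x = layers.Dense(84, activation="tanh")(x)                    # F6
outputs = layers.Dense(10, activation="softmax")(x)           # 10 digit classes

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()                                               # verify the feature-map dimensions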
While LeNet is technically a deep neural network, it is nowhere near as deep as most networks that followed. For example, the ResNet paper presented a network with 1000+ layers to show the benefit of adding residual connections to a neural network. While it would be interesting to implement the ResNet1202 model, we will choose an alternative family of models that has become just as popular as smaller ResNet models for images: the VGG family of networks.
Much of the explanation for VGG16 can be found in this paper by K. Simonyan and A. Zisserman. To save some time, some motivating factors of the paper will now follow. Before VGG, many award-winning models were experimenting with larger kernel sizes placed earlier in the network in an attempt to increase the initial receptive field of the network. The hope was that by including larger convolutions earlier, a better generalization for earlier layers would give later layers a better overall representation. The creators of the VGG models, however, decided to do the exact opposite. Instead of training a small network with larger kernel sizes, they chose to train a deep network with successive small convolutions in order to attain similarly sized receptive fields at the output of each "block" of convolutions. With non-linear activation functions placed after each convolution, the intuition behind this choice was that a 7x7 kernel could be approximated by a series of 3 3x3 convolutions and a 5x5 kernel could similarly be decomposed into a series of 2 3x3 convolutions. The benefit for training the network is immediately obvious. Where a 7x7 kernel would require 49 parameters to describe, a series of 3 3x3 kernels would only require 27 parameters to describe. Following this logic, the equivalent of a larger-kernel network could be represented using these convolution "blocks" with roughly half of the required parameters. This would prove to be significant for networks that already required the training of millions of parameters! Furthermore, by forcing the network kernels to be decomposed into smaller kernels, the model architecture naturally functioned as a kernel weight regularization technique.
One final detail worth adding before finishing up the logic behind the model is present in model C in the paper. Note that this model actually uses 1x1 kernels for convolution at the end of a series of two 3x3 convolutions. For those wondering how exactly a 1x1 convolution would do anything, consider looking at the output dimensions and number of channels of the convolution. First of all, it should be clear that the only difference in dimensions comes from the number of channels between the input and output of the convolution layer. The number of channels can be set by the number of kernels desired, and, since the convolution uses only a 1x1 kernel, the output width and height remain the same regardless of the padding method used. If that is the case, then what exactly is the use case for the 1x1 kernel? Basically, the 1x1 kernel serves as a channel pooling layer and increases the non-linearity of the output since an activation function naturally follows. The VGG paper only uses it for the latter as the number of input and output channels remains the same, but many newer networks take advantage of this depth-wise pooling to manage the number of channels within a block (think of the Inception networks).
We will be focusing in particular on the Keras implementation of VGG16 (Model D in the paper). For convenience, the table of models covered in the paper along with the model to be implemented can be seen below this section. Hopefully, by looking at this table of models, students should be able to explain why the network was blocked off in such a way and explain how this network is better or worse than other networks through a comparison of the difficulty in training and representational power.
The next assignment is to implement VGG16 for the purpose of classifying images of people. Instead of having students design the model from scratch, we will instead start with a base model provided by the Keras module that has been pre-trained on the ImageNet dataset. This process of reusing a feature extractor trained on one task for a new task is known as transfer learning. While the inter-class variability of the face dataset is not properly described by the ImageNet dataset, the fact that the classifier has been trained on people (and living things in general) is a good starting point at least when compared to a random initialization. Note that the closer the original problem is to the new problem, the greater the extent to which transfer learning helps.
Using the general knowledge found in the neural networks sub-page and the following Keras API page, create a VGG16 model with a custom classification layer designed for the dataset generated in the previous sections. That is, create a pre-trained VGG16 feature extractor and append two new dense layers along with their proper activation functions in order to satisfy the new classification problem. In between these two dense layers, a dropout layer should be placed to help prevent overfitting. Perhaps the most important thing to note is that a classification problem requires a particular output activation function. Please review the introduction to neural networks page if you are having trouble understanding what this means.
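A minimal sketch of such a model is shown below, assuming three-channel 224x224 inputs (grayscale crops can be stacked to three channels) and a placeholder NUM_CLASSES; the dense layer width, dropout rate, and the decision to freeze the feature extractor are illustrative choices, not the required configuration.

from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

NUM_CLASSES = 18                               # placeholder: set to the number of face classes

base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False                         # freeze the pre-trained feature extractor

inputs = layers.Input(shape=(224, 224, 3))
x = base(inputs)
x = layers.Flatten()(x)
x = layers.Dense(256, activation="relu")(x)    # first new dense layer
x = layers.Dropout(0.5)(x)                     # dropout between the two dense layers
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(x)  # classification output

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()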
After the model successfully compiles, train the network on the training dataset generated from the previous section on dataset generation for about five epochs (you may set this to whatever you like, however). While training, make sure that your network prints out the accuracy on the validation set every epoch. As training goes on, there will be a point where the validation accuracy begins to increase at a significantly slower rate than the training accuracy. At this point, the network is likely overfitting the data on the training set and no further significant improvement in the validation accuracy will be observed. Try adjusting the training procedure and/or learning rate to ensure that severe overfitting does not occur. If you are unsure what qualifies as overfitting, don't worry as the original parameters were chosen to only observe minimal to moderate overfitting. Save the model once it has finished training and make sure that you take note of the training/validation accuracy via a plot or table as it will be required in order to check off your work. This page of the Keras FAQ may help you when attempting to save or load the model. Make sure that you save the entire model and not just the weights!
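As a rough guide, the training and saving step could look like the sketch below, which assumes the /Data directory layout from the splitting section and the model object built above; the file name, batch size, and epoch count are placeholders.

from tensorflow.keras.preprocessing.image import ImageDataGenerator

gen = ImageDataGenerator(rescale=1.0 / 255)    # scale pixel values to [0, 1]
train = gen.flow_from_directory("Data/Train", target_size=(224, 224), batch_size=32)
val = gen.flow_from_directory("Data/Validation", target_size=(224, 224), batch_size=32)

# validation accuracy is reported every epoch via validation_data
history = model.fit(train, validation_data=val, epochs=5)

model.save("vgg16_faces.h5")                   # save the entire model, not just the weights
print(history.history["accuracy"])             # training accuracy per epoch
print(history.history["val_accuracy"])         # validation accuracy per epoch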
The starter code for this section can be found in /Project_3/train_vgg.py. The method for loading in data should not be modified. If an exception has occurred, it is likely due to the fact that an incorrect input image directory has been used or an incorrect number of classes has been identified. The training accuracy of this step should reach at least 95% depending on the variability of your dataset and how well the previous preprocessing steps were followed. The validation accuracy of the network should also reach a similar degree of accuracy. If your network does not train to such a degree after several attempts, request assistance from a TA as your network may not have been implemented properly.
With a trained VGG16 model at hand, we are now ready to evaluate the "real-world" performance of the model. In essence, this means attempting to classify never-before-seen images and assessing the overall accuracy of the model. Using the model saved from the previous sub-module and the testing dataset, evaluate the performance of the model and show the results to a TA/Tutor. Comment on whether these results are expected or unexpected and why.
You may find the starter code for this section under /Project_3/test_vgg.py.
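A minimal evaluation sketch under the same assumptions (the saved model name and /Data/Test directory used above) follows; the real starter file may load data differently.

from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.image import ImageDataGenerator

model = load_model("vgg16_faces.h5")           # load the full saved model
gen = ImageDataGenerator(rescale=1.0 / 255)
test = gen.flow_from_directory("Data/Test", target_size=(224, 224), batch_size=32, shuffle=False)

loss, accuracy = model.evaluate(test)          # overall accuracy on unseen images
print("Test accuracy:", accuracy)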
This final section will cover the integration of a Flask web server into the entire classification process in order to enable off-site processing and user identification. Most of the Flask implementation will be given to the students; however, some cursory understanding of how the API works will be required in order to create a functioning client and server pair. As part of the original repository, a root page (index.html) is given that allows students to test the server Python file separately from the client in order to allow for a more streamlined debugging process. In order to run the server remotely with support for webcam usage, make sure to run the server script in an SSL context by calling the file with the --ssl flag as follows:
python3 server.py --ssl
The server is based on a simple Flask implementation in Python. The actual server is started using a single line, while the accessible routes are determined by the functions that are decorated by the app.route decorator. If you recall from the Python introduction section, app.route is actually a decorator factory. It returns a decorator that is custom-made to allow the function access to data received on a particular route as specified by the argument of app.route. Included in the given server starter code, /Project_3/, is an additional root page (index.html), which is automatically served when accessing the root of the publicly viewable IPv4 address. That is, in order to access the website, it should be done in the following fashion:
http[s]://(Public IPv4 Address):8080/
Anything following the final "/" can be considered a unique link which can be served by the server.
The root page itself can be used as a testing site for the server file. When run using the SSL context as shown in the section above, the site can request webcam access on your laptop/desktop and submit images. Note that since the image sent by the root page will not be in the proper format, it will not provide a valid output from the CNN unless the preprocessing steps are moved from the client file to the server file (this is not required). For this section, complete the server.py file within the /Project_3/ directory with the requested changes and ensure that the root page receives a proper response when sending an image from the webcam.
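As an illustration of the Flask pattern described above, here is a minimal sketch of a prediction endpoint; the /predict route name, the "image" form field, and the JSON response format are assumptions for illustration (the provided server.py and its --ssl handling define the actual interface).

import numpy as np
import cv2
from flask import Flask, request, jsonify
from tensorflow.keras.models import load_model

app = Flask(__name__)
model = load_model("vgg16_faces.h5")           # load the trained model once at startup

@app.route("/predict", methods=["POST"])
def predict():
    # decode the uploaded image bytes into an OpenCV image
    data = np.frombuffer(request.files["image"].read(), dtype=np.uint8)
    img = cv2.imdecode(data, cv2.IMREAD_COLOR)
    img = cv2.resize(img, (224, 224)).astype("float32") / 255.0
    probs = model.predict(img[np.newaxis])[0]  # class probabilities from the CNN
    return jsonify({"class": int(np.argmax(probs)), "probability": float(np.max(probs))})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)         # the provided starter adds the --ssl option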
The client functions very similarly to the process followed in the initial data collection scheme, but with a couple modifications intended for making data sending easy. The following is the algorithm that should be followed by the client:
Set up the classifier and load the weights for the face cascade.
For every frame received by the PiCamera:
Process all possible faces in the PiCamera.
If a face is detected, then:
Show the face detected and prompt the user for confirmation of the request.
If user agrees, then for the most likely face:
Send the cropped grayscale face to the server endpoint for classification.
Request the server's prediction for the user's identity
Display the image along with the name and probability value.
This should run indefinitely as the number of frames captured by the PiCamera would generate a constant stream. Furthermore, ensure that the image sent to the server is a square, as any non-square images will be forcibly resized by the time the server receives them. Such warping is likely to cause an incorrect identification as the features the network was trained on will not be preserved. The starter code can be found in the client.py file within the /Project_3/ directory.
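For the network exchange itself, a minimal sketch of the request/response step is shown below, assuming the hypothetical /predict endpoint and "image" field from the server sketch and the requests library; face is the cropped square grayscale image produced by the detection loop.

import cv2
import requests

SERVER = "https://<Public IPv4 Address>:8080/predict"   # placeholder address

def query_server(face):
    # send a cropped face to the server and return its (class, probability) guess
    ok, encoded = cv2.imencode(".jpg", face)             # encode the crop as JPEG bytes
    response = requests.post(
        SERVER,
        files={"image": ("face.jpg", encoded.tobytes(), "image/jpeg")},
        verify=False)                                    # the project uses a self-signed SSL cert
    result = response.json()
    return result["class"], result["probability"]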
With the server and client files both properly set up, it is now time to connect the system and have each file interface with the other. Make sure to run the client on the Raspberry Pi and the server on a machine with enough compute power to hold the VGG16 model and any temporary variables. Show the TA that your model outputs a proper guess given a camera capture with your face in it.
The only issue with this algorithm is that the model only returns the output for a single person. Adjusting the model to work with multiple users is as simple as adjusting a single loop in the program. If you finished early, then try modifying the client algorithm to work with multiple face inputs! [When working with multiple faces, it is advisable to remove confidence values and simply place the suggested users above the bounding boxes.]
Alternatively, recall from earlier that the server.py file would benefit from having the preprocessing moved into it, so that images taken on the root server page also produce proper identification results. Feel free to modify the client.py and server.py files so that the entire image is sent instead of only the cropped face. Note that the client should still only send frames containing detected faces so that the server is not overloaded with requests!
CS231n - Great source for pretty much the entire project sequence. Covers all of the theory necessary to build models and understand the process in modules 1 & 2 of the course notes page. Note that much of the introduction is spent on motivating the usage of neural networks over other classifiers (such as SVMs), which you may find useful if you have some basic prior ML knowledge.
Deep Learning by Ian Goodfellow - Perhaps the most useful math-based introduction to neural networks, written by leading researchers in the field. It also has a small section on state-of-the-art (at least at the time!) networks whose architectures are still used in today's networks. If you don't mind the math, this is usually the best place to start understanding neural nets in depth.
Flask API Tutorial - A decent introduction to using the Flask API to host web servers locally. It covers some basic examples along with the thought process behind implementing certain functions.