What Does A Face Detection Neural Network Look Like?


In my last post, I explored the Multi-task Cascaded Convolutional Network (MTCNN) model, using it to detect faces with my webcam. In this post, I will examine the structure of the neural network.

The MTCNN model consists of 3 separate networks: the P-Net, the R-Net, and the O-Net:

Image 1: MTCNN Structure // Source

For every image we pass in, the network creates an image pyramid: that is, it creates multiple copies of that image in different sizes.

Image 2: Image Pyramid // Source
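The pyramid can be sketched in a few lines of NumPy. The 0.709 scale factor and `min_size=12` below are assumptions for illustration (0.709 is the factor commonly used with MTCNN, and 12 matches the P-Net kernel size), not values taken from this repository's code.

```python
import numpy as np

def image_pyramid(image, min_size=12, factor=0.709):
    """Yield progressively smaller copies of `image` until the
    shorter side would drop below `min_size` pixels."""
    pyramid = []
    h, w = image.shape[:2]
    scale = 1.0
    while min(h, w) * scale >= min_size:
        new_h, new_w = int(h * scale), int(w * scale)
        # Nearest-neighbour resize via index sampling (keeps the sketch
        # dependency-free; a real pipeline would use something like cv2.resize).
        rows = (np.arange(new_h) / scale).astype(int)
        cols = (np.arange(new_w) / scale).astype(int)
        pyramid.append(image[rows[:, None], cols])
        scale *= factor
    return pyramid
```

For a 100x100 input with these settings, this produces seven scaled copies, from 100x100 down to 12x12.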

In P-Net, a 12x12 kernel slides across each scaled image, searching for a face. In the image below, the red square represents this kernel, which slowly moves across and down the image, searching for a face.

Image 3: The 12x12 kernel in the top-right corner. After scanning this corner, it shifts sideways (or downwards) by one pixel, and continues doing so until it has covered the entire image.
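That scan can be sketched as a plain sliding-window loop. Here `score_patch` is a hypothetical stand-in for P-Net's per-window face score; in the real model the scan happens in a single convolutional forward pass rather than an explicit Python loop.

```python
import numpy as np

def slide_12x12(image, score_patch, stride=1):
    """Run `score_patch` on every 12x12 window and return a score map."""
    h, w = image.shape[:2]
    scores = np.zeros((h - 12 + 1, w - 12 + 1))
    for y in range(0, h - 12 + 1, stride):
        for x in range(0, w - 12 + 1, stride):
            # Each (y, x) entry scores the 12x12 patch whose top-left
            # corner sits at that position.
            scores[y, x] = score_patch(image[y:y + 12, x:x + 12])
    return scores
```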

Each of these 12x12 patches is passed through three convolution layers with 3x3 kernels (if you don't know what convolutions are, check out my other article or this site). After every convolution layer, a PReLU layer is applied: every negative value is multiplied by a certain number 'alpha', which is learned during training. In addition, a max-pool layer is inserted after the first PReLU layer; max-pooling keeps only the largest value in each small neighbourhood and discards the rest.

Image 4: Max-pool // Source
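Both operations are easy to sketch in NumPy. The default `alpha=0.25` below is just an illustrative starting value, since the real alphas are learned per channel during training.

```python
import numpy as np

def prelu(x, alpha=0.25):
    """PReLU: pass positives through, scale negatives by `alpha`."""
    return np.where(x > 0, x, alpha * x)

def max_pool_2x2(x):
    """2x2 max-pooling: keep the largest value in each 2x2 block."""
    h, w = x.shape[0] // 2 * 2, x.shape[1] // 2 * 2  # trim odd edges
    return x[:h, :w].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```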

After the third convolution layer, the network splits into two branches: the activations from the third layer are passed to two separate convolution layers, with a softmax layer after one of them. Softmax assigns a decimal probability to each possible result, and the probabilities sum to 1. In this case, it outputs two probabilities: the probability that there is a face in the area, and the probability that there isn't.
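As a quick sketch, a two-way softmax looks like this in NumPy:

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()
```

Feeding it two raw scores, say `[2.0, 0.5]`, yields two probabilities that sum to 1, with the larger score getting the larger probability.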

Image 5: P-Net

Convolution 4-1 outputs the probability of a face being in each bounding box, and convolution 4-2 outputs the coordinates of the bounding boxes.

Taking a look at mtcnn.py will show you the structure of P-Net:

class PNet(Network):
    def _config(self):
        layer_factory = LayerFactory(self)
        layer_factory.new_feed(name='data', layer_shape=(None, None, None, 3))
        layer_factory.new_conv(name='conv1', kernel_size=(3, 3), channels_output=10,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu1')
        layer_factory.new_max_pool(name='pool1', kernel_size=(2, 2), stride_size=(2, 2))
        layer_factory.new_conv(name='conv2', kernel_size=(3, 3), channels_output=16,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu2')
        layer_factory.new_conv(name='conv3', kernel_size=(3, 3), channels_output=32,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu3')
        layer_factory.new_conv(name='conv4-1', kernel_size=(1, 1), channels_output=2,
                               stride_size=(1, 1), relu=False)
        layer_factory.new_softmax(name='prob1', axis=3)
        layer_factory.new_conv(name='conv4-2', kernel_size=(1, 1), channels_output=4,
                               stride_size=(1, 1), input_layer_name='prelu3', relu=False)

R-Net has a similar structure, but with even more layers. It takes the bounding boxes from P-Net as input and refines their coordinates.
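As a rough sketch of that refinement step, the four regression outputs can be read as offsets relative to the box's width and height. The exact offset convention below is an assumption for illustration, not taken from mtcnn.py.

```python
def refine_box(box, offsets):
    """Shift a box's corners by offsets given as fractions of its size.

    `box` is (x1, y1, x2, y2); `offsets` is the network's four
    regression outputs (dx1, dy1, dx2, dy2)."""
    x1, y1, x2, y2 = box
    dx1, dy1, dx2, dy2 = offsets
    w, h = x2 - x1, y2 - y1
    return (x1 + dx1 * w, y1 + dy1 * h, x2 + dx2 * w, y2 + dy2 * h)
```

For example, offsets of (0.1, 0.1, -0.1, -0.1) shrink a box by 10% of its size on every side.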

Image 6: R-Net

Similarly, R-Net splits into two branches at the end, producing two outputs: the coordinates of the refined bounding boxes and the network's confidence in each one. Again, mtcnn.py includes the structure of R-Net:

class RNet(Network):
    def _config(self):
        layer_factory = LayerFactory(self)
        layer_factory.new_feed(name='data', layer_shape=(None, 24, 24, 3))
        layer_factory.new_conv(name='conv1', kernel_size=(3, 3), channels_output=28,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu1')
        layer_factory.new_max_pool(name='pool1', kernel_size=(3, 3), stride_size=(2, 2))
        layer_factory.new_conv(name='conv2', kernel_size=(3, 3), channels_output=48,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu2')
        layer_factory.new_max_pool(name='pool2', kernel_size=(3, 3), stride_size=(2, 2), padding='VALID')
        layer_factory.new_conv(name='conv3', kernel_size=(2, 2), channels_output=64,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu3')
        layer_factory.new_fully_connected(name='fc1', output_count=128, relu=False)
        layer_factory.new_prelu(name='prelu4')
        layer_factory.new_fully_connected(name='fc2-1', output_count=2, relu=False)
        layer_factory.new_softmax(name='prob1', axis=1)
        layer_factory.new_fully_connected(name='fc2-2', output_count=4, relu=False, input_layer_name='prelu4')

Finally, O-Net takes the R-Net bounding boxes as input and predicts the coordinates of facial landmarks.

Image 7: O-Net

O-Net splits into three branches at the end, producing three outputs: the probability of a face being in the box, the coordinates of the bounding box, and the coordinates of the facial landmarks (the locations of the eyes, nose, and mouth). Here's the code for O-Net:
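The ten landmark values correspond to five (x, y) points: the two eyes, the nose, and the two mouth corners. Here is a small sketch of pairing them up, assuming the common layout of five x-values followed by five y-values (the actual ordering in mtcnn.py may differ).

```python
# Five landmark names, in the assumed output order.
LANDMARK_NAMES = ['left_eye', 'right_eye', 'nose', 'mouth_left', 'mouth_right']

def decode_landmarks(raw10):
    """Pair 10 raw outputs into five named (x, y) landmark points."""
    xs, ys = raw10[:5], raw10[5:]
    return dict(zip(LANDMARK_NAMES, zip(xs, ys)))
```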

class ONet(Network):
    def _config(self):
        layer_factory = LayerFactory(self)
        layer_factory.new_feed(name='data', layer_shape=(None, 48, 48, 3))
        layer_factory.new_conv(name='conv1', kernel_size=(3, 3), channels_output=32,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu1')
        layer_factory.new_max_pool(name='pool1', kernel_size=(3, 3), stride_size=(2, 2))
        layer_factory.new_conv(name='conv2', kernel_size=(3, 3), channels_output=64,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu2')
        layer_factory.new_max_pool(name='pool2', kernel_size=(3, 3), stride_size=(2, 2), padding='VALID')
        layer_factory.new_conv(name='conv3', kernel_size=(3, 3), channels_output=64,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu3')
        layer_factory.new_max_pool(name='pool3', kernel_size=(2, 2), stride_size=(2, 2))
        layer_factory.new_conv(name='conv4', kernel_size=(2, 2), channels_output=128,
                               stride_size=(1, 1), padding='VALID', relu=False)
        layer_factory.new_prelu(name='prelu4')
        layer_factory.new_fully_connected(name='fc1', output_count=256, relu=False)
        layer_factory.new_prelu(name='prelu5')
        layer_factory.new_fully_connected(name='fc2-1', output_count=2, relu=False)
        layer_factory.new_softmax(name='prob1', axis=1)
        layer_factory.new_fully_connected(name='fc2-2', output_count=4, relu=False, input_layer_name='prelu5')
        layer_factory.new_fully_connected(name='fc2-3', output_count=10, relu=False, input_layer_name='prelu5')

Note that the code for P-Net, R-Net, and O-Net all imports a class named "LayerFactory". In essence, LayerFactory is a class, created by the makers of this model, for generating layers with specific settings. For more information, you can check out layer_factory.py.

Click here to read about implementing the MTCNN model!

Click here to read about how the MTCNN model works!

Download the MTCNN paper and resources here:

Github download: https://github.com/vaibhavhariaramani/mtcnn

Research article: http://arxiv.org/abs/1604.02878