Our Approaches

First Function: Mask Detection

In our mask detection system, we determine whether a person is wearing a mask. In the following pictures [1], we would identify figures 1 & 2 as "yes", i.e., wearing a mask, while we would judge figures 3 & 4 as "no".

Figure 1

Figure 2

Figure 3

Figure 4

The main tool we use here is a neural network. The dataset we used is from GitHub [2], and the architecture we use is MobileNetV2 [3], proposed by Sandler et al. We introduce this architecture in the following section.

MobileNetV2

Currently, there are many artificial neural network designs for image recognition, such as AlexNet, VGGNet, and ResNet. However, these architectures need a lot of operations and memory to run. Our goal is to design a mask-detection system for mobile devices, so it should be lightweight and run fast. Thus, we chose MobileNetV2 [3] for our project to train a mask-detection model.
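The efficiency gap mentioned above can be made concrete with a back-of-the-envelope operation count for the depthwise-separable convolutions that the MobileNet family is built on. The layer sizes below are illustrative, not taken from [3]:

```python
# Back-of-the-envelope multiply-accumulate (MAC) counts comparing a
# standard convolution with the depthwise-separable convolution that
# MobileNets are built on. The layer sizes are illustrative.

def standard_conv_macs(k, c_in, c_out, h, w):
    """MACs for a k x k standard convolution on an h x w feature map."""
    return k * k * c_in * c_out * h * w

def depthwise_separable_macs(k, c_in, c_out, h, w):
    """A k x k depthwise conv per channel, then a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in * h * w
    pointwise = c_in * c_out * h * w
    return depthwise + pointwise

std = standard_conv_macs(3, 32, 64, 112, 112)
sep = depthwise_separable_macs(3, 32, 64, 112, 112)
print(std, sep, round(std / sep, 2))  # the separable version is ~8x cheaper
```

For a 3 × 3 kernel, the separable factorization is roughly 8–9 times cheaper at this layer size, which is the kind of saving that makes the architecture practical on mobile hardware.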

MobileNetV2 is a network that significantly decreases the number of operations and the memory needed while retaining accuracy comparable to the image-recognition networks mentioned above. This is attributed to its novel layer module: the inverted residual with a linear bottleneck. “This module takes as an input a low-dimensional compressed representation which is first expanded to high dimension and filtered with a lightweight depthwise convolution.” [3]


Table 1: MobileNetV2 [3]

In Table 1, each line describes a sequence of one or more identical (modulo stride) layers, repeated n times. All layers in the same sequence have the same number c of output channels. The first layer of each sequence has stride s; all others use stride 1. All spatial convolutions use 3 × 3 kernels.
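The expansion rule just described can be sketched in a few lines of code. The (c, n, s) sample rows below are illustrative, not the full table:

```python
# Expanding one (c, n, s) row of the table into concrete layers: the
# sequence repeats n times with c output channels; only the first layer
# uses stride s, the rest use stride 1. The sample rows are illustrative,
# not the full table.

def expand_row(c, n, s):
    return [(c, s if i == 0 else 1) for i in range(n)]

rows = [(24, 2, 2), (32, 3, 2), (64, 4, 2)]
layers = [layer for row in rows for layer in expand_row(*row)]
print(layers)  # e.g. (32, 3, 2) becomes (32, 2), (32, 1), (32, 1)
```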

Figure 5: Convolutional blocks of MobileNetV2 [3]

Table 2: The maximum number of channels/memory (in KB) that needs to be materialized at each spatial resolution for different architectures. [3]

In Table 2, we can see that MobileNetV2 needs less memory than the other architectures: it needs only 400 KB.

Figure 6: Inverted residual block [3]

Figure 7: Bottleneck with expansion layer [3]

Figures 6 and 7 show the module used in this network. This module can be efficiently implemented using standard operations in any modern framework and allows the models to beat the state of the art at multiple performance points on standard benchmarks. Furthermore, this convolutional module is particularly suitable for mobile designs, because it significantly reduces the memory footprint needed during inference by never fully materializing large intermediate tensors. This reduces the need for main-memory access in many embedded hardware designs that provide small amounts of very fast software-controlled cache memory. [3]
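As a rough illustration of why the large expanded tensor never needs to be kept around, here is a shape walk-through of one inverted residual block following the expand → depthwise → linear-projection structure of Figures 6 and 7. The input size, channel counts, and expansion factor t are made up for this sketch:

```python
# Shape walk-through of one inverted residual block (following the
# expand -> depthwise -> linear-projection structure of Figures 6-7).
# The input size, channel counts, and expansion factor t are made up.

def inverted_residual_shapes(h, w, c_in, c_out, t=6, stride=1):
    expanded = (h, w, c_in * t)                     # 1x1 expansion conv
    spatial = (h // stride, w // stride, c_in * t)  # 3x3 depthwise conv
    projected = (spatial[0], spatial[1], c_out)     # 1x1 linear bottleneck
    return [expanded, spatial, projected]

# The wide (t = 6) tensor exists only inside the block, so an
# implementation never has to materialize it across block boundaries.
print(inverted_residual_shapes(56, 56, 24, 24))
```

Only the narrow bottleneck tensors cross block boundaries; the wide intermediate tensor is internal to the block, which is what the memory argument above relies on.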

The inverted residual bottleneck layers allow a particularly memory-efficient implementation, which is very important for mobile applications. A standard efficient implementation of inference that uses, for instance, TensorFlow or Caffe builds a directed acyclic compute hypergraph G, consisting of edges representing the operations and nodes representing tensors of intermediate computation. The computation is scheduled to minimize the total number of tensors that need to be stored in memory. In the most general case, it searches over all plausible computation orders Σ(G) and picks the one that minimizes

M(G) = min_{π ∈ Σ(G)} max_{i ∈ 1..n} [ Σ_{A ∈ R(i,π,G)} |A| + size(π_i) ]   (Equation 1 [3])

where R(i, π, G) is the list of intermediate tensors that are connected to any of the π_i . . . π_n nodes, |A| represents the size of the tensor A, and size(i) is the total amount of memory needed for internal storage during operation i. [3]
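A toy version of this scheduling objective, for the simple case where the graph is a linear chain of operations, might look like the following. All sizes are made up for illustration:

```python
# Toy version of the scheduling objective for the simple case of a chain
# graph: op i reads tensor i, writes tensor i+1, and needs some internal
# scratch memory. All sizes below are made up for illustration.

def peak_memory(tensor_sizes, op_scratch):
    """Largest amount of memory resident at any step of the chain."""
    peak = 0
    for i, scratch in enumerate(op_scratch):
        live = tensor_sizes[i] + tensor_sizes[i + 1] + scratch
        peak = max(peak, live)
    return peak

# A 600-unit expanded tensor between two 100-unit bottleneck tensors
# dominates the peak, which is why [3] splits such tensors into chunks.
print(peak_memory([100, 600, 100], [16, 16]))
```

The peak is set by the wide expanded tensor, matching the observation above that never fully materializing it is what saves memory.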

Second Function: Effectiveness Check

Besides mask detection, our system can tell whether a person wears a mask correctly. In the next three pictures [4], we identify figure 8 as an example of wearing a mask properly, while the other two are not effective.

Figure 8

Figure 9

Figure 10

To achieve this function, we use the technique of facial landmark identification: we use Haar cascade classifiers [5] to detect whether the nose is covered.
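On top of the detector outputs, the effectiveness decision can be sketched as follows. The (x, y, w, h) box format matches OpenCV's conventions, but the containment rule itself is our own illustrative simplification; the actual face and nose detection with a Haar cascade (e.g. OpenCV's `CascadeClassifier.detectMultiScale`) is assumed to have run upstream:

```python
# Sketch of the effectiveness decision on top of the detector outputs.
# Boxes are (x, y, w, h); the actual face/nose detection with a Haar
# cascade (e.g. OpenCV's CascadeClassifier.detectMultiScale) is assumed
# to have run upstream. The containment rule is our own simplification.

def inside(inner, outer):
    x, y, w, h = inner
    ox, oy, ow, oh = outer
    return ox <= x and oy <= y and x + w <= ox + ow and y + h <= oy + oh

def mask_effective(face_box, nose_boxes):
    """The mask is worn properly iff no detected nose lies in the face."""
    return not any(inside(n, face_box) for n in nose_boxes)

print(mask_effective((0, 0, 100, 100), []))                  # nose covered
print(mask_effective((0, 0, 100, 100), [(40, 50, 20, 20)]))  # nose visible
```

If the nose cascade fires inside the face region, the mask is judged not effective (as in figures 9 and 10); if no nose is found, the mask is judged to be worn properly (figure 8).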