Proposed Face Detection Model by MobileNetV2 Using Asian Data Set

In 2019, the infectious coronavirus disease 2019 (COVID-19) was first reported in Wuhan, China. It subsequently became a worldwide public health problem. The pandemic has had a heavy impact on the lives of people in our country, and all countries are trying to control the spread of the disease. One key measure is for every person to wear a mask in public places. Therefore, in this paper we propose a model capable of distinguishing between masked and nonmasked faces using a convolutional neural network (CNN) based on deep learning (DL), MobileNetV2. The model can detect people who are not wearing masks, with an accuracy of up to 99.37%. It can be applied in places such as schools and offices to monitor mask-wearing.


Introduction
According to the World Health Organization [1][2][3][4], coronavirus disease 2019 (COVID-19) is an infectious disease caused by the SARS-CoV-2 virus. The virus can spread from an infected person's mouth or nose in small liquid particles when they cough, sneeze, speak, sing, or breathe.
Most people infected with the virus will experience mild-to-moderate respiratory illness and recover without requiring special treatment. However, some people will become seriously ill and require medical attention. Older people with underlying medical conditions such as cardiovascular disease, diabetes, chronic respiratory disease, or cancer are more likely to develop serious illnesses. Anyone can get sick with COVID-19 and become seriously ill or die at any age. The best way to prevent and slow down transmission is to be well informed about the disease and how the virus spreads. We can protect ourselves and others from infection by staying at least 2 meters apart from others, wearing a properly fitted mask, and washing our hands or using an alcohol-based rub frequently.
Vaccines have been developed. However, they mainly relieve symptoms in case of infection and cannot fully prevent the spread of the disease. Vietnam has carried out vaccination coverage across the country and is aiming to bring activities back to normal. Therefore, wearing a mask remains essential to slow down the spread of COVID-19.
However, not everyone does this well all the time. Many people do not wear masks, or wear them incorrectly, in public. This greatly affects the prevention of the disease.
To support the control of mask-wearing in public places, we propose a model that can recognize and distinguish between people who are and are not wearing masks. The model is trained based on deep learning and can recognize faces wearing or not wearing masks in input images and videos. The paper has the following three main points. First, a face detector model, Retina Face, is used to detect faces. Second, we use the MaskTheFace program to create our data set. Finally, the MobileNetV2 model is trained on our data set and used to classify whether the input faces are masked or not. The rest of the paper is organized as follows. In Section 2, we present related work. In Sections 3 and 4, we present and evaluate the effectiveness of the proposed model, respectively. Finally, we give a conclusion in Section 5.

Related Work
Since the outbreak of the COVID-19 pandemic, people have been severely affected. This respiratory disease spreads rapidly. Countries around the world have had to apply many measures to prevent it, even locking down and barring entry of travelers from other countries. Up to now, many vaccines have been developed. These vaccines reduce symptoms and the impact of the disease on people. However, they cannot prevent infection [5][6][7]. With the current high vaccination coverage rate, the disease is no longer too dangerous for healthy people. However, it is still dangerous for underage children and for elderly people who have other diseases. All countries are moving toward the reopening of outdoor activities and tourism services to prevent economic decline. Our country is also moving toward normalcy. Everyone can resume activities as before: traveling, going to the office, or going to school, as shown in Figures 1 and 2.
However, we should still follow pandemic prevention regulations to minimize risks. The most strongly recommended measure is wearing a mask. Therefore, an automatic system to detect people who are not wearing masks has attracted much attention as a way to monitor compliance with mask-wearing.
This speed is quite impressive and can be applied in real time. However, the specific accuracy and data set are not published. Therefore, we cannot estimate the accuracy of the model. The authors [9] designed a face mask identification method using the SRCNet classification network and achieved an accuracy of 98.7% in classifying images into three categories, namely, correct, incorrect, and not wearing face masks. Their frame rate is 10 fps. This model can detect even cases of wearing a mask incorrectly while remaining quite accurate. Since the data set used to train the model aims at diversity, it does not focus on a single group of subjects. The authors [10] researched a deep learning model for detecting masks over faces in public places. The proposed model efficiently handles varying kinds of occlusions in dense situations by making use of an ensemble of single- and two-stage detectors. The accuracy of the model is 98.2% for mask detection with an average inference time of 0.05 seconds per image. The papers [11][12][13][24] present the MobileNetV2 deep learning (DL) method to detect mask wearers in real-time images and video. The network is trained to perform two-class identification of people with and without masks. Sample images were obtained from the Real-World Masked Face Dataset (RMFRD). The results show that the accuracy of the network is 99.22% with 4,591 samples. The above masked face detection models are all trained on well-known masked face data sets such as the Real-World Masked Face Dataset (RMFRD), Labeled Faces in the Wild (LFW), and the face mask label data set (FMLD). People on each continent have different appearance characteristics: face, hair, skin color, eye color, and so on. In addition, people in each country prefer and use different mask types. Therefore, the existing data sets cannot cover the characteristics of all people in every country.
Within the scope of research in Vietnam, the paper proposes a model to classify faces as wearing or not wearing masks and trains it on an appropriate data set covering local appearance characteristics and mask types. This data set provides the features closest to Vietnamese people. Thereby, the trained model is expected to achieve high accuracy when applied in Vietnam.

Deep Learning.
Artificial intelligence (AI) is a wide-ranging branch of computer science concerned with building smart machines capable of performing tasks that typically require human intelligence [25]. Well-known AI applications include Siri, Alexa, and other smart assistants; self-driving cars; conversational bots; e-mail spam filters; and Facebook's tag recommendations.
Machine learning (ML) is a subset of AI that provides systems the ability to automatically learn and improve from experience without being explicitly programmed. In ML, there are different algorithms (e.g., neural networks) that help solve problems.
DL is a subfield of ML concerned with algorithms inspired by the structure and function of the brain, called artificial neural networks [26]. It uses multiple layers to progressively extract higher-level features from the raw input. For example, lower layers may identify edges, while higher layers may identify concepts meaningful to a human such as digits, letters, or faces. DL algorithms perform a task multiple times to improve the results. These systems help a computer model filter the input data through layers to predict and classify information. DL processes information in a manner similar to the human brain. The architectures of DL networks are classified into convolutional neural networks (CNNs), recurrent neural networks, and recursive neural networks [27].

Artificial Neural Networks (ANNs).
In information technology (IT), an ANN is a system of hardware and/or software patterned after the operation of neurons in the human brain. ANNs are a variety of DL technology that also falls under the umbrella of AI [28]. Their name and structure are inspired by the human brain, mimicking the way that biological neurons signal to one another. An ANN comprises node layers, containing an input layer, one or more hidden layers, and an output layer.
As we can see in Figure 3, each node or artificial neuron connects to others and has an associated weight and threshold. If the output of any individual node is above the specified threshold value, it is activated and sends data to the next layer of the network. Otherwise, no data is passed along to the next layer. Each node has its own linear regression model, composed of input data, weights, a bias (or threshold), and an output. The formula is

output = f(∑_{i=1}^{m} w_i x_i + bias),

where w_i are the weights, x_i are the input data, and f is the activation function. Once an input layer is determined, weights are assigned. These weights help determine the importance of any given variable, with larger ones contributing more significantly to the output compared to other inputs. All inputs are then multiplied by their respective weights and summed. The sum is then passed through an activation function, which determines the output. If the output exceeds a given threshold, the node fires (activates) and passes data to the next layer in the network. As a result, its output becomes the input of the next node. This process of passing data from one layer to the next defines this neural network as a feedforward network.
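The weighted-sum-and-threshold behavior described above can be sketched in a few lines of NumPy (a minimal illustration; the array values here are arbitrary):

```python
import numpy as np

def neuron_forward(x, w, b):
    """Weighted sum of inputs plus bias, passed through a step activation."""
    z = np.dot(w, x) + b          # sum_i w_i * x_i + bias
    return 1.0 if z > 0 else 0.0  # node fires only above the threshold

x = np.array([0.5, 0.3, 0.2])   # input data
w = np.array([0.4, 0.7, -0.2])  # learned weights
b = -0.3                        # bias (threshold)
print(neuron_forward(x, w, b))  # 1.0 -- the weighted sum 0.07 exceeds 0
```

In a real network, the step function is replaced by a smooth activation such as the sigmoid, so that gradients can flow during training.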
Activation functions are usually nonlinear. One of the most widely used activation functions today is the sigmoid function:

sigmoid(x) = 1 / (1 + e^(−x)).

The loss function is used to evaluate accuracy. It measures the deviation of the prediction result from the actual value. If the model makes more wrong predictions, the value of the loss function will be larger; the more correct the predictions, the lower its value. Finally, the model uses the backpropagation algorithm to compute the gradients of the parameters and find the parameters of the desired neural network.
This method traverses the neural network in the reverse direction, from the output to the input. The backpropagation algorithm stores the intermediate variables, which are the partial derivatives obtained during the gradient computation over the parameters.

Convolutional Neural Network (CNN).
A CNN (ConvNet) is a class of deep neural networks [10]. CNNs are made up of neurons with learnable weights and biases. A CNN uses a special technique called convolution, which reduces images into a form that is easier to process without losing the features that are critical for a good prediction.
Each neuron takes several inputs, performs a matrix multiplication, and then applies a nonlinear function. The CNN architecture encodes certain properties of the image into the model. This makes forward propagation more efficient to deploy and greatly reduces the number of parameters in the network. By contrast, a regular neural network takes a one-dimensional vector as input and transforms it through a series of hidden layers. Each hidden layer is made up of a set of neurons fully connected to all neurons in the previous layer, and the neurons in a hidden layer operate completely independently and do not share connections.
The final fully connected layer is called the output layer; in a classification problem, it represents the probability of each class, as shown in Figure 4.

Figure 3: An example of ANN architecture with an input layer, multiple hidden layers, and an output layer [11,12].

Journal of Electrical and Computer Engineering
Unlike a regular neural network, the layers of a CNN have neurons arranged in three dimensions, namely, width, height, and depth, as shown in Figure 5. For example, an input image in the CIFAR-10 data set is 32 × 32 × 3 [31], and a CNN connects each neuron only to a small area of the previous layer instead of to all neurons as in a regular network.
A CNN includes three main layers, namely, convolutional, pooling, and fully connected layers, as shown in Figure 6. These layers are stacked to form the architecture of the CNN. The number and ordering of these layers create different models suitable for different problems. The convolutional layer is the most important layer of a CNN. This layer is responsible for performing most of the computation. Convolutional layers are often used as the first layers to extract features of the input image. The result of convolving the input image with a filter is a matrix called a feature map. Over many convolutional layers, the feature maps tend to decrease in size (width and height) and increase in depth (channels). This is one reason that CNNs work well on image recognition problems. Figure 7 illustrates how the convolution layer works with color images.
We learn different features of the image with each different filter. In each convolution layer, we use many filters to learn many attributes of the image. A convolution layer applying K filters outputs a three-dimensional tensor whose size, for a W × H × D input with filter size F, stride S, and padding P, is calculated as

((W − F + 2P)/S + 1) × ((H − F + 2P)/S + 1) × K.

An activation function is often applied to every point of the feature maps; the size of the feature maps does not change when passing through the activation function. In CNNs, the most commonly used activation function is the rectified linear unit (ReLU):

ReLU(x) = max(0, x).

The pooling layer is often used between convolutional layers in CNN architectures. Its function is to reduce the size of the representation space in order to reduce the number of parameters and the computational requirements of the model, and it helps control overfitting. The pooling layer works independently on each depth slice. Assume the kernel size of the pooling layer is K × K and its input is H × W × D, decomposed into D matrices of size H × W. On each K × K area of a matrix, we take the maximum or the average value of the data and write it to the output. There are two main types, namely, max and average pooling, as shown in Figure 8.
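The output-size formula and the ReLU activation above can be checked with a short sketch (the shapes below are illustrative; 32 × 32 matches the CIFAR-10 example mentioned earlier):

```python
import numpy as np

def conv_output_size(w, h, f, s, p, k):
    """Output tensor shape of a conv layer: ((W - F + 2P)/S + 1) per side,
    with depth equal to the number of filters K."""
    return ((w - f + 2 * p) // s + 1, (h - f + 2 * p) // s + 1, k)

def relu(x):
    """Rectified linear unit: f(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

# A 32 x 32 input, 3 x 3 filters, stride 1, padding 1, 64 filters
# keeps the spatial size and expands the depth to 64.
print(conv_output_size(32, 32, 3, 1, 1, 64))  # (32, 32, 64)
print(relu(np.array([-2.0, 0.0, 3.0])))       # [0. 0. 3.]
```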
In Figure 9, we can see that the fully connected layer works like a normal neural network: the neurons in one layer connect with all neurons in the next layer. The features extracted by the convolution and pooling layers are fed to the fully connected layers to generate the final result. For a classification problem, the final fully connected layer uses the softmax activation function to give the output classification probability of each class. The basic structure of a CNN usually includes three main parts. The local receptive field is responsible for separating and filtering image information and selecting the image areas with the highest value. The shared weights and biases help minimize the number of parameters in the CNN. The pooling layer is the last part and has the effect of simplifying the output information.
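The two pooling variants and the final softmax step can be sketched as follows (a minimal NumPy illustration with a made-up 4 × 4 feature map):

```python
import numpy as np

def max_pool_2x2(m):
    """2 x 2 max pooling with stride 2 on a single H x W depth slice."""
    h, w = m.shape
    return m.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def softmax(z):
    """Softmax used by the final fully connected layer to output class probabilities."""
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

m = np.array([[1, 3, 2, 1],
              [4, 2, 0, 1],
              [5, 1, 2, 2],
              [0, 1, 3, 4]])
print(max_pool_2x2(m))                 # [[4 2] [5 4]] -- max of each 2x2 block
print(softmax(np.array([2.0, 1.0])))   # two probabilities summing to 1
```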
The LeNet-5 network [29,30] is the first CNN. It was used for image classification, specifically digit classification, and was adopted by several banks at the time to recognize handwritten digits on checks. The input of the network was a grayscale image with a resolution of 32 × 32 pixels. The network consists of seven layers (two (conv + max pooling) blocks and two fully connected layers, with the output being the probabilities of the softmax function). AlexNet has a similar architecture to LeNet with more layers, more filters per layer, and stacked convolutional layers [35]. The network consists of three building blocks, namely, convolution, max pooling, and dropout. They are combined with data augmentation techniques, the ReLU activation function for output nonlinearity, and the SGD optimization algorithm. ZFNet is a CNN with a top-5 error of 11.7%. This result was achieved by adjusting the hyperparameters of AlexNet [11,12] while keeping the constituent architecture similar to AlexNet, the difference being the filter size at each convolutional layer. GoogleNet/Inception [31] is a CNN developed by Google with a top-5 error of 6.67%. This architecture is inspired by LeNet and implemented with a new network building block. The training process uses batch normalization, image distortion, and the RMSprop optimization algorithm. The inception module is made up of small convolutions to minimize the number of network parameters. VGGNet [36] is ranked second with a top-5 error of 7.3% and includes 16 convolutional layers. Whereas LeNet and AlexNet use a conv-max-pooling architecture, VGG stacks convolutions in its middle and end stages. This leads to a longer computation time; however, more features are retained than when max pooling is applied after each convolution. The residual neural network (ResNet) [37] was developed by Microsoft. The network model has an error rate of 3.57%. It has a structure similar to VGG with many more layers, making the model deeper. This network is made up of residual blocks that help solve the problem of vanishing gradients, allowing networks hundreds of layers deep to be trained easily.

Figure 4: An example of a regular neural network architecture with an input layer, two hidden layers, and an output layer [29,30].
In addition to the typical network architectures mentioned above, many other CNN architectures have been researched, developed, and applied to other problems. Convolutional neural architectures are continually improving, both in the number of parameters and in accuracy, making them suitable for specific problems.

Proposing Face Mask Detection.
In the problem of masked face detection, we need to solve two subproblems. First, we detect the faces in the image/video. Second, we detect whether each face wears a mask.
During analysis and design, two approaches arise: (1) use a single-stage method in which one model performs detection and classification simultaneously, or (2) use a two-stage method with one model for detection and one for classification. In single-stage systems, face detection and classification are performed simultaneously by one model. If the model cannot detect faces, classification will not be performed, which leads to missed detections. For the two-stage method, we use two separate models for detection and classification. This method gives us more options: changing the model at each stage yields solutions tailored to different problems.
Much research has used two-stage methods. In our proposed method, the model used for classification is trained on our own data set. The data set is created to meet the conditions of application in Vietnam.
The initial requirement of the problem is to determine who is not wearing a mask in the input image/video. We use two separate models that perform two independent functions, respectively: Retina Face and MobileNetV2. Previous publicly available research has reported varying accuracy on different data sets. This is easy to understand since each data set contains images with different features. Therefore, it is necessary to choose a suitable data set to train the system's models for each place of application. While designing the system, we realized that using two separate models for detection and classification increases the accuracy of the system. However, the processing speed of the whole system decreases compared to using one model for both tasks. Considering the practicality of the system, this is not a hindrance. A real-time masked face detection system aimed at large public places such as airports, train stations, and shopping malls is not very useful: when violators are identified, it is difficult to immediately determine where they are and to warn or punish them. Therefore, our proposed system targets a more practical application, namely, masked face detection in offices and classrooms. These are places where people's identities are known. When the system runs, the supervisor can know exactly who is not wearing a mask and remind them. We can also let the system run by itself, save the face images of violators, and then aggregate the violations and apply penalties. Therefore, low processing speed is no longer an obstacle affecting the application of the system in practice.
We choose Retina Face to detect faces and MobileNetV2 to classify masked and nonmasked faces.
The proposed system is shown in Figure 10. Retina Face takes care of face detection, and the area containing the face (ROI) is cut out. MobileNetV2 receives the face ROI from the previous step, extracts features through many layers, and gives the final classification result as a masked or nonmasked face. The reasons we choose Retina Face for detection and MobileNetV2 for classification, along with details of the system, are presented in the next sections.
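The control flow of this two-stage pipeline can be sketched as follows. The two functions are placeholders standing in for Retina Face and MobileNetV2; all names and dummy return values are illustrative only:

```python
# Sketch of the two-stage pipeline: detect faces, then classify each ROI.

def detect_faces(frame):
    """Placeholder detector (stands in for Retina Face).
    Returns face ROIs as (x, y, w, h) boxes; the box here is a dummy."""
    return [(10, 10, 50, 50)]

def classify_mask(face_roi):
    """Placeholder classifier (stands in for MobileNetV2).
    Returns 'mask' or 'no_mask'; the label here is a dummy."""
    return "no_mask"

def process_frame(frame):
    """Stage 1: detect faces. Stage 2: classify each detected face ROI."""
    results = []
    for (x, y, w, h) in detect_faces(frame):
        roi = frame  # the real system would crop frame[y:y+h, x:x+w]
        results.append(((x, y, w, h), classify_mask(roi)))
    return results

print(process_frame(frame=None))
```

The advantage of this structure, as noted above, is that either stage can be swapped out independently.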

Face Detector with the Retina Face Model.
The authors [6] propose one of the first face detection models based on Cascade-CNN.
This method operates simultaneously on multiscale images and removes background areas from low-resolution images. MTCNN [7] is a three-stage face detection method. In the first stage, the P-Net CNN detects regions that are likely to contain faces, combined with the NMS algorithm. In the second stage, all image regions obtained in the first stage are fed into an R-Net CNN to refine them, removing areas with a low probability of containing faces and merging overlapping candidates. Finally, the O-Net locates the face and its key points. Models based on the region proposal network (RPN), which achieved many successes in object detection, have been applied to the face detection problem. In [38], the authors propose a supervised transformer network (STN) model for face detection. In the first stage, an RPN simultaneously predicts face regions along with facial landmarks. The predicted faces are then normalized by mapping the face landmarks to standard positions to better normalize the face samples. In the second stage, an R-CNN verifies valid faces. The authors of [39] apply the Faster R-CNN model to the face detection problem. They achieved the highest accuracy on two large data sets by training the Faster R-CNN model on the WIDER FACE data set [40]. This is a testament to the importance of data in building deep learning models. Unlike two-stage detection models such as R-CNN, the SSD model detects faces in one stage from the first layer. In [41], the authors proposed a single-stage headless (SSH) model achieving SOTA results on WIDER FACE, FDDB, and Pascal Faces. Instead of relying on image pyramids to detect faces at different scales, SSH simultaneously detects faces of different sizes from different layers during forward propagation. The authors of [42] propose the single shot scale-invariant face detector (S3FD) to better detect faces of different sizes. Small face detection is a common challenge for anchor-based models.
There are three main contributions in that study. First, a scale-equitable detection model is proposed that can detect faces of different scales. Second, the recall rate when detecting small faces is improved by the anchor matching strategy. Third, the false positive rate when detecting small faces is reduced through background labeling. The authors of [43] propose the Retina Face model, a very popular architecture in face detection.
The main contribution of that study is the manual labeling of five landmark points on the WIDER FACE data set, which contributes to increased accuracy. The authors of [44] propose the simple and efficient TinaFace model, using ResNet for feature extraction: six levels of FPN for multiscale feature extraction of the input images are followed by an inception block for enhancement. A major aim of that work is to demonstrate that there is no gap between face detection and generic object detection. The face detection accuracy statistics published in scientific papers on the WIDER FACE data set (WFD) are used to select a suitable model. The paper is intended for use in public places and therefore requires high accuracy from the face detection model. Based on Table 1, the Retina Face model achieves high accuracy on the WIDER FACE data set. Therefore, we choose Retina Face as the face detection model. The field of face detection has been studied for many years; one of the biggest challenges is detecting small, tilted, blurred, and partially obscured faces in real environments. Retina Face is a face detection model launched by InsightFace in May 2019 to address the above challenges. By manually assigning five landmark points on the WIDER FACE data set and using a multitask loss function, Retina Face was, at launch, the model with the highest accuracy on the WIDER FACE data set.
As shown in Figure 11, images processed by the Retina Face model go through five steps: (1) first, MobileNet or ResNet50 is used as the backbone feature network; (2) then the FPN (feature pyramid network) and SSH (single-stage headless) modules extract advanced features; (3) next, the Class Head, Box Head, and Landmark Head networks obtain prediction results from the features; (4) the prediction results are decoded; and (5) finally, duplicate detections are removed through NMS [41].
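Step (5), non-maximum suppression, is a standard greedy procedure and can be sketched as follows (a generic illustration, not the exact implementation used by Retina Face; box coordinates and scores are made up):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the highest-scoring box, drop any remaining box
    that overlaps it by more than `thresh`, and repeat."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        i = order.pop(0)
        keep.append(i)
        order = [j for j in order if iou(boxes[i], boxes[j]) <= thresh]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too heavily
```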
In practice, the model provides two types of backbone networks: MobileNet and ResNet. The Retina Face model uses the ResNet backbone to detect faces with high accuracy and uses the MobileNet backbone to detect faces at higher speed.
FPN: Detecting small faces is a challenge worth solving to improve accuracy. An FPN is a network designed around the pyramid concept to address this challenge. The FPN model (Figure 12) combines information flowing in the bottom-up direction with information flowing in the top-down direction to determine the face position, while other algorithms often use only the bottom-up direction. As face features transition upward from the bottom, the resolution decreases but the semantic value increases.
During reconstruction from the upper layers to the lower layers, we must consider the loss of information about faces. For example, a small face that disappears when transitioning to an upper layer cannot be reconstructed when the features are passed back down. To solve this problem, the model creates skip connections between the reconstruction layers and the feature maps, which helps the prediction of face locations suffer less from information loss. The features extracted from the FPN are fed into the single-stage headless (SSH) network to further extract important face features, as shown in Figure 13. The class head determines whether an anchor contains a face. The box head locates the face. The landmark head locates five key points on the face. For each anchor i, the loss is calculated according to the following formula:

L = L_cls(p_i, p_i*) + λ1 · p_i* · L_box(t_i, t_i*) + λ2 · p_i* · L_pts(l_i, l_i*).

Here, L_cls is a categorical cross-entropy loss function; p_i is the probability that anchor i is a face as predicted by the model; p_i* is the actual label of anchor i (p_i* = 1 when anchor i is a face and p_i* = 0 when it is not); L_box is the smooth-L1 loss of the face position, where t_i = (t_x, t_y, t_w, t_h) and t_i* = (t_x*, t_y*, t_w*, t_h*) are the face coordinates predicted by the model and the actual coordinates assigned by the annotator, respectively; and L_pts is the corresponding loss over the five landmark points l_i.

Figure 12: FPN network structure in Retina Face.
Since the system proposed in Figure 10 prioritizes high accuracy in the face detection step, the paper uses the Retina Face model with the ResNet backbone. This step is essential: if a face is not detected, the classification of the masked face cannot occur. Therefore, we select Retina Face (ResNet50).

Face Mask Classification with the MobileNetV2 Model.
In [47], the authors conduct a comprehensive experimental evaluation of several recent models for their performance on masked-face images. Fifteen models were trained and tested on the face mask label data set (FMLD), the biggest annotated face mask data set, with 63,072 face images. The results are shown in Table 2: the average classification accuracy of each model on the FMLD data set, together with the prediction speed and the size of the model on disk, arranged in descending order. The processing time is computed over all 12,688 face images, on a per-image basis, on an NVIDIA Titan Xp GPU. The prediction accuracy is reported for a bootstrapping protocol with 100 sampled test sets of 5,000 images.
From Table 2, it is easy to see that SqueezeNet v1.1 has the lightest size, the fastest speed, and an accuracy only 0.8% below the top model. We need a lightweight model with high speed and acceptable accuracy to compensate for the slower processing speed of the proposed system, which uses two models. However, there were several problems with SqueezeNet in our framework. Therefore, we choose MobileNetV2, the model whose weight and speed rank third.
This model works well on our device. The authors of [48] describe MobileNetV2 as improving the performance of models on multiple tasks and benchmarks, as well as across a spectrum of different model sizes. The great idea behind the MobileNet models is to replace expensive convolutional layers with depthwise separable convolutional blocks. Each block consists of a 3 × 3 depthwise convolutional layer that filters the input, followed by a 1 × 1 pointwise convolutional layer that combines the filtered values to create new features. This is much faster than regular convolution with approximately the same result.
The MobileNetV1 architecture started with a regular 3 × 3 convolution followed by 13 depthwise separable blocks. MobileNetV2 adds a 1 × 1 expansion layer in addition to the depthwise and pointwise convolutional layers. The pointwise convolutional layer of V2 reduces the high number of channels in the tensor, and the resulting block is called a bottleneck residual block. The 1 × 1 expansion convolutional layer expands the number of channels, depending on the expansion factor, before the data goes into the depthwise convolution. Each block of MobileNetV2 also has a residual connection, which exists to help the flow of gradients through the network. Each layer of MobileNetV2 has batch normalization and uses ReLU6 as the activation function. However, the output of the projection layer does not have an activation function. The full MobileNetV2 architecture consists of 17 bottleneck residual blocks in a row, followed by a regular 1 × 1 convolution [49], a global average pooling layer, and a classification layer, as shown in Table 3.
According to [48,50], a standard convolution takes an h_i × w_i × d_i input tensor L_i and applies a convolutional kernel K ∈ R^(k×k×d_i×d_j) to produce an h_i × w_i × d_j output tensor L_j, with a computational cost of h_i · w_i · d_i · d_j · k · k. Depthwise separable convolutions are a drop-in replacement for standard convolutional layers. They work almost as well as regular convolutions, but their cost is only h_i · w_i · d_i · (k² + d_j), which is the sum of the costs of the depthwise and 1 × 1 pointwise convolutions. Effectively, depthwise separable convolutions reduce the computation compared to standard convolutional layers by almost a factor of k². MobileNetV2 uses k = 3 (3 × 3 depthwise separable convolutions), so the computational cost is 8 to 9 times smaller than that of standard convolution with only a small reduction in accuracy.
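The cost comparison above can be verified numerically (the layer dimensions below are an arbitrary example, not taken from the paper):

```python
def standard_conv_cost(h, w, d_in, d_out, k):
    """Multiply-adds for a standard k x k convolution: h*w*d_in*d_out*k*k."""
    return h * w * d_in * d_out * k * k

def depthwise_separable_cost(h, w, d_in, d_out, k):
    """Depthwise (k*k per input channel) plus 1 x 1 pointwise convolution:
    h*w*d_in*(k^2 + d_out)."""
    return h * w * d_in * (k * k + d_out)

# Example layer: 112 x 112 spatial size, 32 input and 64 output channels, k = 3.
std = standard_conv_cost(112, 112, 32, 64, 3)
sep = depthwise_separable_cost(112, 112, 32, 64, 3)
print(std / sep)  # roughly 8x cheaper, approaching k^2 = 9 as d_out grows
```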
We built a customized fully connected head that contains four sequential layers on top of the MobileNetV2 model. The layers are as follows:
(1) An average pooling layer with a 7 × 7 pool size
(2) A linear layer with the ReLU activation function
(3) A dropout layer
(4) A linear layer with the softmax activation function producing two values
The final softmax layer outputs two probabilities, one for each of the classes "mask" and "no mask." The final classifier architecture is shown in Figure 14.
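A Keras sketch of this head (plus the Flatten needed between pooling and the dense layers) follows. The 128-unit hidden layer is inferred from the 164,226 additional parameters reported in Table 5, the 0.5 dropout rate is an assumption, and weights=None is used only so the sketch runs without downloading the ImageNet weights the paper actually starts from:

```python
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import (AveragePooling2D, Dense, Dropout,
                                     Flatten, Input)
from tensorflow.keras.models import Model

# Base network without its ImageNet classifier (include_top=False).
base = MobileNetV2(weights=None, include_top=False,
                   input_tensor=Input(shape=(224, 224, 3)))

# The four-layer head described above.
x = AveragePooling2D(pool_size=(7, 7))(base.output)  # 7x7x1280 -> 1x1x1280
x = Flatten()(x)
x = Dense(128, activation="relu")(x)   # hidden size inferred from Table 5
x = Dropout(0.5)(x)                    # dropout rate is an assumption
x = Dense(2, activation="softmax")(x)  # P("mask"), P("no mask")

model = Model(inputs=base.input, outputs=x)
```

With a 128-unit hidden layer, the head adds 1280 × 128 + 128 + 128 × 2 + 2 = 164,226 parameters, consistent with Table 5.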

Preprocessing
Data. In the proposed system, we create a new data set to train the MobileNetV2 model, as shown in Figure 15. The Retina Face detector is trained on the WIDER FACE training set, a face detection benchmark data set in which images are selected from the publicly available WIDER data set. It contains 32,203 images and labels 393,703 faces with a high degree of variability in scale, pose, and occlusion, as depicted in the sample images. This data set is large and diverse enough to cover many different face cases, so we do not intend to improve it further. Our main aim is to create a masked-face data set suitable for the classifier.
Since we built the data set ourselves, we faced difficulties with the training data, such as differing image resolutions and sizes, null values in the data set, unprocessed labels, and so on. We therefore preprocess the images in our data set before training.
A preprocessing step is applied to all raw input images to convert them into clean versions that can be fed to the deep learning neural network model. The preprocessing steps are performed as follows. The images in the data set are divided in a ratio of 6:2:2 into the training set, validation set, and test set. We resize images to 224 × 224 pixels and convert them to array format. They are then converted from BGR to RGB color channels, and the pixel intensities are scaled to the range [−1, 1]. We then use scikit-learn's one-hot encoding to generate a class label for each image: each output label vector is converted to a new form in which exactly one output equals "1," corresponding to the classification code of the input vector, and all other outputs equal "0." Finally, we convert the images into NumPy arrays. This step is used not only to preprocess the input data for training the model but also for the proposed system's input images/frames.
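The per-image steps above can be sketched in NumPy only (nearest-neighbour resize stands in for cv2.resize, and scikit-learn's one-hot encoder is replaced by a tiny helper; the mask = 0 / no mask = 1 label codes are illustrative):

```python
import numpy as np

def preprocess(bgr_image):
    """Resize to 224x224, swap BGR -> RGB, scale pixels to [-1, 1]."""
    h, w, _ = bgr_image.shape
    rows = np.arange(224) * h // 224          # nearest-neighbour row indices
    cols = np.arange(224) * w // 224          # nearest-neighbour column indices
    resized = bgr_image[rows][:, cols]        # (224, 224, 3)
    rgb = resized[:, :, ::-1]                 # BGR -> RGB channel swap
    return rgb.astype(np.float32) / 127.5 - 1.0   # [0, 255] -> [-1, 1]

def one_hot(labels, num_classes=2):
    """One-hot encode integer class labels, e.g. mask = 0, no mask = 1."""
    out = np.zeros((len(labels), num_classes), dtype=np.float32)
    out[np.arange(len(labels)), labels] = 1.0
    return out
```

The [−1, 1] scaling matches what MobileNetV2's standard input preprocessing expects, so the same function can serve both training images and webcam frames.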

Data Set.
There are many face mask data sets; several famous ones can be mentioned. The face mask label data set (FMLD) is the biggest annotated face mask data set, with 63,072 face images. The labeled faces in the wild (LFW) data set contains 13,233 images collected from 5,749 people. The real-world masked face data set (RMFD) is a large masked-face data set, including 5,000 images with masks and 90,000 images without masks from 525 different people, as shown in Figures 16 and 17.
However, none of these data sets are suitable for Vietnamese people. Although models trained on these data sets can still be applied in our application, they are not completely suitable. Therefore, we built the data set ourselves.
The authors [51] use a dlib-based face landmark detector to identify the face tilt and the six key facial features necessary for applying the mask. Based on the face tilt, the corresponding mask template is selected from the mask library. The template mask is then transformed based on the six key features to fit perfectly to the face. The complete block diagram can be seen below.
The system provides several masks to select from. Since it is difficult to collect mask data sets under various conditions, it can be used to convert any existing face data set into a masked-face data set. It identifies all faces within an image and applies the user-selected masks to them, considering various constraints such as face angle, mask fit, lighting conditions, and so on. A single image or an entire directory of images can be used as input to the code.
Inspired by [51], we applied it to create our masked data set. The mask-application pipeline takes the input data and proceeds as follows: (1) detecting face landmarks; (2) estimating the mask region based on key positions; (3) estimating the face tilt angle; (4) selecting the right template based on face tilt; (5) warping the mask based on key positions; and (6) overlaying the mask and adjusting brightness. We collected the Asian face age data set (AFAD) [52], including 164,432 well-labeled images of Asians. The faces in AFAD were passed through the MaskTheFace program to obtain different masked faces. In this paper, we use four types of masks that are popular in our country: surgical, cloth, N95, and K95. Since our model is trained on our laptop, the amount of data that can be used without running out of memory is 8,000 images: 5,000 masked-face images and 3,000 nonmasked-face images. Of the 5,000 masked-face images, 3,500 use a medical mask with the colors white, blue, gray, and black evenly divided; N95, K95, and cloth masks are each used in 500 images with a basic color. If 100% of the images were generated by the MaskTheFace simulator, the data set would be monotonous, lacking realism and diversity. Recognizing this problem, we took more than 1,500 images from the RMFRD and other data sets to increase the diversity of the data set, as shown in Figures 18 and 19.
Besides, since our data set (Figure 20) is not large enough to change all the weights of the model, we use the transfer learning method to train our model: fine-tuning. It keeps useful weights and changes the weights of only several layers of a pretrained model to conform to the target of the paper.

Setup.
In this paper, we use our laptop to build the proposed system. Our hardware and software consist of an Intel® Core™ i7-8750H processor, Ubuntu 18.04.6 LTS, an Nvidia GeForce GTX 1050Ti, CUDA 11.2 and cuDNN 8.1, TensorFlow 2.5.0, and Keras 2.4.3.
Since our data set is not large enough to change all the weights of the model, we use the transfer learning method to train our model: fine-tuning. It keeps useful pretrained weights frozen and updates the weights of only several layers of the pretrained model to conform to the target of the paper.
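A minimal Keras sketch of this freezing step (weights=None is used here only so the snippet runs without downloading the ImageNet weights the paper starts from; in fine-tuning, only the newly added classifier head, not shown here, stays trainable):

```python
from tensorflow.keras.applications import MobileNetV2

# Load the pretrained backbone without its classification top.
base = MobileNetV2(weights=None, include_top=False,
                   input_shape=(224, 224, 3))

# Freeze every backbone layer so its useful weights are kept as-is;
# gradient updates will only reach the layers added on top.
for layer in base.layers:
    layer.trainable = False
```

A common variant is to later unfreeze the last few backbone layers and continue training at a very low learning rate, which matches the "changes the weights of only several layers" description above.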

Evaluation Method.
When building a masked face detection system, an evaluation method is needed to measure how effective the system is, to draw conclusions, and to find ways to improve it. There are many evaluation approaches, and the suitable method depends on the problem. The following measures are used in our paper: ACC, TP, TN, FP, and FN. Mask is considered the positive class (P), and no mask is considered the negative class (N). Each image is considered a data point:
TP (true positive): an outcome where the model correctly predicts the positive class, i.e., the number of masked face images predicted as masked.
TN (true negative): an outcome where the model correctly predicts the negative class, i.e., the number of no-mask images predicted as not masked.
FP (false positive): an outcome where the model incorrectly predicts the positive class, i.e., the number of no-mask images predicted as masked.
FN (false negative): an outcome where the model incorrectly predicts the negative class, i.e., the number of masked face images predicted as not masked.
ACC (accuracy): the ratio between the number of correct predictions and the total number of data points, as follows: ACC = (TP + TN) / (TP + TN + FP + FN).

Results.
The whole process of training the MobileNetV2 model took 1 hour and 6 minutes, with a computation time per epoch of 75 seconds. This speed is fast since the amount of training data is not too large (8,000 face images). However, one of the important factors behind the fast training time with high accuracy is fine-tuning. This training method "inherits" a network trained on a very large data set of generic images to create a specialized model for a more specific task, in this problem distinguishing between wearing and not wearing masks. Thanks to that, training did not take us too long. The model was trained on an Nvidia GeForce GTX 1050Ti.
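The evaluation metrics defined in this section can be computed with a few lines of Python (the mask = 1 / no mask = 0 label encoding and the counts in the usage example are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN with mask (1) as the positive class
    and no mask (0) as the negative class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    # ACC = (TP + TN) / (TP + TN + FP + FN)
    return (tp + tn) / (tp + tn + fp + fn)
```

Keeping all four counts, rather than ACC alone, also makes the two failure modes visible: FP (no-mask faces flagged as masked) and FN (masked faces flagged as unmasked), which matter differently in an enforcement setting.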
To train the model, parameter selection is also essential. We set epochs = 40, batch size = 32, and learning rate = 0.00001. These are the parameters with which, after many tests, the trained model achieves the highest accuracy.
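As a quick sanity check (plain arithmetic, not code from the paper), these parameters are consistent with the training time reported above:

```python
# 8,000 images split 6:2:2 leaves 4,800 training images; with batch size 32
# that is 150 weight updates per epoch.
total_images = 8000
train_images = int(total_images * 0.6)        # 4,800 training images
batch_size, epochs = 32, 40
steps_per_epoch = train_images // batch_size  # 150 updates per epoch

# 40 epochs at 75 s/epoch gives 50 minutes of pure epoch time; the reported
# 1 h 6 min total also includes data loading and evaluation overhead.
seconds_per_epoch = 75
epoch_minutes = epochs * seconds_per_epoch / 60
```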

Table 4: Comparison of the proposed system with other methods.

Method                                        Accuracy (%)    Frame rate (fps)
YOLO version 4 [53]                           95.00           -
YOLO version 4 [54]                           98.00           49.5
MTCNN + SRNet [9]                             98.7            10
ResNet50 [55]                                 98.2            20
Proposed system, Retina Face + MobileNetV2    99.37           3

To evaluate the accuracy of the masked face detection method used in this paper, different methods are compared. The accuracy of our model is 99.37%. In Figure 21, it is easy to see that the model works very well with different mask types. When we try it on webcam video on our laptop, it works well at a rate of 3 fps. The results after the models are trained are shown in Figure 22.
We compared the accuracy and processing speed of the proposal with other systems [9]. The accuracy comparison is not fully intuitive, since the models are tested on different test data sets. Table 4 shows that the proposed system has impressive accuracy, although its speed is not high. This is understandable, since the proposed method uses two independent CNN models: the Retina Face model for face detection and MobileNetV2 for classification. However, this does not affect the practical applicability of our system. Instead of having the system run in real time, we can let it run automatically: when a face without a mask appears, the system crops and saves the image, which can later be used to find and sanction the violator. This is entirely possible, since the applicable locations are schools, offices, and so on, where people can be controlled. Table 5 shows the details of the added layers in terms of output shape and parameters. According to the table, the number of additional parameters is not large (164,226 parameters); combined with the MobileNetV2 model pretrained on ImageNet, the total is 3.4 million parameters [48]. Compared with current CNN models, this can be considered a model with very few parameters; therefore, the model has a fast processing speed in the classification stage. The processing speed of the whole system is nevertheless not high because of the face detection stage performed by Retina Face; however, we obtain high accuracy in the face detection step.
Besides, the method can be applied to other problems, for example, thermal imaging [56][57][58]. There, the ventilation hole of an angle grinder would play the role of the mask, and the algorithm would need a data set built for that system. Once a full data set is designed, the algorithm can reliably determine the exact ventilation hole.
Several experimental results of the masked face detection system are shown in Figures 22-24. As can be seen in Figures 22 and 24, the masked face detection proposed in this paper works well for different types of masks. In Figure 24, the system works excellently even when there are many faces in the image; the Retina Face detection model works fine, and the undetected faces are all faces occluded by more than 80%. Figure 23 shows the result of running our masked face detection system on webcam video.
In this section, we presented the existing face mask data sets and the one we built for the proposed system. The evaluation results were then presented and compared with other methods. Finally, we showed the output images.

Conclusion
The paper focuses on studying the use of CNNs to detect masked faces. It presented network models and deep learning algorithms for face detection and classification based on Asian faces. The system can detect and classify faces as wearing or not wearing a mask with up to 99.37% accuracy, using Retina Face detection with a ResNet50 backbone and face classification with MobileNetV2, at a frame rate of 3 fps. This rate is at an acceptable level and does not affect the practical application of the proposed system. In the context of preparing to normalize activities, the model is proposed for application in schools and offices with fixed identities, to ensure people follow the rule of wearing masks. Recording faces labeled as not wearing a mask can replace a real-time application.
The system achieves high accuracy. However, it is not optimal for blurred faces, since the data set is not large enough to train the model with more samples of blurred faces and complex face angles. Although the proposed system achieved quite positive results, there were still a few test cases that gave bad results, for example, when the face passed to the classifier is blurred due to camera lighting conditions or when the face tilt angle is too large. Therefore, we plan to extend the data set and train the classifier model of the system on more powerful devices in the future. Furthermore, we will improve the system to achieve a higher FPS processing speed while maintaining the same accuracy. We will focus the research on optimizing the execution time of the system so that it is suitable for embedded systems, which will increase its practicality in real applications.

Data Availability
There are many face mask data sets; several famous ones, such as the face mask label data set (FMLD), can be mentioned. The data sets are available in [31] and [32].

Conflicts of Interest
The authors declare that they have no conflicts of interest.