Facial Mask Detection Using Image Processing with Deep Learning



Introduction
COVID-19, a novel disease, posed a severe pandemic threat to the entire world. This pandemic started in China at the end of 2019 and is probably one of the world's most significant health challenges this century. By January 2022, over 300 million confirmed COVID-19 cases were recorded, with almost 5.47 million deaths worldwide. Scientists in some countries achieved vaccine breakthroughs in time.
The research is ongoing by thousands of scientists worldwide to better understand how new virus mutations and variants like alpha, beta, gamma, delta, and omicron affect the effectiveness of the different COVID-19 vaccines.
The COVID-19 pandemic has altered people's entire lifestyles. The patient has an acute lung infection for which there is currently no high-performance vaccine or treatment.
Deaths are rapidly increasing, putting strain on each nation's healthcare systems [1].
Face masks have become an essential part of human life, and the only protection is the use of face masks while interacting with people in public. In this challenging pandemic condition, numerous countries have mandated that citizens wear face masks when visiting any crowded public spot, such as shopping malls or places to take a meal.
Customer care service providers serve only those customers who wear a face mask correctly.
The World Health Organization (WHO) has released recommendations and guidelines for the general public and healthcare professionals who use face masks. Face masks, according to medical officers, give sufficient protection against respiratory diseases. Nurses and doctors widely use face masks as the central part of droplet safety measures and precautions. Some argue that wearing a face mask will not protect you from COVID-19 infection. However, there is a significant distinction between the absence of evidence and evidence of absence [2].
We present an accurate and efficient face mask detector algorithm in the present study. Our main task is to use the proposed algorithm to check whether or not people wear masks and stay away from public places. It is a classification or object detection problem with two classes: wearing a mask and not wearing a mask properly.
There are several methods to detect objects, but we chose faster RCNN because it is fast, simple, accurate, and precise. We need to develop a system that can detect faces in the real world and recognize whether or not the detected faces have masks.
The paper is organized as follows: Section 2 presents the literature review, and Section 3 describes the method. Section 4 is reserved for the results achieved in the present study, and the conclusion is given in Section 5.

Literature Review
Presently, the issue is connected to the general recognition of objects using deep learning and the detection of object classes [3]. A few researchers in the literature have detected facial masks based on image analysis. Detection of the face or mask is one of the classes or groups of objects [4]. Detectors that depend on deep learning structures rather than handcrafted features have, in recent years, shown outstanding performance due to their exceptional robustness and feature extraction capability. Face and object detection applications are utilized in education, surveillance, autonomous driving, and several other fields [5,6]. A surveillance camera's image processing technology could detect a person's face when not wearing a face mask.
Schneiderman and Kanade [7] proposed face detectors with characteristics shaped by a series of feature vectors trained using a view-based approach. The structure has been shown to enhance profile face detection accuracy. A detailed collection of Haar-like features, with rectangular features rotated by 45 degrees, was proposed by Lienhart and Maydt [8]. They added further Haar-like elements in which a flexible span spatially separates the rectangles. Two separate neural networks were used to detect faces with in-plane rotations, as suggested by Torralba et al. [9]. Hotta [10] showed a local kernel-based support vector machine (SVM) method for face recognition, which was superior to global kernel-based SVM in recognizing occluded frontal faces. Felzenszwalb and Huttenlocher [11] presented a deformable design incorporating several object components, following Fischler and Elschlager's pictorial structure representation. Lin and Liu [12] proposed that the multiview face detector be learned as a single cascade classifier. They built MBH Boost, a multiclass boosting algorithm, distributing features among several classes.
Goldmann et al. [13] used a trained detector divided into subclassifiers connected to several predefined image regions. The inputs of the subclassifiers were fused, resulting in an updated Viola-Jones detection algorithm. In multiview face recognition, where all the face detectors tuned to different views have to be evaluated for each scan window, Yang et al. [14] used the first few cascade levels of all face detectors to estimate the pose for expediency. A quick bounding box estimation method for face recognition proposed by Subburaman predicts the bounding box using a small patch-based local search [15]. Mesphil et al. [16] proposed convolutional networks.
These were neural networks with at least one layer that uses convolution instead of general matrix multiplication. Zhu and Ramanan [17] presented an idea to use a deformable parts-based template to jointly detect a face, estimate its pose, and localize facial landmarks in the wild, which was later improved to combine the landmark estimation and face detection tasks in a shared supervised way to enhance face recognition through unique landmark detections.
Yang et al. [18] explored channel features that perform well for recognizing faces. Despite the ease of using these techniques for unconstrained face recognition, the accuracy rate is still inadequate, particularly when the detector must keep false alarms low. Girshick et al. [19] proposed RCNN, a convolutional neural network that performs a selective search to identify candidate regions containing objects.
The system aims to identify healthcare workers who are not wearing their surgical masks in the operating room. Ren et al. [20] have developed a real-time face recognition and monitoring technique.
Farfade et al. [21] have developed a deep learning-based face detection technique known as deep dense face recognition technology. The method does not require any annotation of landmarks or poses, and it can detect faces in a wide variety of orientations with only one model. Zhu and Ramanan [22] gave a method for dealing with occlusions and arbitrary pose changes in direct face detection. There is a new feature called normalized pixel difference. Redmon and Farhadi [23] used machine learning approaches to create a hybrid deep transfer learning model for face mask identification. Dong et al. [24] have proposed a deep cascaded region detection that uses bounding box regression, a localization method, to detect candidate faces.
Sun et al. [25] enhanced the faster RCNN technique for deep learning-based face detection. They utilized numerous strategies, including a pretraining model, multiband training, passive extraction, accurate calibration of primary parameters, and job grouping. Zhao et al. [3] proposed a device for monitoring the presence or absence of surgical masks in the operating room. In semantic image segmentation for facial mask detection, gradient descent is used for training, while multiple linear regression cross-entropy is used for neural networks. Ejaz et al. [26] built a new method for detecting the existence of a face mask. They classified three different types of face mask usage: proper face mask-wearing, wrong face mask-wearing, and no mask.
Mathematical Problems in Engineering
The two most well-known classes, two-stage detectors and single-stage detectors, have recently been used [27]. Fan and Jiang [28] suggested the inception network, which helps to find out which kernel combination is the best. The Residual Network (ResNet) trains even deeper neural networks by learning an identity function from the preceding stage.
Most recently, the RetinaFaceMask one-stage detector approach has been studied by Fan and Jiang [28]. They fused high-level semantic information using multiple feature maps, as in a Feature Pyramid Network (FPN). The proposed algorithm rejects predictions with low confidence and those with a high intersection over union.
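The rejection step described above can be illustrated with a minimal sketch of score thresholding followed by greedy non-maximum suppression. This is not the RetinaFaceMask implementation; the function names, the (x1, y1, x2, y2) box layout, and the threshold values 0.5 and 0.4 are assumptions chosen for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_detections(boxes, scores, score_thresh=0.5, iou_thresh=0.4):
    """Return indices of detections kept after both rejection rules.

    Thresholds are illustrative assumptions, not values from the paper.
    """
    # Rule 1: drop low-confidence predictions, then sort by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    # Rule 2: drop a box if it overlaps an already kept box too strongly.
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = filter_detections(boxes, scores)
```

In this toy input, the second box overlaps the first too strongly and the fourth has a score below the threshold, so only the first and third boxes remain.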
The deep learning-based approaches have an inherently high degree of accuracy as compared to the other machine learning-based techniques, especially for classification and clustering. The multilayer structure in the network helps to process different tasks at different layers exclusively.

Deep Neural Network
Algorithm. A DNN algorithm moves data through a sequence of "layers" of neural network models, with each layer passing a simplified summary of the data to the next layer. Several computer vision algorithms work well on datasets with a few hundred features or columns. An unstructured dataset, such as one extracted from an image, on the other hand, contains so many features that this method becomes inefficient or impossible. Traditional machine learning algorithms cannot handle the 2.4 million features in a single 800 × 1000 pixel RGB color image.
As the image passes through each neural network layer, DNN algorithms learn more about it. Initial layers learn how to detect low-level features such as edges, and later layers incorporate these features into a more comprehensive representation. For instance, a middle layer might combine edges to detect parts of an object in an image, such as a leg or a branch, while a deep layer might identify the entire object, such as a dog or a tree. Data from observations are gathered and fed into a single layer. The layer produces an output, which becomes the input for the following layer, and so on. This loops until the final output signal is received [29].
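As a minimal sketch of the layer-to-layer flow described above, the following plain Python passes an input vector through a stack of dense layers, each output feeding the next layer. The ReLU activation, the toy weights, and all function names are assumptions for illustration; the first line also reproduces the 2.4 million feature count for an 800 × 1000 RGB image.

```python
# Number of raw input features in an 800 x 1000 RGB image (3 channels).
num_features = 800 * 1000 * 3


def relu(values):
    """Element-wise rectified linear activation."""
    return [max(0.0, v) for v in values]


def dense(x, weights, bias):
    """One fully connected layer: each output is a weighted sum plus a bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def forward(x, layers):
    """Feed x through every layer; each output becomes the next input."""
    for weights, bias in layers:
        x = relu(dense(x, weights, bias))
    return x


# Two toy layers with hand-picked weights (illustrative only).
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [-1.0]),
]
output = forward([2.0, 1.0], layers)
```

A real detector replaces these dense layers with convolutional ones, but the chaining of layer outputs to layer inputs is the same.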

Types of Algorithms.
There are several different types of feature extraction algorithms, which can be classified into two categories.

Algorithms That Rely on Classification.
The regions of interest are chosen in the first stage. After that, convolutional neural networks (CNN) are used to categorize the selected regions. Since prediction must be run for each selected region, this solution may be slow. This group includes algorithms such as fast RCNN and faster RCNN, which are enhanced variants of the region-based Convolutional Neural Network (RCNN) [30].

Algorithms That Rely on Regression.
In contrast to the previous approach, algorithms in this category predict the class probability and define the bounding boxes surrounding the object of interest in a single pass over the entire image. This group includes algorithms like You Only Look Once (YOLO) and Single Shot Multibox Detector (SSD) [30]. Deep learning and computer vision are used in various applications such as object detection, medical image analysis, and action recognition [31,32]. Recent research is focused on the use of mid-level features and deep learning models to build robust decision support systems and IoT applications [33][34][35].

Faster RCNN.
In object classification and recognition, a deep learning technique known as region of interest pooling is gaining much attention. Detecting objects in an image scene containing several things is one example. The goal is to extract fixed-size feature maps using max pooling on the entire picture, as reflected in Figure 1. The object detection technique used by faster RCNN is divided into three stages.

Region Proposal Network.
Finding the areas of the given input image where an object may be found is straightforward. The position of an entity in an image can be determined. The area where an object may be found is enclosed by the Region of Interest (ROI).

Classification.
The next step is to assign corresponding classes to the regions of interest defined in the previous step. Here, the CNN approach is used (Figure 2). The proposed approach includes a detailed process for identifying all areas of object location in an image. If no regions are found in the first stage of the algorithm, there is no need to move on to the second step. In 2015, Girshick [36] proposed the Region Proposal Network (RPN) and ROI pooling as a DLA-based object detection solution. ROI pooling can achieve speed and usability for both training and testing. The ROI layers take a feature map as input, which is the output of a convolutional neural network with multiple convolution layers and max-pooling layers.
An N × N matrix is generated by dividing the feature map space into regions of interest. The ROI is denoted by the letter N. The first column represents the image's index. In contrast, the second column, which ranges from the upper left-most coordinate to the bottom-most coordinate, represents the ROI coordinates. Region proposal refers to the determined region of interest. The system divides the region proposal's entire area into equal-sized sections. The number of sections into which the whole region proposal is divided must equal the output dimension. The maximum value of each subregion is computed. The maximum values are copied to the output buffer.
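The pooling procedure above (divide the region proposal into as many sections as the output dimension, take the maximum of each section, and copy it to the output) can be sketched in plain Python as follows. The function name, the (x1, y1, x2, y2) ROI layout with exclusive end coordinates, and the bin-boundary rounding are assumptions for illustration, not the authors' implementation.

```python
def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map to a fixed out_h x out_w grid.

    feature_map: list of rows (each a list of numbers).
    roi: (x1, y1, x2, y2) in feature-map cells, x2 and y2 exclusive (assumed).
    """
    x1, y1, x2, y2 = roi
    height, width = y2 - y1, x2 - x1
    output = []
    for i in range(out_h):
        # Floor for the start and ceiling for the end keeps every bin non-empty.
        row_start = y1 + (i * height) // out_h
        row_end = y1 + -(-((i + 1) * height) // out_h)
        pooled_row = []
        for j in range(out_w):
            col_start = x1 + (j * width) // out_w
            col_end = x1 + -(-((j + 1) * width) // out_w)
            # The maximum of each section is copied to the output buffer.
            pooled_row.append(max(
                feature_map[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ))
        output.append(pooled_row)
    return output


feature_map = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
full = roi_max_pool(feature_map, (0, 0, 4, 4), 2, 2)
partial = roi_max_pool(feature_map, (1, 1, 4, 4), 2, 2)
```

Whatever the size of the region proposal, the output always has the same fixed shape, which is what allows the following layers to have a fixed input size.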
In RPN, the image is first fed into the convolutional neural network.
The input image is passed through a series of convolution layers before being sent to the final layer, which creates feature maps. Every portion of the feature maps is covered by a sliding window.
A typical mask size is used for the sliding window.
The anchors for each sliding window are created. Let (x_c, y_c) be the exact center of these anchors. However, the aspect ratios and scaling factors of the anchors produced will differ.
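A minimal sketch of anchor generation at one sliding-window position is given below: every anchor shares the same center but differs in scale and aspect ratio. The scales (128, 256, 512) and ratios (0.5, 1, 2) are assumptions, since the text does not state the values used, and the function name is hypothetical.

```python
import math


def generate_anchors(x_c, y_c, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors as (x_c, y_c, width, height) tuples.

    Each ratio is height / width, and each anchor keeps an area of about
    scale * scale. Scales and ratios are illustrative assumptions.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((x_c, y_c, width, height))
    return anchors


anchors = generate_anchors(100, 50)
```

With three scales and three ratios, each sliding-window position yields nine anchors.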
In addition, a value q is determined for each of these anchors, representing the likelihood of the anchor overlapping the region boundary surrounding the objects. A region boundary with loc coordinates is the regressor's output.
The classifier shows whether the area contains an object or not by a probability of 0 or 1. Here, w_a, h_a, x_a, y_a are the width, height, and center of the anchor, and w*, h*, x*, y* are the width, height, and center of the ground truth bounding box. The loss function is established over the outputs of the classification and regression networks.
Finally, the final features of size 3 × 3 are extracted and fed into the networks for regression and classification.
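The regression targets that relate an anchor (x_a, y_a, w_a, h_a) to a ground truth box (x*, y*, w*, h*) are not spelled out in the text. The sketch below assumes the parameterization commonly used with faster RCNN (center offsets scaled by the anchor size, and log ratios for width and height); the function names are hypothetical.

```python
import math


def encode_box(anchor, ground_truth):
    """Regression targets (t_x, t_y, t_w, t_h) for one anchor.

    Both boxes are (x_center, y_center, width, height). The formulas are
    the commonly used faster RCNN parameterization, assumed here.
    """
    x_a, y_a, w_a, h_a = anchor
    x_s, y_s, w_s, h_s = ground_truth
    return (
        (x_s - x_a) / w_a,
        (y_s - y_a) / h_a,
        math.log(w_s / w_a),
        math.log(h_s / h_a),
    )


def decode_box(anchor, targets):
    """Invert encode_box: recover a box from an anchor and its targets."""
    x_a, y_a, w_a, h_a = anchor
    t_x, t_y, t_w, t_h = targets
    return (
        x_a + t_x * w_a,
        y_a + t_y * h_a,
        w_a * math.exp(t_w),
        h_a * math.exp(t_h),
    )


anchor = (50.0, 50.0, 20.0, 40.0)
ground_truth = (55.0, 46.0, 30.0, 36.0)
targets = encode_box(anchor, ground_truth)
recovered = decode_box(anchor, targets)
```

Under this assumption, an anchor that coincides with the ground truth box has all four targets equal to zero, and the regression loss measures how far the predicted targets are from the encoded ones.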
Follow these steps to build your object detection classifier.

Inception V2 of Faster RCNN.
For object detection, the faster RCNN network is a single, unified system. It employs the region proposal network (RPN) module, which directs the unified network where to look. Inception, on the other hand, is a 22-layer network built from inception modules, with no fully connected layers. The main advantage of this model is that it allows better use of the computational resources available in the network. The inception module functions as a network within a network, with modules stacked on top of one another. It has 5 million parameters, which is a factor of 12 fewer than AlexNet. The combination of faster RCNN and inception V2 is computationally expensive, but the results in object detection are more reliable [37].

Compute Unified Device Architecture (CUDA).
CUDA is an NVIDIA technology that can perform a variety of challenging computations on the GPU. CUDA uses kernels that are executed n times by n threads, and a unique number marks each thread. CUDA's architecture comprises grids that are subdivided into smaller units called blocks. The hardware, which has a group of multiprocessors, assigns each block to a multiprocessor. Finally, threads make up blocks. These smallest units can be synchronized with one another within a single block [38].
In general, a CUDA program begins with device memory allocation while data on the host is being prepared. The data is then moved from the host to the device. Since copying data between the host and the device takes time, it is essential to restrict the amount of data sent. It is possible to launch kernels after the data on the device has been prepared. The results are copied back to the host after the calculation is completed. Finally, the results can be viewed, and the reserved memory can be freed.
The GPU implementation resembles that of a multithreaded CPU program. The concept remains the same. We only copy to the device what is required, such as integral images, trained classifiers, and detection windows. A CUDA kernel is run for each detection window size. The program's first version calculates the locations of detection windows on the host. A set of identically sized windows is computed. Then it is sent to the device, where the window detection process begins. The findings are sent back to the host, and information about new detection windows is prepared after the last detection window in the kernel is checked. This procedure is repeated until the scale reaches the desired value. The transmission between the host and the device is slow, which suggested a minor change. For the next iteration, only the count of detection windows and their size is computed on the host side.
This data is transmitted to the device. The detection windows are all the same size, so the location of the current detection window can be calculated on the device from the data obtained from the host and the thread index. This adjustment resulted in a 15-fold increase in detection speed [38].
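The index arithmetic behind this optimization can be sketched in plain Python as a stand-in for the CUDA kernel: the host sends only the window size and the window count, and each thread derives its own window position from its index. The row-major layout, the step parameter, and the function names are assumptions for illustration and are not taken from [38].

```python
def window_count(image_w, image_h, window, step):
    """Number of detection windows per row and in total (host side)."""
    per_row = (image_w - window) // step + 1
    per_col = (image_h - window) // step + 1
    return per_row, per_row * per_col


def window_location(thread_index, per_row, step):
    """Top-left corner of the window handled by one thread (device side).

    A row-major assignment of threads to windows is assumed.
    """
    x = (thread_index % per_row) * step
    y = (thread_index // per_row) * step
    return x, y


per_row, total = window_count(image_w=10, image_h=8, window=4, step=2)
locations = [window_location(i, per_row, 2) for i in range(total)]
```

In a real kernel, thread_index would be derived from the block and thread identifiers, and no list of window positions would have to cross the host-device boundary.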

CUDA Deep Neural Network (cuDNN). cuDNN is a GPU-based deep learning library created by NVIDIA.
cuDNN is used by many machine learning frameworks, including Caffe, TensorFlow, and Chainer, to boost performance. In this study, we assume that the program is written in C++ and that it calls cuDNN and CUDA library functions directly. For CNN computation, cuDNN includes several library functions. The cudnnConvolutionForward function, for example, performs convolution, the cudnnAddTensor function adds biases, and the cudnnActivationForward function applies the activation. cuDNN functions may only use data from GPU memory for input and output. To use cuDNN, all data, such as feature maps and weight filters, must first be loaded into GPU memory. The CUDA and cuDNN versions must be compatible with the TensorFlow version in use. In practice, this requires little attention because Anaconda will install the required versions of CUDA and cuDNN for the TensorFlow version being used [39].

Dataset.
There were a total of 3694 photos in the dataset. The images were split so that about 20% (730 images) were saved in the test folder and about 80% (2964 images) in the train folder, as displayed in Table 1.
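A train/test split of this size can be sketched as follows. The file names, the fixed random seed, and the function name are hypothetical; only the counts (3694 images, 730 for testing, 2964 for training) come from the text.

```python
import random


def split_dataset(filenames, n_test, seed=0):
    """Shuffle the file names and return (train, test) lists.

    The seed is an arbitrary assumption that makes the split repeatable.
    """
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical file names standing in for the real images.
all_images = [f"image_{i:04d}.jpg" for i in range(3694)]
train_files, test_files = split_dataset(all_images, n_test=730)
```

Shuffling before splitting keeps both folders representative of the whole dataset, provided the images are not ordered by class or scene.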

Generate Training Data.
TensorFlow needs hundreds of images of an object to train a successful detection classifier. For a robust classifier, the training images should include random items alongside the target objects, as well as several backgrounds and lighting conditions. Some photographs should have the target object partly blurred, overlapped with something else, or just halfway visible.
After we have collected all of the images, it is time to label the items in each one (Figure 3). The tool "LabelImg" is used to label the files. LabelImg saves each image's label data in .xml format. These .xml files will be used to generate TFRecords, a TensorFlow input. Once each image has been labeled and saved, there will be one .xml file for each image in the test and train directories. Now that the images have been labeled, it is time to make the TFRecords that will serve as input data for the TensorFlow training model. The image .xml data will be used to create .csv files containing all the data for the train and test images.
Training runs for about 40,000 steps, or about two hours, or until the loss is consistently less than 0.05, depending on the CPU and GPU. The algorithm is similar to the RCNN algorithm. Since the convolutional neural network does not have to be fed 2000 region proposals every time, "fast RCNN" is faster than RCNN. Instead, the convolution operation, performed only once per image, produces a feature map. TensorBoard assists in understanding the interdependencies between operations, how weights are calculated, displaying the loss function, and much more. Combined, these pieces of knowledge make a powerful tool for debugging and improving the model. A vital graph is the loss graph, which depicts the classifier's overall loss over time.
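The .xml to .csv step can be sketched with the Python standard library, assuming the annotations were saved by LabelImg in its Pascal VOC layout (a filename element, a size element, and one object element with a bndbox per labeled item). The column names, the sample annotation, and the class names with_mask and without_mask are assumptions for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET

CSV_HEADER = ["filename", "width", "height", "class",
              "xmin", "ymin", "xmax", "ymax"]


def xml_to_rows(xml_text):
    """Return one CSV row per labeled object in a Pascal VOC annotation."""
    root = ET.fromstring(xml_text)
    filename = root.findtext("filename")
    width = int(root.findtext("size/width"))
    height = int(root.findtext("size/height"))
    rows = []
    for obj in root.findall("object"):
        box = obj.find("bndbox")
        rows.append([
            filename, width, height, obj.findtext("name"),
            int(box.findtext("xmin")), int(box.findtext("ymin")),
            int(box.findtext("xmax")), int(box.findtext("ymax")),
        ])
    return rows


def rows_to_csv(rows):
    """Serialize the rows, with a header line, as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(CSV_HEADER)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical annotation with two labeled faces.
SAMPLE_XML = """<annotation>
  <filename>image_0001.jpg</filename>
  <size><width>640</width><height>480</height><depth>3</depth></size>
  <object><name>with_mask</name>
    <bndbox><xmin>10</xmin><ymin>20</ymin><xmax>110</xmax><ymax>140</ymax></bndbox>
  </object>
  <object><name>without_mask</name>
    <bndbox><xmin>300</xmin><ymin>50</ymin><xmax>380</xmax><ymax>150</ymax></bndbox>
  </object>
</annotation>"""

rows = xml_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
```

In practice, the same function would be applied to every .xml file in the train and test directories, producing one .csv file per directory.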

TensorBoard Losses Graph.
While the curve continues to get closer to zero as time goes by, it will never reach that point because no model is perfect. A total loss value of less than 2.5 is generally considered reliable, but it also indicates that the model may be improved by adjusting parameters or having a better dataset. After 200,000 epochs, the total loss is 0.0503. It tends to decrease. However, the mAP (mean average precision) does not increase as the loss decreases. When the number of epochs is 200,000, the mAP is 0.0502. The whole training process is reflected below through key graphs acquired after training. The objectness loss was found to approach zero asymptotically after 120 k epochs (Figure 4). The classification loss was notably substantial during the early phase of training but ultimately comes close to zero asymptotically after 100 k epochs (Figure 5). Similarly, consistent asymptotic behavior is reflected in other

Python Shell.
Open the Anaconda command prompt and type "idle" (with the virtual environment selected) followed by ENTER to run any scripts. This will start IDLE, allowing us to open and run the scripts. Image Object Detection with the TensorFlow classifier opens from the Python Shell, after which Run Module was selected.

Output Results.
In this research, faster RCNN proved to be efficient and precise, providing 97% accurate results, and its processing time for the whole process is less than that of other traditional techniques. This study proposed to decide which person is wearing the mask correctly using DNN. After training our model for 200,000 epochs, the total loss is 0.0503, which tends to decrease. However, the mean average precision does not increase as the loss decreases. When the number of epochs is 200,000, the mean average accuracy is 0.0502. Sample results are shown in Figures 10 and 11.

Conclusion
In this article, we introduced a reliable DNN-based system for identifying people wearing masks. Faster RCNN was employed to train on the data in this method, resulting in high accuracy. This model is trained on a GPU to obtain a low computational cost. To achieve our goals, we used a multiphase detection model: first, to label the face mask, and second, to detect the edge and compute the edge projection for the chosen face region within the face mask. The current findings revealed that faster RCNN was efficient and precise, giving 97% accuracy. The overall loss after 200,000 epochs is 0.0503, with a trend to decrease. While the loss is decreasing, we are getting more accurate results. As a result, the faster RCNN technique effectively identifies whether a person is wearing a face mask or not, and the training period was decreased with better accuracy. In the future, a Deep Neural Network (DNN) might be used first to train on the data and then compress the dimensions of the input to run it on low-powered devices, resulting in a lower computational cost. The results achieved with the proposed approach reflect significant accuracy as compared to the other commonly used approaches, as shown in Table 2.

Data Availability
No such private data were used to support the findings of the study.Only publicly available data has been used.

3.7.
Type the following in the Anaconda command prompt. The train.record and test.record files are the two files used to train the current object detection classifier. The label map maps class names to class ID numbers, telling the trainer what each object is. In a text editor, create a new file named labelmap.pbtxt and save it in the training folder. The numbers in the label map IDs must match those in the generate_tfrecord.py script. Last but not least, the object detection training pipeline must be set up. It determines the model and parameters that will be used for training. This is the final step before starting training. The faster_rcnn_inception_v2_pets.config file has been modified by adding the file paths to the training data and changing the number of classes and examples. Save the file after you have made your changes. The training job has been set up and is ready to begin. Execute the Training. If all is set up correctly, TensorFlow will begin the training. The initialization process will take up to 30 seconds before the actual training begins. The loss is recorded at each step of training. It will begin high and progressively decrease as training progresses. Our faster RCNN inception V2 model training started at a loss of around 3.0, which quickly dropped below 0.8. Allow the model to train.
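As a sketch of what the label map contains, the following Python builds the text of a labelmap.pbtxt file in the item/id/name layout read by the TensorFlow object detection tooling. The class names with_mask and without_mask and the function name are assumptions; the IDs start at 1 and must match the ones used when the TFRecords were generated.

```python
def build_label_map(class_names):
    """Return label map text with one item per class, IDs starting at 1."""
    items = []
    for class_id, name in enumerate(class_names, start=1):
        items.append(
            "item {\n"
            f"  id: {class_id}\n"
            f"  name: '{name}'\n"
            "}\n"
        )
    return "\n".join(items)


# Hypothetical class names for the two-class mask problem.
label_map_text = build_label_map(["with_mask", "without_mask"])
```

Writing label_map_text to labelmap.pbtxt in the training folder gives the trainer the mapping from each class name to its ID.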

4.1.
Training Results. TensorBoard keeps track of the training operation until it is completed. Using the tool made it possible to decide whether the model is ready for deployment or requires additional training or other modifications. It was possible to visualize the model's learning curve using graphs such as total loss. For example, if the error rate remains high and constant over time, the training should be terminated, and either the model's configuration or the data itself should be updated and corrected.
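The two decisions described above (the model is ready, or training is stuck at a high and constant error) can be sketched as simple checks on the recorded loss values. The window lengths, the tolerance, and the floor value are assumptions for illustration; the 0.05 threshold follows the stopping criterion mentioned in the training description.

```python
def loss_consistently_below(losses, threshold=0.05, window=5):
    """True if the last `window` recorded losses are all under the threshold."""
    if len(losses) < window:
        return False
    return all(value < threshold for value in losses[-window:])


def loss_is_stuck(losses, window=5, tolerance=0.05, floor=1.0):
    """True if recent losses stay high (above floor) and nearly constant."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return min(recent) > floor and max(recent) - min(recent) < tolerance


converged_run = [3.0, 0.8, 0.2, 0.04, 0.03, 0.04, 0.03, 0.02]
stuck_run = [3.0, 3.01, 2.99, 3.0, 3.02, 3.0]
```

A run that passes the first check can be stopped and evaluated, while a run that passes the second should be terminated so that the configuration or the data can be corrected.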

4.2.
TensorBoard. Through TensorBoard, we will see how the training has progressed. TensorBoard is responsible for the visualization graphs. One of the essential graphs is the losses graph, which depicts the cumulative losses of the trained model over time during training. For a GPU-enabled OS, model training took three days. For our faster RCNN inception V2 Coco API training, the loss of the neural network (net) started at five and quickly fell below. TensorBoard is the user interface for visualizing the graph and other resources for debugging, optimizing, and understanding the model. The number of epochs is represented by the x-axis, while the y-axis represents the time. The recognition rate is calculated in real time as part of our model training. After 200 k epochs, we can see that we have achieved our desired accuracy. The model's accuracy was improved by data or image augmentation. The panel has several tabs, each corresponding to the level of data you enter while running the model. Scalars: show a variety of valuable data during model training. Graphs: display the model. Histogram: show the weights as a histogram. Distribution: show how the weights are distributed. Projector: show Principal Component Analysis and the t-SNE algorithm, techniques used for dimensionality reduction.

Table 1: Facial dataset of people with/without wearing a mask.

Table 2: Comparison of accuracy achieved through different approaches.