Face Detection and Recognition Based on Visual Attention Mechanism Guidance Model in Unrestricted Posture

The performance of face detection and recognition degrades under occlusion, which often leads to missed detections. To reduce the loss of recognition accuracy caused by facial occlusion and to improve face detection accuracy, this paper proposes a visual attention mechanism guidance model. The model uses visual attention to highlight the visible area of the occluded face; the face detection problem is simplified into a high-level semantic feature detection problem through an improved analytical network, and the location and scale of the face are predicted from the activation map, which avoids additional parameter settings. Extensive simulation experiments show that the proposed method outperforms the comparison algorithms in the accuracy of occluded face detection and recognition on the face databases. In addition, the proposed method achieves a better balance between detection accuracy and speed, so it can be used in the field of security surveillance.


Introduction
There are still challenging problems in face detection and recognition technology, mainly due to the nonrigid features of faces and the influence of complex backgrounds [1][2][3]. Traditional face detection algorithms mostly use semisupervised learning methods. Because traditional methods need different hand-designed features for different tasks, such as grayscale, contour, and HOG features, and these features are easily affected by the imaging angle, their generalization ability is poor. Object occlusion also leads to missed detections, which reduces the accuracy of the detector. Therefore, it is of great practical significance to study the occlusion problem in face detection and recognition [4].
The face detection and recognition model based on machine learning is a popular research direction in computer vision [5]. Extracting features directly from the detection area and then classifying them with machine learning algorithms can improve classification accuracy to a certain extent, but the representational power of the features directly affects the recognition accuracy of the system [6]. Compared with detection and recognition algorithms built on shallow learning models such as boosting, decision trees, and neural networks, deep learning represented by convolutional neural networks implements deep nonlinear network structures through operations such as local receptive fields and weight sharing. This hierarchical strategy can learn the most essential feature representation in the data set [7]. At present, mainstream deep learning-based face detectors usually adopt a two-stage network structure, which is divided into face detection and face recognition.
Most convolutional neural networks use a classification loss function to measure the difference between the predicted value and the actual value, and they classify images by training the network to enlarge the distance between different types of images. Wang et al. [8] used 3-dimensional face information as a feature and improved the robustness and accuracy of the algorithm by training on a large amount of data. Corrow et al. [9] used DeepID for face recognition by partitioning the face into parts, extracting features from each part separately, and then applying the Bayes algorithm to the features to obtain the face feature information, which effectively improves recognition accuracy. However, none of these algorithms solves the recognition problem under unconstrained conditions. Therefore, how to increase the interclass distance while reducing the intraclass distance during recognition is an important topic in face recognition. Abbad et al. [10] fed the loss back during training by adding a loss verification method and used positive samples to reduce the distance between the classes, but this method depends heavily on the samples. Madhavan and Kumar [11] proposed a ternary (triplet) loss algorithm that organizes the training data into triples; each triple contains a positive sample, a negative sample, and an anchor sample, which can effectively reduce the intraclass distance.
Although the above methods can solve some problems under unconstrained conditions, they converge slowly, and the vanishing gradient phenomenon occurs when the number of network layers is too large. To address this kind of problem, Su et al. [12] proposed a multi-inception structure-based convolutional neural network algorithm for face recognition. By transforming the traditional Softmax loss and combining Softmax with TripletLoss, a larger interclass distance and a smaller intraclass distance can be obtained. The experimental results show that the algorithm increases the depth and width of the network and that the intraclass spacing is effectively reduced during training. From the above description, it can be seen that convolutional neural networks face different problems when processing different data and application scenarios. Some scenarios pay more attention to calculation speed, and some pay more attention to detection accuracy. Many scholars strive to find a universal model with high performance in all aspects. This is also the ultimate goal of this study.
To address the problem that occlusion reduces the accuracy of face detection and recognition, this paper proposes a deep network with multilevel feature fusion. This network uses a visual attention mechanism to guide the model to highlight the visible area of the occluded face; the detection and recognition problem is simplified to a high-level semantic feature detection problem, and the position and scale of the face are predicted by means of activation maps, which avoids additional parameter settings. Extensive simulation experiments show that the proposed method is better than the existing mainstream methods at detecting and recognizing occluded faces on the public data sets and achieves a faster detection speed, so it can be used in the field of security surveillance.
The innovations of this study are summarized as follows: (1) To address the missed detections caused by occlusion in face detection, a visual attention mechanism guidance model is proposed, which highlights the visible area of the occluded face and thereby improves face detection and recognition accuracy. (2) The parameters of the new model are simplified. Through the improved analysis network, the face detection problem is simplified to a high-level semantic feature detection problem, and the position and scale of the face are predicted from the activation map to avoid additional parameter settings.

Face Detection Network
The YOLO-V3 network is a strong deep learning model in the field of object recognition; it evolved from the YOLO and YOLO-V2 networks [13]. Compared with deep learning networks based on region proposals, the YOLO network transforms the detection problem into a regression problem. The network does not need an exhaustive set of candidate regions but directly generates the confidence and bounding box coordinates of the object through regression. Compared with the Faster R-CNN network, the detection speed is greatly improved [14]. The YOLO detection model is shown in Figure 1. The network divides each image in the training set into an S × S (S = 13) grid. If the center of a real object falls into a grid cell, that cell is responsible for detecting the category of the object. Multiple bounding boxes are predicted in each grid cell, and each predicted bounding box is given a confidence score that reflects how well the bounding box contains the object, which is defined as follows:

$$\mathrm{confidence} = P_r(\mathrm{object}) \times \mathrm{IoU}^{\mathrm{truth}}_{\mathrm{pred}}$$

where P_r(object) indicates the probability that the bounding box contains an object. If there is an object in the bounding box, P_r(object) = 1; otherwise, P_r(object) = 0. IoU^truth_pred is the Intersection over Union (IoU) between the predicted box and the ground-truth box. The confidence reflects both whether the grid cell contains an object and the accuracy of the predicted bounding box. When multiple bounding boxes detect the same object, YOLO uses nonmaximum suppression to select the best bounding box.
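As a minimal sketch of the confidence score described above, the following Python computes the IoU of two axis-aligned boxes and multiplies it by P_r(object). The (x1, y1, x2, y2) box layout and the function names `iou` and `box_confidence` are assumptions made for illustration; they are not taken from the paper.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def box_confidence(pr_object, pred_box, truth_box):
    """Confidence = P_r(object) * IoU(pred, truth); pr_object is 0 or 1 in training."""
    return pr_object * iou(pred_box, truth_box)
```

For example, two 2 × 2 boxes offset by one pixel in each direction overlap in a 1 × 1 square, so their IoU is 1/7, and a box in a cell without any object always scores 0.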
Although YOLO achieves a faster detection speed, its detection accuracy is not as good as that of Faster R-CNN. To solve this problem, YOLO-V2 introduces the anchor mechanism of the Faster R-CNN network and uses the k-means clustering method or the fuzzy c-means method [15] to generate suitable prior bounding boxes. Therefore, the number of anchor boxes required by the YOLO-V2 algorithm to achieve the same IoU is reduced. YOLO-V2 also improves the network structure and replaces the fully connected layer in the YOLO output layer with a convolutional layer [16]. In addition, YOLO-V2 introduces batch normalization, dimensional clustering, fine-grained features, multiscale training, and other strategies; compared with YOLO, YOLO-V2 greatly improves the detection accuracy. YOLO-V3 is an improved model based on YOLO-V2. Because it uses multiscale prediction to detect the final object, its network structure is more complicated than that of YOLO-V2. YOLO-V3 can predict bounding boxes at different scales and detects small objects more effectively than YOLO-V2, but it still misses partially occluded face objects [17].

Occlusion Face Detection and Recognition Algorithm Combined with the Visual Attention Mechanism
Figure 2 shows the face detection and recognition model proposed in this paper, which consists of two parts: a feature extraction network and a face analysis network. The feature extraction network extracts high-level semantic features from the input image, and the feature-guided attention module is used for feature fusion; the face analysis network predicts the face position, height, and offset heat maps from the obtained high-level semantic features and produces the face bounding box.

Feature Extraction Network.
The feature extraction network is designed on the basis of feature pyramid networks (FPN) [18][19][20][21] and includes a basic network and visual attention networks. ResNet50 performs well in visual tasks such as image classification, so it is used as the basic network. ResNet50 can be divided into 5 levels, and the downsampling rate of level i relative to the input image is 2^i, where i ∈ {1, 2, 3, 4, 5} is the level number. To make full use of the location information of the shallow feature maps and the semantic information of the deep features, the shallow and deep feature maps are fused through the guided attention network. The process is as follows: first, the number of C3 and C4 feature channels is reduced to 256 by a convolutional layer with a 1 × 1 kernel, which reduces the amount of calculation; then, the backbone network feature maps (namely, P3 and P4), upsampled by a factor of 2 with bilinear interpolation, are respectively input into the guided attention module for feature fusion [22][23][24].

Visual Attention Network.
Different feature channels of the convolutional network respond differently to a specific area of the face; that is, the occlusion form of the object can be described through different feature channels. The occlusion form O(n) is defined in terms of p_i and v_i, where p_i denotes the different areas of the face object and v_i ∈ {0, 1}, i ∈ [0, k], indicates whether partial area i of the face is visible. The weights of traditional CNN channels are usually fixed and identical, which limits the network's ability to express different occlusion forms [19,21,[23][24][25][26]. Patil et al. [27] recalibrated the weight of each channel so that the feature channels expressing the visible area of the occluded object contribute more to the final convolution feature, which highlights the occluded object against the background.
The channel weighting process reweights the channel feature F_c by Ω_n, where F_c is the channel feature and Ω_n is the channel weighting vector corresponding to the occlusion form n. The visual attention module learns the attention vector Ω_n and uses it to reweight the feature channels so that the network can adaptively express different occlusion forms. However, the existing models only consider the relationship between channels and ignore the importance of spatial information in the feature map. Because the spatial information of the feature map helps the network locate the region of interest, the feature channel attention mechanism and the spatial attention mechanism are used together in the feature description task in [28]. Similarly, the spatial attention mechanism is applied to object detection in [29] to guide the network to highlight the features that are useful for the current task. On this basis, the feature space information is used in the face detection and recognition task to highlight the area of the occluded face object, and a spatial attention module is constructed for this purpose. The spatial attention module obtains a spatial attention map from the spatial statistics of the feature map and uses it to reactivate the input features, which guides the network to focus on the occluded face and suppress background interference.
As shown in Figure 3, the visual attention network consists of two submodules: channel attention and spatial attention. The input of the visual attention network is two feature maps (such as C4 and P4) from the shallow convolution layer and the deep convolution layer, respectively. First, the input features are concatenated along the channel dimension to obtain F ∈ R^{H×W×C}, and then F is input to the channel attention module.
After a series of operations, the spatial attention module completes the feature fusion. By using the attention module to model both the correlation between feature channels and the spatial information of the feature map, the network can not only enhance the feature representation of the relevant areas but also obtain the location information of the area of interest [29]. While making full use of the useful features to deal with face occlusion, it also suppresses useless clutter, which helps improve the accuracy of face detection and recognition. Figure 4 shows the feature channel attention module proposed in this paper. For the input feature map F, the global information of each feature channel is obtained by global average pooling and maximum pooling to form the channel descriptors z_avg^c and z_max^c, and then the feature channel attention vector Ω_c ∈ R^{1×1×C} is obtained through the two fully connected layers FC1 and FC2. In this way, the network learns to characterize the occlusion form of different samples automatically. The specific steps are shown in equation (3).
In equation (3), σ and δ are the sigmoid function and the ReLU function, respectively. W_1 ∈ R^{C/r×C} and W_2 ∈ R^{C×C/r} are the parameters of the two fully connected layers, where r is the dimensionality reduction ratio. Ω_c ∈ R^{1×1×C} is used to weight the input feature F channel by channel to obtain F′. The process can be written as the following equation:

$$F' = \Omega_c \otimes F$$

where ⊗ represents the channel-by-channel product.
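The channel attention step can be sketched in NumPy as follows. Because equation (3) is not reproduced in this copy, the way the two pooled descriptors are combined is an assumption: here both pass through the same two fully connected layers and the results are summed before the sigmoid, as in other channel attention designs. The function name `channel_attention` and the (H, W, C) array layout are also illustrative choices.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(F, W1, W2):
    """Reweight F (H, W, C) channel by channel.

    W1 has shape (C // r, C) and W2 has shape (C, C // r), matching FC1 and FC2.
    """
    z_avg = F.mean(axis=(0, 1))  # global average pooling, shape (C,)
    z_max = F.max(axis=(0, 1))   # global max pooling, shape (C,)

    def fc(z):
        # FC1 -> ReLU -> FC2
        return W2 @ np.maximum(W1 @ z, 0.0)

    # Assumption: the two branches share weights and are summed
    omega = sigmoid(fc(z_avg) + fc(z_max))  # attention vector, shape (C,)
    return F * omega, omega                  # broadcasts over H and W
```

Each channel of the output is the corresponding input channel scaled by a single learned weight between 0 and 1, which is the channel-by-channel product described in the text.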
Since the useful information for partially occluded face objects is usually obscured by the background, the network also needs to determine the spatial location of the useful information while enhancing the feature expression of the occluded object through the channel attention module. Unlike the channel attention mechanism, the spatial attention mechanism is mainly used to highlight the areas of the feature map that are related to the current task, that is, to guide the network to focus on the visible area of the occluded object [30].
In the spatial attention module, a maximum pooling operation is first performed on the input feature map F′ along the channel dimension to obtain the feature map F′_max ∈ R^{H×W×1}, which summarizes the spatial information of the feature map; then, this feature map is passed through a 3 × 3 convolution layer f_c and the sigmoid function to obtain the spatial attention map M_s ∈ R^{H×W×1}:

$$M_s = \sigma\left(f_c\left(F'_{\max}\right)\right)$$

where σ is the sigmoid function. Finally, the spatial attention map M_s is used to reactivate the input F′ to obtain the final feature map F″:

$$F'' = M_s \otimes F'$$

where ⊗ represents the element-wise product between feature maps.

The face detection and recognition task is regarded as a high-level semantic feature detection problem. Once the semantic features are obtained, the final predicted bounding box is produced by the face analysis network. In this paper, the position, height, and position offset of the face are predicted first, the size of the bounding box is obtained by a simple geometric transformation, and then a simple recognition network can achieve high-precision recognition [31]. Specifically, after the predicted height h of the face is obtained, the width w = h · α of the bounding box is calculated from the aspect ratio α of the bounding box. If the output feature map of the feature extraction network is F_final ∈ R^{H/s×W/s}, three heat maps are predicted by three parallel 1 × 1 convolutional layers, corresponding to the center position H_c ∈ R^{H/s×W/s}, the height H_h ∈ R^{H/s×W/s}, and the position offset H_offset ∈ R^{H/s×W/s}, respectively, where s is the sampling rate of the output activation map relative to the input image. Predicting heat maps avoids the limitation of the prior boxes adopted by traditional methods and allows more flexible face detection and recognition within the same network.
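The spatial attention step described above (channel-wise max pooling, a 3 × 3 convolution, a sigmoid, and an element-wise product) can be sketched in NumPy as below. Zero padding that keeps the H × W size, the optional bias term, and the function name `spatial_attention` are assumptions; the paper does not state the padding scheme.

```python
import numpy as np


def spatial_attention(F_prime, kernel, bias=0.0):
    """Reactivate F_prime (H, W, C) with a spatial attention map.

    kernel is the (3, 3) weight of the convolution layer f_c.
    """
    f_max = F_prime.max(axis=2)      # max pooling over channels, shape (H, W)
    H, W = f_max.shape
    padded = np.pad(f_max, 1)        # assumed zero padding to keep H x W
    conv = np.zeros((H, W))
    for dy in range(3):              # 3 x 3 cross-correlation, as in CNN layers
        for dx in range(3):
            conv += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W]
    M_s = 1.0 / (1.0 + np.exp(-(conv + bias)))  # spatial attention map in (0, 1)
    return F_prime * M_s[:, :, None], M_s       # same map applied to every channel
```

With an all-zero kernel the attention map is 0.5 everywhere, so the module simply halves its input; a trained kernel instead raises the map over the visible face region and lowers it over the background.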

Position Prediction.
Face location prediction is achieved through the location heat map H_c. In this paper, position prediction is treated as a binary classification problem. The position of the object center on the feature map F_final is (x_c, y_c). The pixel at the object center is selected as the positive sample, and all other positions are negative samples. The cross-entropy loss function is used to train the position-prediction branch. The training ground truth H^gt_c is generated by a 2D Gaussian function, and the true value at any position is given by equation (8), where (x_c, y_c) is the center location of the object and σ_w and σ_h are the variances associated with the width and height of the object, respectively. To alleviate the imbalance between positive and negative samples during training, the focal loss is used as the loss function for the center position prediction, where p(i, j) is the prediction score of the object center at (i, j) in the predicted heat map, N is the number of objects in the image, and α and β are the balance factors, generally set to 2 and 4.
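Since equation (8) and the focal loss formula are not reproduced in this copy, the following NumPy sketch shows one common form that matches the description: an unnormalized 2D Gaussian that equals 1 at the object center, and a penalty-reduced focal loss with α = 2 and β = 4 normalized by the number of objects. Both formulas, and the names `gaussian_heatmap` and `center_focal_loss`, are assumptions rather than the paper's exact definitions.

```python
import numpy as np


def gaussian_heatmap(shape, center, sigma_w, sigma_h):
    """Ground-truth center map of size (H, W) for one object centered at (x_c, y_c)."""
    H, W = shape
    x_c, y_c = center
    ys, xs = np.mgrid[0:H, 0:W]
    return np.exp(-((xs - x_c) ** 2 / (2.0 * sigma_w ** 2)
                    + (ys - y_c) ** 2 / (2.0 * sigma_h ** 2)))


def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-7):
    """Focal loss over a center heat map; positives are the pixels where gt == 1."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = gt == 1.0
    n_objects = max(int(pos.sum()), 1)
    # Positive pixels: down-weight predictions that are already confident
    pos_loss = ((1.0 - pred[pos]) ** alpha * np.log(pred[pos])).sum()
    # Negative pixels: (1 - gt)^beta softens the penalty near each center
    neg_loss = ((1.0 - gt[~pos]) ** beta * pred[~pos] ** alpha
                * np.log(1.0 - pred[~pos])).sum()
    return -(pos_loss + neg_loss) / n_objects
```

Under this form, a prediction that follows the Gaussian target yields a much smaller loss than a flat prediction of 0.5, because the many background pixels far from the center contribute almost nothing.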

Height Prediction.
Given that the position of face k in the height heat map is (x_k, y_k), its corresponding true value is H^gt_h(x_k, y_k) = log(h_k), where h_k denotes the height of object k. In this paper, the true value within the radius r of (x_k, y_k) is set to log(h_k), and the radius r is set according to the width of the object, generally r = 0.5 w_k. The proposed model uses the L1 loss function to train this branch; the loss compares the height predicted for each object k in the heat map with its true value and is averaged over N, the number of objects in the image.
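A minimal sketch of the height loss, assuming the network outputs the logarithm of the height (to match the log(h_k) target) and that the loss is the mean absolute error over the N objects. The function name `height_l1_loss` and its list-based interface are hypothetical.

```python
import math


def height_l1_loss(pred_log_heights, true_heights):
    """Mean L1 distance between predicted log-heights and log(h_k) over N objects."""
    n = len(true_heights)
    if n == 0:
        return 0.0
    total = sum(abs(p - math.log(h))
                for p, h in zip(pred_log_heights, true_heights))
    return total / n
```

Working in log space makes the error relative: predicting 55 pixels for a 50-pixel face costs about as much as predicting 110 pixels for a 100-pixel face.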

Deviation Prediction.
Since the convolutional network is usually a downsampling process, the position (x, y) on the input image is mapped into the heat map at a position that can be expressed as (x/s, y/s), where s is the downsampling rate of the network. When a position on the activation map is mapped back to the input image, an error is introduced, which especially affects the detection and recognition of dim, small faces. To alleviate this problem, the position prediction of the object is corrected by predicting the deviation (offset) of the center position [32], for which a corresponding true value is defined. Finally, the proposed network is trained with a weighted multitask loss function, which can be written as follows:

$$L = \lambda_c L_c + \lambda_h L_h + \lambda_o L_o$$

where L_c, L_h, and L_o are the center position, height, and offset losses, and λ_c, λ_h, and λ_o are weighting factors, which are set to 0.01, 1, and 0.12, respectively.
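The following sketch illustrates how a center, a height, and an offset prediction could be combined into a bounding box. It assumes that the offset target is the fractional part lost when (x/s, y/s) is rounded down to a heat-map cell, that the height branch outputs a log-height in input-image pixels, and that the width follows w = h · α; the names `offset_target` and `decode_box` are hypothetical.

```python
import math


def offset_target(x, y, stride):
    """Fractional part lost when an image position is floored to a heat-map cell."""
    return (x / stride - math.floor(x / stride),
            y / stride - math.floor(y / stride))


def decode_box(cx, cy, log_h, offset, stride, aspect_ratio):
    """Map a heat-map cell (cx, cy) back to an image box (x1, y1, x2, y2)."""
    h = math.exp(log_h)              # height branch is assumed to predict log(h)
    w = aspect_ratio * h             # w = h * alpha
    x = (cx + offset[0]) * stride    # offset restores sub-cell precision
    y = (cy + offset[1]) * stride
    return (x - w / 2.0, y - h / 2.0, x + w / 2.0, y + h / 2.0)
```

With s = 4, a face centered at (10, 7) falls in cell (2, 1) with offset (0.5, 0.75); decoding that cell with the offset recovers the center exactly, whereas ignoring the offset would place it at (8, 4).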

Experimental Data Set and Parameter Settings.
To evaluate the performance of the face detection and recognition algorithm based on the visual attention-guided mechanism proposed in this paper, the LFW database (Labeled Faces in the Wild), the CMUFD database (CMU face detection database) [23], and the UCFI database (UCD color face image) [25] are used as face detection and recognition data sets. It consists of 500 images with a resolution of 2048 × 1024. Since CMUFD contains a large number of partially occluded face images, it is selected for the verification and comparison tests of the proposed method; UCFI contains about 350,000 face samples, and its standard testing set consists of 4,024 images with a resolution of 640 × 680 in a simple scenario. To verify the generalization of the proposed method, some testing experiments are performed on the UCFI data set. The face objects in all training data have been accurately annotated, and everything outside the object area is marked as background, so the labeled data sets can be used for training and testing face detection and recognition models. The network proposed in this paper uses a pretrained ResNet50 as the backbone network. Its parameters are set as follows: depth = 40, growth_rate = 12, bottleneck = True, and reduction = 0.5; the minibatch size is set to 16, the learning rate is set to 0.001, the dropout parameter is set to 0.8, and the maximum number of iterations is set to 10,000. To improve optimization efficiency, this paper uses the Adam optimization algorithm. Adam is an extension of the stochastic gradient descent algorithm that iteratively updates the neural network weights based on the training data; the learning rate is initialized to 0.25 and changed to 0.025 at the 30th epoch. The nonmaximum suppression (NMS) algorithm is used to filter out redundant face results. The Intersection over Union (IoU) threshold is set to 0.5, and only the face results with an object confidence score greater than 0.1 are retained [33].
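The filtering step can be sketched as a greedy nonmaximum suppression with the thresholds stated above (IoU 0.5, confidence 0.1). This is a plain-Python illustration, not the paper's implementation; the (x1, y1, x2, y2) box layout and the function names are assumptions.

```python
def _iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thr=0.5, score_thr=0.1):
    """Return indices of the boxes kept after score filtering and greedy NMS."""
    # Drop low-confidence results, then visit the rest from highest score down
    order = sorted((i for i, s in enumerate(scores) if s > score_thr),
                   key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap a kept box by more than iou_thr
        if all(_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

For instance, two detections of the same face that overlap with an IoU of about 0.68 collapse to the higher-scoring one, while a detection elsewhere in the image is kept and a detection scoring 0.05 is discarded before suppression.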

Evaluation Index.
At present, intelligent video surveillance analysis systems are able to detect and recognize faces at different scales. However, existing algorithms produce a large number of false detections for partially occluded face objects, mainly because the face objects are only partly visible and the gray levels of the face and the background are similar. Therefore, a standard evaluation indicator is selected for performance evaluation, namely the false positives per image (FPPI), which focuses on how often false positives occur, as shown in Table 1.
In the detection stage, the evaluation criteria are the detection rate (DR) and the false detection rate (false positives per image, FPPI):

$$\mathrm{DR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPPI} = \frac{FP}{N_{\mathrm{img}}}$$

where TP represents the number of positive samples detected correctly, TP + FN represents the number of positive samples contained in the images, FP represents the number of false positive samples, and N_img is the number of test images. In addition, we use the log-average miss rate (MR) to characterize the performance of the detector on faces. This paper mainly focuses on occlusion in face detection and recognition. Therefore, we define the visible range to characterize the occlusion of the face. Let λ be the ratio of the visible area of the object to its total area: if λ > 0.7, the object is in a normal state, denoted N; if 0.2 < λ < 0.7, the object is in a seriously occluded state, denoted H. For ease of analysis, we also divide the data set into four types of subsets: the mixed face data set (mixed), the bare occlusion face data set (bare), the partially occluded data set (partial), and the severely occluded data set (heavy) [34].
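The two detection metrics and the visibility labels can be written down directly, as in the short sketch below. The function names are hypothetical, and returning `None` for visibility ratios outside the two stated ranges is an assumption, since the text does not label them.

```python
def detection_rate(tp, fn):
    """DR = TP / (TP + FN): share of ground-truth faces that were detected."""
    total = tp + fn
    return tp / total if total else 0.0


def fppi(fp, num_images):
    """False positives per image over a test set of num_images images."""
    return fp / num_images


def occlusion_level(visible_ratio):
    """Label a face by its visible-area ratio, following the ranges in the text."""
    if visible_ratio > 0.7:
        return "N"   # normal
    if 0.2 < visible_ratio < 0.7:
        return "H"   # seriously occluded
    return None      # not labeled by the stated ranges (assumption)
```

As a usage example, 90 detected faces out of 100 give DR = 0.9, and 50 false positives over 500 test images give FPPI = 0.1.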

Performance Analysis for Face Detection.
To verify the effectiveness of the feature-guided attention network, the detector with the attention network removed is used as the test baseline (Baseline), and Face++ proposed in [17] is used as a comparison method. Baseline builds the model with the same feature fusion method as FPN. CA denotes the channel attention module, and SA denotes the spatial attention module. The performance of the detector after adding each module is compared in the experiment, and the test results are shown in Table 2.
Compared with the baseline model, the miss rate (MR) of the face detector on occluded images decreases significantly after the attention module is added, indicating that the proposed attention mechanism effectively guides the detector to focus on the occluded object. Compared with Face++, the proposed method significantly reduces MR under all evaluation criteria; under the severe occlusion criterion in particular, MR decreases by 18.2%, indicating that the proposed method is effective for face detection in complex scenarios.
To show more intuitively how the attention module affects the face detector, Figure 5 visualizes the face prediction results. Figure 5(a) is a face image, which is input to Baseline, Baseline + CA, and Baseline + CA + SA, respectively, to obtain the position predictions in Figures 5(b), 5(c), and 5(d). It can be observed that the feature response in Figure 5(d) is closer to the visible area of the face, while background interference remains in Figures 5(b) and 5(c), although the interference in Figure 5(c) is significantly less than that in Figure 5(b). This also shows that the attention module guides the network to highlight the visible part of the occluded face while reducing the impact of background noise on detection performance.

Performance Analysis for Face Recognition.
Since recognition algorithms based on deep learning have achieved strong results on natural images, several classical deep learning algorithms are compared with the proposed algorithm. To analyze the accuracy of the proposed algorithm for occluded face detection both qualitatively and quantitatively, this paper selects FACEILD [35], Faster-FCC [36], KSDD [37], DNET [14], ResNet [7], and ConvNet [38] as comparison algorithms to further verify the performance of the proposed method.
From the experimental results in Table 3, it can be seen that the face detection results of the proposed method are better than those of DNET, mainly because the attention perception fusion module improves face detection accuracy and the multiscale pyramid pooling layer captures high-level semantic features. The complementary features effectively preserve the clear boundary of the face, while combining the multiple side outputs with the pyramid pooling layer output extracts rich global context information and suits the two-class classification problem of face recognition. The heavy data set contains the most complex face images in the whole testing data. The faces are seriously occluded, the image contrast is low, and the faces are blurred, all of which directly affect the detection and recognition performance of the network. From the precision comparison in Table 3, it can be seen that the detection rate on the heavy data is the lowest, mainly because occlusion greatly reduces the perception ability of the deep network. Even so, the face detection based on the visual attention-guided mechanism proposed in this paper is better than the other deep networks: the proposed algorithm achieves 59.78% detection accuracy under the heavy evaluation standard, which is better than the comparison methods. In terms of efficiency, detection and recognition of an input image with a resolution of 1024 × 2048 takes 0.22 s, achieving a good balance between speed and accuracy. For input images with a smaller resolution, the detection speed of this method improves further.
In the selected detection images, the shape and scale of the faces differ considerably, and the gray levels of the face and the adjacent background are often similar. With maximum analysis of the response map, these appearance changes prevent an accurate boundary from being obtained, so ConvNet, KSDD, and FACEILD cannot locate the face accurately. In contrast, the results show that faces with serious occlusion, deformation, and low contrast can be accurately detected and recognized by the proposed method. KSDD is a lightweight network structure based on the VGG network. Although it balances robustness against speed, it is still easily disturbed by occlusion, which shifts the detection center. The detection results also show that the ConvNet detection and recognition result deviates from the face center. The proposed model uses the attention-guided mechanism to highlight the visible area of occluded faces, so the proposed algorithm adapts better to occlusion interference in face detection.

Generalization Analysis.
To verify the generalization performance of the proposed method, the method was trained on the LFW training set, and cross-data set experiments were performed on the CMUFD database. The heavy subset consists of face objects with a height greater than 50 pixels and a visible range of [0.20, 0.65]. Figure 6 shows the FPPI statistics of the face detection and recognition algorithms at different detection rates. To facilitate comparison across the deep networks, the experiment mainly analyzes the detection results of each algorithm at FPPI = 1. The recognition rate of the ResNet algorithm is 91.88%, the recognition rate of the KSDD algorithm is 51.91%, and the detection rate of the DenseNet algorithm is only 58.69%. The reason is that most deep detection methods only use the side-output features and ignore the importance of global structural features. The proposed method uses a visual attention mechanism to guide the model to highlight the visible area of the occluded object, simplifies the face detection and recognition problem to a high-level semantic feature detection problem through an improved analytical network, and uses the activation map to predict the location and scale of the face, which avoids additional parameter settings and further reduces the false detection rate per image. It can be clearly observed from Figure 6 that the performance of the proposed algorithm is better than that of the other algorithms.

Conclusions
The performance of face detection and recognition degrades under occlusion, which often leads to missed detections. To improve the accuracy of face detection and recognition, this paper proposes a visual attention mechanism guidance model, which uses visual attention to guide the model to highlight the visible area of the occluded face. The face detection problem is simplified into a high-level semantic feature detection problem through the improved analytical network, and the location and scale of the face are predicted from the activation map to avoid additional parameter settings. Extensive simulation experiments show that the proposed method outperforms the comparison algorithms in the accuracy of occluded face detection and recognition on the face databases. In addition, the proposed method achieves a better balance between accuracy and speed, so it can be used in the field of security surveillance. However, the performance of the proposed algorithm is sensitive to its parameters, and its generalization is limited. Addressing these limitations would make the model easier to apply to other scenarios and data.

Data Availability
All the data used to support the findings of this study are available within the article.

Conflicts of Interest
The author declares that there are no conflicts of interest.

Acknowledgments
This work was financially supported by the key project of Education Bureau of Guangdong Province (Exploring the