A Convolutional Neural Network Face Recognition Method Based on BiLSTM and Attention Mechanism

Face recognition technology is a powerful means of capturing biological facial features and matching them against facial data in existing databases. With the advantages of noncontact and long-distance operation, it is being used in more and more scenarios. However, because captured face images are affected by factors such as light, posture, and background environment, the recognition rate of existing face recognition models is still insufficient. We propose AB-FR, a convolutional neural network face recognition method based on BiLSTM and an attention mechanism. By adding an attention mechanism to the CNN model structure, the information from different channels is integrated to enhance the robustness of the network, thereby enhancing the extraction of facial features. Then, BiLSTM is used to extract the timing characteristics of photos of the same person taken at different angles or different times so that the convolutional blocks can obtain more face detail information. Finally, we use the cross-entropy loss function to optimize the model and achieve correct face recognition. The experimental results show that the improved network model achieves better identification performance and stronger robustness on several public datasets (CASIA-FaceV5, LFW, MTFL, CNBC, and ORL), with accuracy rates of 99.35%, 96.46%, 97.04%, 97.19%, and 96.79%, respectively.


Introduction
Face recognition technology extracts facial features from input face images and uses them to identify the person. It has been one of the hottest research topics in recent years, combines many specialized technologies, and has been extensively researched and developed. The principle of face recognition technology mainly contains four parts: the acquisition and preprocessing of face images, face detection, face feature extraction, and face recognition. Face detection is mainly used to mark the location and size of faces in images. Face feature extraction builds a model of the facial features found in the detected face. In modern society, face recognition technology has been widely used in government, education, and other fields, for applications such as access control, ATMs, and attendance. Face recognition technology [1] has attracted much attention.
The rapid development of deep learning has made face recognition technology more mature. Since the temporal characteristics of real face pictures are affected by factors such as illumination, posture, and background environment, face recognition can be inaccurate. Face recognition technology is also not well applied to recognizing multiple face pictures of the same person taken within a short time. For example, the CASIA-FaceV5 dataset includes 2,500 images of 500 people, with an average of 5 different images per person with the same background. The classic deep learning-based face recognition method does not consider the timing characteristics between different photos of the same person. Therefore, the recognition efficiency is low. This paper considers the temporal and geometric features of different face images. A convolutional neural network face recognition method based on BiLSTM and an attention mechanism, called AB-FR, is proposed.
The model proposed in this paper has the following contributions:
(1) The proposed model is based on CNNs, and an attention mechanism is added to extract important face block features and better process face feature information.
(2) BiLSTM (bidirectional long short-term memory) is added to the proposed model to capture the sequence features of the image and extract the hidden temporal features, so as to further improve the accuracy of feature selection.
(3) Cross-entropy is used as the loss function to classify the data, thus improving training efficiency and classification accuracy.
The rest of this paper is organized as follows. Section 2 describes the related work. Section 3 briefly introduces some background knowledge about CNN, BiLSTM, and SENet. Section 4 presents our AB-FR model in detail and describes our experimental results and discussion. Section 5 concludes the paper.

Related Work
Traditional face recognition techniques include facial recognition based on geometric features [2], facial recognition based on feature faces [3], and methods based on hidden Markov models [4]. In face recognition based on geometric features, geometric features can be eyes, nose, mouth, and other shapes and geometric relationships between them (such as their distance from each other). Yuangen et al. [2] used the AdaBoost algorithm combining skin color detection and geometric features, which uses skin color to roughly filter face candidate regions. However, face detection based on skin color is sensitive to light and noise environments, which makes the face detection effect unsatisfactory.
The face recognition method based on feature face reflects the information hidden in the face sample set and the structural relationship of the face. It is easily affected by factors such as facial expression, posture, and lighting and has great limitations. Wang et al. [3] proposed an image processing method based on the fusion of facial key detection points and grayscale transformation. It reduces the complexity of face recognition by adjusting the saliency, contrast, and data complexity of face image features. However, when using low-resolution images, due to their low feature information and high sensitivity to image occlusion, this method retains insufficient features and has a low recognition rate.
In the face recognition method based on the hidden Markov model (HMM), the hidden Markov model is a probabilistic model about time series, which can extract the face image observation sequence. Each area of the face has multiple states, through which the HMM of the face image can be established. Wang and Liu [4] used the facial features extracted from the hidden Markov model and the characteristics of the Viterbi algorithm to segment the facial feature sequence. But it does not consider the correlation between sequence annotation and the length and context of the observed sequence, so the correct rate of face recognition is not high enough.
In recent years, deep learning [5] has been widely used, especially in the face recognition field. Unlike the traditional methods, the face recognition method based on deep learning is more inclusive in face feature selection and adapts well to the influence of lighting and changes in expression. Common face recognition methods based on deep learning include MTCNN [6], DeepID [7], FaceNet [8], and DeepFace [9].
MTCNN [6] adopts a cascade design comprising three networks: P-Net, R-Net, and O-Net. It uses the image pyramid algorithm, which needs to scale the image several times, resulting in a large number of forward passes that seriously slow down the detection speed. FaceNet [8] is a deep convolutional neural network. At the heart of this method are millions of training data and Triplet-Loss. Therefore, the model requires a lot of calculation and runs slowly. DeepFace [9] improves the face alignment method using explicit 3D face modeling and piecewise affine transformation applied to frontal faces. But the model has to be rebuilt with each call, resulting in a very high memory requirement.
It can be seen from the above that a convolutional neural network is a commonly used deep learning method in face recognition. It has good self-organization and adaptive abilities and can implicitly describe many laws of face recognition through the learning process. It is more adaptable, but it also has some drawbacks that make it less effective.
Tao et al. [10] proposed the combination of CNN and metric learning methods for face recognition, which uses a multi-Inception structure to extract facial features. But it requires too many data samples. Hu et al. [11] fused the extracted multi-layer features in the subspace, further improving the face recognition effect. However, the number of parameters is too large and the operation is slow. Wu and Zhang [6] changed the SoftMax layer in MobileNet to L-SoftMax to avoid overfitting and achieve better classification results. But it deepens the network and increases the running time, which is not efficient. Ren and Xue [12] proposed an R-CNN face recognition method, which combines ResNet with CNN. However, the scale of the network is too large and requires the support of a large number of datasets.

Method of AB-FR
Our network is based on the CNN structure, to which we add BiLSTM to extract the bidirectional time series features of the images. Then, an attention mechanism is added to extract the important features of the images. Together, these form the AB-FR model.

Convolutional Neural Networks.
The underlying network structure used in this paper is the CNN (convolutional neural network) [13][14][15][16]. It is a feedforward neural network that avoids complex image preprocessing. The difference between convolutional neural networks and general neural networks is that local connection and weight sharing reduce the number of weights to be trained, thereby reducing the learning complexity of the network model. Figure 1(a) shows the local connection, and Figure 1(b) shows the full connection. Figure 2 represents weight sharing. The structure of the convolutional neural network includes a convolutional layer, a pooling layer, and a fully connected layer. Its classic model is shown in Figure 3.
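The effect of local connection and weight sharing on the number of trainable weights can be illustrated with a small count. The following is a minimal sketch, assuming a single-channel 64 × 64 input mapped to a 64 × 64 output and a 3 × 3 receptive field; the function names are hypothetical and biases are ignored.

```python
def fully_connected_weights(in_h, in_w, out_h, out_w):
    """Weights when every output unit connects to every input pixel."""
    return (in_h * in_w) * (out_h * out_w)


def locally_connected_weights(out_h, out_w, k):
    """Weights when each output unit sees only its own k x k patch (no sharing)."""
    return out_h * out_w * k * k


def shared_conv_weights(k):
    """Weights when a single k x k kernel is shared across all positions."""
    return k * k


# Illustrative sizes only (assumed, not taken from the paper's tables).
full = fully_connected_weights(64, 64, 64, 64)
local = locally_connected_weights(64, 64, 3)
shared = shared_conv_weights(3)
print(full, local, shared)
```

Under these assumptions the counts are 16,777,216 for full connection, 36,864 for local connection alone, and 9 once the kernel is shared.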

BiLSTM.
The CASIA-FaceV5 dataset used in this paper includes images of 500 people, and each person has five different images with the same background. The LFW dataset has a total of 5,749 people and 13,233 face images, and some of the people have more than two images. The MTFL dataset contains nearly 13,000 face images from the Internet. The CNBC dataset collects mugshots of 200 people from various states. The ORL dataset includes the faces of 40 people, each with 10 images. There must be a temporal relationship between images of each person taken at different periods or different angles, so we add BiLSTM to the network model. In this way, the time series information between different images can be fully considered, so that the time series feature vector of the face image can be obtained and sent to the attention mechanism.
The long short-term memory (LSTM) network [17] is a time series neural network derived from the recurrent neural network. By introducing gate functions, it can mine the variation laws of time series with relatively long intervals and delays. The bidirectional long short-term memory (BiLSTM) network consists of a forward LSTM network and a backward LSTM network, and its network structure is shown in Figure 4.
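The forward and backward passes of a bidirectional recurrent layer can be sketched in a few lines of NumPy. This is a simplified illustration rather than the layer used in AB-FR: the gated LSTM cells are replaced by plain tanh cells, and the roles given to the six weights (named w1 to w6 here) follow one common formulation and are an assumption.

```python
import numpy as np


def bidirectional_rnn(x, w1, w2, w3, w4, w5, w6):
    """Simplified bidirectional recurrence (tanh cells, no LSTM gates).

    x: (T, d) input sequence.
    w1, w3: (n, d) input weights of the forward / backward layer (assumed roles).
    w2, w5: (n, n) recurrent weights of the forward / backward layer.
    w4, w6: (m, n) output weights for the forward / backward hidden states.
    Returns the outputs O with shape (T, m).
    """
    T = x.shape[0]
    n = w2.shape[0]
    h_fwd = np.zeros((T, n))
    h_bwd = np.zeros((T, n))

    state = np.zeros(n)
    for t in range(T):                      # time 1 -> T
        state = np.tanh(w1 @ x[t] + w2 @ state)
        h_fwd[t] = state

    state = np.zeros(n)
    for t in reversed(range(T)):            # time T -> 1
        state = np.tanh(w3 @ x[t] + w5 @ state)
        h_bwd[t] = state

    # Combine both directions at each time step.
    return h_fwd @ w4.T + h_bwd @ w6.T


# Hypothetical sizes: a sequence of 5 feature vectors, e.g. five images of one person.
rng = np.random.default_rng(0)
T, d, n, m = 5, 8, 4, 3
x = rng.normal(size=(T, d))
w1, w3 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
w2, w5 = rng.normal(size=(n, n)), rng.normal(size=(n, n))
w4, w6 = rng.normal(size=(m, n)), rng.normal(size=(m, n))
outputs = bidirectional_rnn(x, w1, w2, w3, w4, w5, w6)
```

Each output row depends on the whole sequence: the forward state summarizes the earlier images and the backward state summarizes the later ones.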
In Figure 4, $w_1$ to $w_6$ denote six shared weights. The forward layer is computed in the forward direction, and the output $h_t$ of the forward hidden layer from time 1 to time t is obtained and saved. In the backward layer, a reverse calculation is performed to obtain the output $h_t'$ of the backward hidden layer from time t to time 1, and it is saved. Finally, the outputs of the forward layer and the backward layer at the corresponding moment are combined to obtain the final output $O_t$, whose expression is given in equation (1). By adding BiLSTM, this paper extracts bidirectional sequence features from the images and automatically generates the context relationship of the sequence, which effectively increases the amount of context information available to the network model and improves the accuracy of face recognition.

Attention Mechanism Module.
The attention mechanism [18] is a data processing method in machine learning. Generally speaking, when people observe external things, they tend to observe some important local information first. Then, information from different regions is combined to form an overall impression of what is being observed.
This paper adopts the typical channel attention mechanism SENet to capture more important feature information, and its calculation is mainly divided into two steps.
Step 1. Squeeze operation: perform a global average pooling operation on the input features and compress the H × W × C features into a 1 × 1 × C size, as expressed in equation (2).
Step 2. Excitation operation: perform two full connection operations on the result of the squeeze operation and then use sigmoid activation to obtain the weight matrix.
Its specific implementation is shown in Figure 5. This paper adds an attention mechanism to filter out important information from a large amount of information. Through automatic learning, the neural network can get the importance of each channel in the feature map and make it more focused on some feature channels. In this way, we can increase the channel weights that are useful for the current task and suppress the feature channel weights that are useless for the current task, thereby improving the performance of the network model.
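The squeeze and excitation steps can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the implementation used in AB-FR: the ReLU between the two fully connected operations and the reduction ratio r follow the usual SENet design and are not specified in the text above, and the weight matrices fc1 and fc2 are hypothetical random values.

```python
import numpy as np


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def se_block(u, fc1, fc2):
    """Channel attention on a feature map u of shape (H, W, C).

    fc1: (C // r, C) and fc2: (C, C // r) are the two fully connected weights.
    Returns the reweighted feature map and the per-channel weights.
    """
    z = u.mean(axis=(0, 1))          # squeeze: H x W x C -> C (global average pooling)
    s = np.maximum(fc1 @ z, 0.0)     # first full connection + ReLU (assumed)
    weights = sigmoid(fc2 @ s)       # second full connection + sigmoid, values in (0, 1)
    return u * weights, weights      # scale each channel by its weight


# Hypothetical sizes: 8 x 8 feature map with 16 channels, reduction ratio r = 4.
rng = np.random.default_rng(0)
H, W, C, r = 8, 8, 16, 4
u = rng.normal(size=(H, W, C))
fc1 = rng.normal(size=(C // r, C))
fc2 = rng.normal(size=(C, C // r))
scaled, channel_weights = se_block(u, fc1, fc2)
```

Channels whose weight is close to 1 pass through almost unchanged, while channels whose weight is close to 0 are suppressed.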

Experiment and Analysis
In this section, we introduce the datasets and preprocessing, the experimental environment and parameter settings, the experimental model, the experimental results, the comparison with other algorithms, and the ablation experiment. Before training, the images need to be preprocessed. First, the image is converted to grayscale. Then, the frontal_face_detector that comes with dlib is used to detect faces. Next, the brightness and contrast of the image are randomly adjusted to increase the sample diversity. Finally, the image is resized to 64 × 64. The original data are divided into the training set and test set at a ratio of 3 : 7, and labels are assigned to distinguish the face to be recognized from the other faces. This paper uses the CASIA-FaceV5, LFW, MTFL, CNBC, and ORL face datasets. Figure 6(a) shows some face images before preprocessing on CASIA-FaceV5. Figure 6(b) shows some face images after preprocessing on CASIA-FaceV5. Figure 7(a) shows some face images before preprocessing on LFW. Figure 7(b) shows some face images after preprocessing on LFW.
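The preprocessing steps described above can be sketched with NumPy alone. This is an illustrative sketch, not the authors' pipeline: face detection with dlib's frontal_face_detector is omitted, the luminance coefficients, jitter ranges, and nearest-neighbor resizing are assumptions, and all function names are hypothetical. The split helper only produces the two parts of a 3 : 7 division and does not assume which part is used for training.

```python
import numpy as np


def to_grayscale(img):
    """(H, W, 3) uint8 RGB image -> (H, W) float luminance."""
    return img.astype(np.float64) @ np.array([0.299, 0.587, 0.114])


def jitter(gray, rng, max_shift=30.0, contrast_range=(0.8, 1.2)):
    """Randomly adjust contrast (scale) and brightness (shift); ranges are assumed."""
    alpha = rng.uniform(*contrast_range)
    beta = rng.uniform(-max_shift, max_shift)
    return np.clip(alpha * gray + beta, 0.0, 255.0)


def resize_nearest(gray, size=64):
    """Nearest-neighbor resize of an (H, W) image to (size, size)."""
    h, w = gray.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return gray[rows][:, cols]


def split_indices(n, first_fraction, rng):
    """Shuffle n sample indices and cut them into two parts."""
    order = rng.permutation(n)
    cut = int(round(n * first_fraction))
    return order[:cut], order[cut:]


rng = np.random.default_rng(0)
face = rng.integers(0, 256, size=(120, 100, 3), dtype=np.uint8)  # stand-in for a cropped face
sample = resize_nearest(jitter(to_grayscale(face), rng))
part_a, part_b = split_indices(10, 0.3, rng)                      # 3 : 7 division
```

In practice, the detector output would be used to crop the face region before the jitter and resize steps.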

Experimental Environment and Parameter Settings.
The experiments are all carried out on a platform with an i7-6000 CPU at 3.40 GHz and 8 GB of memory, using Python and NumPy for matrix calculation and TensorFlow as the back end to develop the improved convolutional neural network.
For parameter setting, we conducted experiments taking the LFW dataset as an example. Two parameters, the batch size and the convolution kernel size, are used as illustrations.
Setting an appropriate batch size [19] can make the gradient descent direction more accurate and improve memory utilization and training speed. However, if the batch size is too small, the diversity of the samples will be lost, so the robustness of the trained neural network is not good. Too large a batch size wastes a lot of computing space, so a suitable batch size needs to be chosen. The evaluation indicators under different batch sizes are shown in Table 1 and Figure 8.
It can be seen from Table 1 that the bold values perform best. When the batch size is 32, the recognition rate is the highest, so 32 is chosen as the batch training size.
The size of the convolution kernel determines the output size [20]. In theory, the smaller the convolution kernel, the better. But for particularly sparse data, a relatively small convolution kernel may not be able to represent its features, while larger convolution kernels lead to an increase in complexity. So, we need to choose an appropriate convolution kernel size. Table 2 shows the evaluation metrics of faces under different sizes of convolution kernels.
From Table 2, it can be concluded that the bold values, obtained with a 3 × 3 convolution kernel, give the best face recognition results. Therefore, this paper chooses a 3 × 3 convolution kernel for the experiments.
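The trade-off between kernel size, output size, and complexity can be made concrete with the standard convolution size formula. The following is a minimal sketch, assuming a 64 × 64 input with 3 channels; the 32 output channels and the function names are hypothetical.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * padding - k) // stride + 1


def kernel_weights(k, in_channels, out_channels):
    """Number of weights (without biases) in a k x k convolutional layer."""
    return k * k * in_channels * out_channels


for k in (1, 3, 5, 7):
    unpadded = conv_output_size(64, k)                 # no padding: output shrinks with k
    padded = conv_output_size(64, k, padding=k // 2)   # padding of k // 2 keeps the size
    print(k, unpadded, padded, kernel_weights(k, 3, 32))
```

The weight count grows with the square of the kernel size, which is the increase in complexity mentioned above.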
In CNN, batch size, learning rate, optimizer, pooling size, activation functions, loss functions, and kernel size are several important parameters. After many experiments, we have made the following settings for these parameters, as shown in Table 3.

Algorithm Model.
This paper uses the CASIA-FaceV5, LFW, MTFL, CNBC, and ORL datasets for sample training and prediction. The AB-FR model diagram is shown in Figure 9. Our input image size is 64 × 64 × 3. First, the features are extracted through three convolutional layers, which increase the number of channels but leave the image size unchanged. The size of the convolution kernel we use is (3, 3), and the convolution stride is [1, 1, 1, 1]. Then, the feature maps are downsampled through three pooling layers. This paper uses maximum pooling with a sampling size of 2 × 2, so the length and width of each output feature map are half those of the input feature map. A dropout layer is added after each pooling layer to prevent overfitting. Next, the features are input into BiLSTM to extract bidirectional information, and then the learned important features are extracted through the channel attention mechanism. Finally, in order to enhance the network's nonlinear ability and limit the network size, two fully connected layers follow the feature extraction layers. Each neuron in the fully connected layer is connected to all neurons in the previous layer, and the layer flattens the output of the convolutional layers into a one-dimensional vector of 2 × 512. Its model is shown in Figure 10.
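The change in feature map size through the three pooling layers can be traced with a small NumPy sketch. It assumes that the convolutions use padding that preserves the spatial size, as the description above implies; the convolutions themselves are not applied here, so the channel count stays at 3 instead of growing. The function name is hypothetical.

```python
import numpy as np


def max_pool_2x2(x):
    """2 x 2 maximum pooling of an (H, W, C) array with even H and W."""
    h, w, c = x.shape
    # Group pixels into non-overlapping 2 x 2 blocks, then take the block maximum.
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))


x = np.arange(64 * 64 * 3, dtype=np.float64).reshape(64, 64, 3)
shapes = [x.shape]
for _ in range(3):
    # A size-preserving convolution would be applied here; only pooling is traced.
    x = max_pool_2x2(x)
    shapes.append(x.shape)
print(shapes)
```

Under these assumptions, the 64 × 64 input is reduced to 32 × 32, 16 × 16, and finally 8 × 8 before the features are passed on.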

Experimental Results of AB-FR.
The performance of the AB-FR model on different datasets (CASIA-FaceV5, LFW, MTFL, CNBC, and ORL) is shown in Figure 11. It can be seen that the AB-FR model proposed in this paper has a high recognition rate on the CASIA-FaceV5 dataset. Because each person in CASIA-FaceV5 has five different pictures under the same background, there are some timing characteristics. The AB-FR model adds BiLSTM, fully considering the timing characteristics between pictures. Also, the attention mechanism is added to improve the extraction of dominant features, thereby improving the accuracy of face recognition. To avoid randomness, experiments are repeated ten times for each dataset. Specifically, each dataset is divided into training samples and test samples at a ratio of 3 : 7. The best result of the ten experiments is taken as the final result. Figure 11(a) shows the Acc and loss of AB-FR on CASIA-FaceV5. Figure 11(b) shows the Acc and loss of AB-FR on LFW. Figure 11(c) shows the Acc and loss of AB-FR on MTFL. Figure 11(d) shows the Acc and loss of AB-FR on CNBC. Figure 11(e) shows the Acc and loss of AB-FR on ORL. Figure 12 shows the output of the 10 tests on different datasets as a box plot. As can be seen from the figure, the proposed model has a slightly higher result on the CASIA-FaceV5 dataset, while on the LFW dataset the result is slightly lower and the box is wider. Because BiLSTM is added to the proposed model, the bidirectional sequential features of images are better considered. The CASIA-FaceV5 dataset contains five different photos of each person, taken consecutively against the same background, so it has certain temporal characteristics. The LFW dataset, which contains more than 13,000 images from the Internet, is large and has mixed backgrounds, and most people in it have only one photo. Therefore, the model proposed in this paper has a slightly better effect on the CASIA-FaceV5 dataset and a slightly worse effect on the LFW dataset.
To verify the statistical significance of the model presented in this paper, the Kruskal-Wallis test is used. Table 4 shows the P value and Cohen's f value obtained from the Kruskal-Wallis test between AB-FR and the reference algorithms, where ***, **, and * represent significance levels of 1%, 5%, and 10%, respectively, and Cohen's f value represents the effect size. The thresholds for small, medium, and large effect sizes are 0.1, 0.25, and 0.40, respectively, and reflect the magnitude of the difference. As can be seen from the table, the P values of the test results are all less than 0.05, and Cohen's f values are all greater than 0.25. Therefore, the statistical results are significant, and there are moderate or large differences between AB-FR and the other algorithms, indicating that the proposed AB-FR performs well.
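The Kruskal-Wallis statistic itself is computed from the ranks of the pooled results. The following is a minimal pure-Python sketch of the H statistic without the tie correction; the accuracy values are made up for illustration and are not results from this paper, and the P value (from a chi-squared distribution with k − 1 degrees of freedom for k groups) and Cohen's f are not computed here.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(*groups):
    """H = 12 / (N (N + 1)) * sum(R_i^2 / n_i) - 3 (N + 1), without tie correction."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n = len(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


# Hypothetical repeated accuracies for two methods (illustrative values only).
method_a = [0.91, 0.92, 0.93]
method_b = [0.95, 0.96, 0.97]
h_stat = kruskal_wallis_h(method_a, method_b)
```

A larger H means the rank sums of the groups differ more than would be expected if all results came from the same distribution.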

Comparison with Other Algorithms.
The improved convolutional neural network algorithm proposed in this paper has a better recognition rate than other algorithms (PCA algorithm, improved PCA algorithm, Gabor + PCA algorithm, BP algorithm, traditional CNN algorithm, MobileNet, SCNN, and so on). The results are shown in Table 5, with the best results in bold. Figure 13 shows the accuracy comparison of the AB-FR algorithm proposed in this paper, the PCA algorithm, the improved PCA algorithm, the Gabor + PCA algorithm, the BP algorithm, and the traditional CNN algorithm. The ResNet + normalization + centerloss method proposed in the literature [10] focuses on normalization and loss functions, resulting in a higher recognition rate on the LFW and CNBC datasets. But it does not consider the temporal relationship between pictures, so its accuracy on CASIA-FaceV5 is not as good as that of the AB-FR proposed in this paper.
The experimental results show that the convolutional neural network structure designed in this paper has a good effect on the face recognition rate, and the results are better than those of the existing algorithms.
Figure 14(a) shows the Acc and loss charts of CNN on the CASIA-FaceV5 dataset. Figure 14(b) shows the Acc and loss charts of CNN + BiLSTM on the CASIA-FaceV5 dataset. Figure 14(c) shows the Acc and loss charts of CNN + attention on the CASIA-FaceV5 dataset. Figure 14(d) shows the comparison of various models on the CASIA-FaceV5 dataset. Figure 15 shows the comparison of the models on the LFW dataset. Figure 16 shows the comparison of the models on the MTFL dataset. Figure 17 shows the comparison of the models on the CNBC dataset. Figure 18 shows the comparison of the models on the ORL dataset. Table 6 shows the results of the ablation experiment, with the best values in bold. From Table 6 and Figures 14-18, it can be concluded that
(1) Compared with the two models of CNN + BiLSTM and CNN + attention, CNN has a good recognition effect. But its accuracy is slightly lower than that of the AB-FR model proposed in this paper. Also, it can be seen from Figure 14(a) that the CNN converges slowly for some time, showing a flat trend, and there is some vanishing gradient problem.
(2) The recognition rate of CNN + BiLSTM is low, reaching only 68.5%. It can be seen from Figure 14(b) that the CNN + BiLSTM model cannot complete the task of face recognition very well.
(3) The CNN + attention model is similar to the AB-FR model proposed in this paper, and its recognition effect and convergence speed are better than those of CNN and CNN + BiLSTM. But according to Figure 14(c), the recognition effect and convergence speed of CNN + attention are slightly inferior to those of the AB-FR model.
(4) The results of the AB-FR model proposed in this paper are the best. It extracts salient features through the attention mechanism, avoids the loss of important features, reduces network training parameters, and improves training speed. With the addition of BiLSTM, the global and local features are comprehensively considered to obtain more accurate timing, so that the network model has a fast convergence speed and a high recognition rate.
Computational Intelligence and Neuroscience

Conclusions
This paper proposes a convolutional neural network face recognition method, AB-FR, based on BiLSTM and attention mechanism, aiming at the slow convergence speed of traditional CNN and the problems of occlusion and expression changes in face recognition in practical applications. The network uses CNN as the basic network structure, extracts key feature information, and assigns important weights by adding a channel attention mechanism. At the same time, BiLSTM is used to extract image time series features, which increases the reliability and effectiveness of image feature extraction. Also, we use cross-entropy as the loss function for supervised training to improve the accuracy of the model. The experimental results show that the method proposed in this paper has a significant effect on improving the convergence speed of the model without increasing too much computational overhead. It also has a good effect on improving the accuracy of face recognition. In the future, we will consider the face recognition problem in images containing multiple faces. Also, we will consider the impact of low image quality on face recognition after data compression and decoding.

Data Availability
The datasets in the paper are all public datasets, and the acquisition methods are as follows. The CASIA-FaceV5 data used to support the findings of this study have been deposited in http://biometrics.idealtest.org/dbDetailForUser.do?id=9#/datasetDetail/9. The LFW data used to support the findings of this study have been deposited in http://viswww.cs.umass.edu/lfw/index.html. The MTFL data used to support the findings of this study have been deposited in http://mmlab.ie.cuhk.edu.hk/projects/TCDCN.html. The CNBC data used to support the findings of this study have been deposited in https://github.com/260963172/CNBC/releases/tag/cnbc. The ORL data used to support the findings of this study have been deposited in https://github.com/260963172/ORL.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.