A New Multiface Target Detection Algorithm for Students in Class Based on Bayesian Optimized YOLOv3 Model

Deep learning theory is widely used in face recognition. Motivated by the needs of classroom attendance and monitoring of students' learning status, this article analyzes the regression-based YOLO (You Only Look Once) face recognition algorithms. To address the missed detection of small targets in the YOLOv3 network structure, an improved YOLOv3 algorithm based on Bayesian optimization is proposed. The algorithm uses depthwise separable convolution instead of conventional convolution to improve the Darknet-53 backbone network, which reduces the computation and the number of parameters of the network. A multiscale feature pyramid is built, and an attention guidance module is designed to strengthen multiscale fusion so that targets of different sizes can be detected. The loss function is improved to address the imbalance between positive and negative samples and the imbalance between simple and difficult samples. A Bayesian function is adopted to optimize the classifier and improve classification efficiency and accuracy, ensuring the accuracy of small target detection. Five groups of comparative experiments are carried out on the public COCO and VOC2012 datasets and on self-built datasets. The experimental results show that the proposed improved YOLOv3 model effectively improves the detection accuracy for multiple faces and small targets. Compared with the traditional YOLOv3 model, the average mAP is improved by more than 1.2%.


Introduction
In recent years, biometric authentication has been widely used in all walks of life. There are mature technologies, such as fingerprint recognition, face recognition, and iris recognition [1], which have been applied in university classrooms. Because discipline in university classrooms is loose and mobile devices are popular, students tend to use mobile phones in class, which seriously affects the quality of learning. Face recognition technology can help improve classroom learning in two ways. On the one hand, it can record students' attendance in class. On the other hand, the face images captured in the classroom can be used to analyze the students' learning status and improve their participation in classroom learning. Therefore, it is necessary to use face recognition technology to realize class attendance and head-up rate recognition, to monitor the teaching process, and to analyze the learning and teaching situation, so as to improve teaching methods and teaching quality.
In face recognition and behavior recognition, scholars have carried out a great deal of research. At present, face detection methods are mainly divided into two categories: knowledge-based methods and statistics-based methods [2]. Both extract features of a region and make a judgment by calculating the similarity of the features or the response value of a classifier. Knowledge-based methods rely on features such as skin color, texture, structure, edge, and shape, and their recognition accuracy varies considerably across environments. Statistics-based methods have been studied in depth, including the artificial neural network (ANN), the AdaBoost method, the support vector machine (SVM) [3], the feature space method, long-term recurrent convolutional networks (LRCN), and convolutional neural networks (CNN) [4,5]. The network structures of these methods include AlexNet, VGG (Visual Geometry Group), and the Inception Net model [6]. The methods based on feature space include principal component analysis (PCA), linear discriminant analysis (LDA), and local binary pattern (LBP) [7]. A common characteristic of these methods is that they use the mapping of a vector into a feature space to distinguish face from nonface. Gabor features, histogram of oriented gradients (HOG) [8], and scale-invariant feature transform (SIFT) are also used to combine global features with effective local features to form the final features for face recognition. Face recognition has been performed on learning videos with the Fisher weighting criterion [9], and a face recognition algorithm that combines Gabor features with cooperative representation (Gabor-CRC), and that also considers speed, has been proposed [10]. In video image behavior recognition, face feature extraction and appearance expression are the important basis of the behavior recognition algorithm. Common methods of appearance expression include the contour template, optical flow, and feature points.
In candidate-region video image processing and feature extraction, the color histogram, the Haar or Haar-like feature, the HOG feature operator [11], and the wavelet algorithm [12] are usually used. Then, machine learning classifiers, such as Softmax and SVM, and boosting or random forest classification algorithms [13-15] are applied. The second category is the single-stage target detection algorithm, represented by the single shot multi-box detector (SSD) [16-19] and YOLO [20,21], which is based on regression and classification for target detection, learning from Faster R-CNN and sacrificing a little speed to further improve the accuracy. At present, YOLO has developed to YOLOv3. Compared with RetinaNet [22], the most accurate algorithm before YOLOv3, the detection speed of YOLOv3 is 3.8 times that of RetinaNet at the same detection accuracy. At 320 × 320 resolution, YOLOv3 achieves a higher mAP than SSD, reaching 28.2%, and its processing time per picture is only 22 ms, which is 3 times faster than SSD. In [23], the SSD algorithm combines the advantages of YOLO with the accurate positioning of the region proposal network (RPN), but its disadvantage is that it is slower than YOLO.
To reduce the misdetection rate of occluded targets in pedestrian detection by the YOLOv3 algorithm, the YOLOv3 network structure is improved in [24], which enhances the ability of multi-scale feature fusion. A YOLOv3 target detection algorithm based on the fusion of GIoU and Focal loss is proposed in [25]. An improved YOLOv3 network structure is designed in [26], in which the most significant part of the grid is selected through saliency mapping to detect the object. A new multi-sensor, multi-level enhanced YOLO convolution network model is proposed for robust vehicle detection in traffic monitoring [27]. The convolution layers of the micro-YOLO network model are optimized by depthwise separable convolution [28], which decomposes a complete convolution operation into a depthwise convolution and a pointwise convolution, reducing the parameters of the CNN and improving the speed of operation. A multiscale parallel network structure from dense to sparse is proposed for the detection of faces of different sizes in [29]. However, the recognition accuracy for multi-face targets in classrooms and similar places still needs to be improved, because such scenes contain many faces of different sizes and many prior bounding boxes for target detection. Although most face targets can be detected, some are still missed, which introduces errors into the recognition accuracy of the students' head-up rate in class.
In view of the above problems, this article proposes an improved YOLOv3 network structure based on Bayesian optimization for face recognition and face state analysis in class. The main contributions of this article are as follows:
(1) On the basis of the Darknet-53 network structure, depthwise separable convolution is used to improve the network structure, which lightens the model and reduces the network computation.
(2) Because students sit in different seats in the classroom, the sizes of the face targets differ, so the feature pyramid is used to extract features at different scales. In view of the diversity of face sizes in the classroom, the attention guidance module is designed for multi-scale fusion and is combined with the method of the DeepID network, so that face targets of different sizes can be recognized.

YOLO Principle
YOLO is an end-to-end convolutional neural network for target detection. Each YOLO grid cell predicts three bounding boxes, and the box with the largest intersection over union (IoU) with the current target box is taken as the prediction for the current target. The first two dimensions of the predicted output map are the dimensions of the extracted feature map. The third dimension is B × (5 + C), where B is the number of bounding boxes predicted by each grid cell, 5 comprises one confidence value plus 4 coordinates (x, y, w, h), and C is the number of classes. Bounding boxes whose confidence is below the threshold value are set to 0. The output feature map contains the parameters to be optimized by the loss function. Finally, nonmaximum suppression (NMS) is used to remove duplicate bounding boxes so that the various targets are detected, as shown in Figure 1.
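The confidence filtering and NMS step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the corner-format boxes `(x1, y1, x2, y2)`, the function names, and the threshold values 0.5 and 0.45 are assumptions chosen for illustration only.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, conf_thresh=0.5, iou_thresh=0.45):
    """Return indices of boxes kept after confidence filtering and greedy NMS.

    Both thresholds are illustrative values, not taken from the article.
    """
    # Drop low-confidence boxes, then visit the rest from highest to lowest score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= conf_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    for i in order:
        # Keep box i only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_thresh for k in keep):
            keep.append(i)
    return keep


# Toy example: two near-duplicate boxes, one separate box, one weak box.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.2]
kept = nms(boxes, scores)
```

In this toy input the second box overlaps the first with an IoU of about 0.82 and is suppressed, and the fourth box is discarded by the confidence threshold before NMS runs.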
In 2015, Joseph Redmon et al. proposed the YOLOv1 network structure [30], which drew on the ideas of GoogLeNet and includes 24 convolution layers and 2 fully connected layers. In the YOLOv1 network, the last output layer uses a linear activation function, and the Leaky ReLU activation function is used after each convolution layer and fully connected layer. Because the network structure contains fully connected layers, YOLOv1 suffers from missed detection of small targets.
To address the errors in the recall rate and positioning accuracy of YOLOv1, an improved YOLOv2 detection algorithm was proposed in 2016, which uses Darknet-19 as the feature extraction network [31]. A variety of strategies, such as anchor boxes, multi-scale training, and batch normalization, were proposed to improve the mean average precision (mAP) of the algorithm. Finally, a global average pooling layer is used to replace the fully connected layer for prediction, and the mAP of the algorithm is increased by 3.7%.
In 2018, the YOLOv3 detection algorithm was further improved [32]. Based on Darknet-53 and feature pyramid networks (FPN), multi-scale detection is carried out on feature maps at three scales, which improves the detection ability and accuracy for small targets. YOLOv3 has no fully connected layer and is a fully convolutional network. Softmax is replaced by multi-label classification. On the VOC2007 dataset, the mAP of YOLOv3 is 78.6% at a detection speed of 40 FPS (frames per second). On the COCO dataset, a detection speed of 20 FPS is maintained when the mAP reaches 57.9% (IoU = 0.5). In this article, the HD camera used for data collection has a resolution of 1280 × 960 (about 1.23 million pixels, nominally 1.3 megapixels).
YOLOv3 uses clustering to obtain anchor boxes and initializes three anchor boxes at each of three scales. During prediction, each grid cell predicts 3 bounding boxes, each of which contains 5 parameters and the probability of each category. Because the YOLOv3 network combines multi-scale features, its detection accuracy and capability for small targets are improved.
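As an illustration of how anchor boxes can be obtained by clustering, the following sketch runs K-means on (width, height) pairs. It assumes the distance d = 1 − IoU between a box and a cluster center, which is the choice made in the YOLOv2 paper; the deterministic initialization, the function names, and the toy box sizes are assumptions for illustration and are not taken from this article.

```python
def wh_iou(a, b):
    """IoU of two boxes (w, h) aligned at a common corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=50):
    """Cluster (w, h) pairs with distance 1 - IoU and return k anchors."""
    # Deterministic init (an assumption): spread centers over boxes sorted by area.
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for b in boxes:
            # Smallest 1 - IoU distance is the same as largest IoU.
            best = max(range(k), key=lambda i: wh_iou(b, centers[i]))
            groups[best].append(b)
        # New center = mean width and height of the group (kept if group is empty).
        new_centers = [
            (sum(b[0] for b in g) / len(g), sum(b[1] for b in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return sorted(centers, key=lambda c: c[0] * c[1])


# Toy face sizes in pixels: back-row, middle-row, and front-row students.
face_sizes = [(10, 12), (11, 13), (12, 11), (30, 35), (32, 33), (29, 36),
              (80, 90), (85, 88), (78, 92)]
anchors = kmeans_anchors(face_sizes, k=3)
```

With an IoU-based distance, small and large boxes contribute comparably to the clustering, which a Euclidean distance on pixel sizes would not guarantee.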
The bounding box coordinates of YOLOv3 are predicted as follows:

$$
\begin{aligned}
b_x &= \sigma(t_x) + c_x, \\
b_y &= \sigma(t_y) + c_y, \\
b_w &= p_w e^{t_w}, \\
b_h &= p_h e^{t_h}, \\
\Pr(\text{object}) \cdot \text{IoU}(b, \text{object}) &= \sigma(t_o),
\end{aligned}
\tag{1}
$$

where $\sigma$ is the sigmoid activation function; $t_x$, $t_y$, $t_w$, and $t_h$ are the prediction outputs of the model; and $c_x$ and $c_y$ are the grid cell coordinates. $p_w$ and $p_h$ are the size of the bounding box before prediction, and $t_o$ is the confidence level of YOLOv3. $b_x$, $b_y$, $b_w$, and $b_h$ are the center coordinates and dimensions of the bounding box obtained from prediction. $\Pr(\text{object})$ indicates whether there is a target in the bounding box of the current grid cell: if a target exists, the value is 1; otherwise, it is 0. $\text{IoU}(b, \text{object})$ is the IoU between the predicted box and the target. An IoU-based measure also represents the distance between a target and the center of an anchor box in the K-means clustering method used to obtain the anchors.

The YOLOv3 loss function consists of four parts: the center error of the bounding box, the width and height error of the bounding box, the classification error, and the confidence error. Among them, the center error and the width and height error of the bounding box are calculated by the sum of squared errors. Suppose there is a set of samples $y_i$ with estimated values $\hat{y}_i$, $i = 1, 2, \ldots, n$; then the sum of squared errors is

$$
\text{SSE} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2. \tag{2}
$$

The classification error and the confidence error are calculated with the binary cross-entropy loss:

$$
L_{\text{BCE}} = -\sum_{i=1}^{n} \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]. \tag{3}
$$

The width and height error of the bounding box is weighted by a scale factor $\varepsilon$, which is computed from $w$ and $h$, the width and height of the bounding box, respectively. Because the scale factor adjusts the regression loss, small targets are detected better.
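The box prediction rule described above (a sigmoid offset inside the grid cell for the center, an exponential scaling of the prior box for the size, and a sigmoid for the confidence) can be sketched in a few lines. The function and argument names are hypothetical, and the sketch works in grid units rather than pixels.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t, cell, prior):
    """Map raw outputs t = (tx, ty, tw, th, to) to (bx, by, bw, bh, confidence).

    cell  = (cx, cy): offset of the grid cell, in grid units
    prior = (pw, ph): width and height of the prior (anchor) box
    """
    tx, ty, tw, th, to = t
    cx, cy = cell
    pw, ph = prior
    bx = sigmoid(tx) + cx      # sigmoid keeps the center inside the cell
    by = sigmoid(ty) + cy
    bw = pw * math.exp(tw)     # prior size scaled exponentially
    bh = ph * math.exp(th)
    return bx, by, bw, bh, sigmoid(to)


# All-zero outputs place the box at the cell center with the prior's size.
box = decode_box((0.0, 0.0, 0.0, 0.0, 0.0), cell=(3, 4), prior=(2.0, 1.5))
```

Because the sigmoid output lies in (0, 1), the predicted center can never leave the cell that is responsible for it, which is one reason this parameterization trains stably.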

Improved YOLOv3 Algorithm
3.1. Improving the Network Structure of YOLOv3. In classroom video, the number of students is large, and the face sizes differ from scene to scene. In particular, the faces of students sitting in the back row are small and are sometimes missed. Therefore, it is necessary to further improve the YOLOv3 algorithm, on the one hand to reduce the amount of computation and, on the other hand, to improve the accuracy of face recognition. The improved YOLOv3 network structure introduces feature pyramid networks (FPN) for multi-scale detection. Deep features are used to detect large targets in the network, and shallow features are used to detect small targets, which effectively improves the detection ability for small targets. The improved network structure of YOLOv3 is shown in Figures 2 and 3.
From Figure 3, the YOLOv3 algorithm has been improved as follows:
(1) The dependence of the gradient on the parameters and on the scale of their initial values is reduced. By using batch normalization (BN), the network can be trained with a large learning rate. BN improves the generalization ability of the network, reduces the need for dropout, and optimizes the network structure. Spatial pyramid pooling (SPP) is used instead of the final average pooling, which reduces the adverse effect of average pooling on network performance.
(2) Drawing on the lightweight network model MobileNet, depthwise separable convolution (DSC) is used to reduce the parameters and computation. The width multiplier and the resolution multiplier are used to achieve an effective tradeoff between classification accuracy and speed. In view of the different sizes of different targets, and in combination with the DeepID face recognition algorithm, the attention guidance module is designed to further carry out multi-scale fusion and strengthen the relationship between features of different sizes.
(3) To address the unbalanced distribution of positive and negative samples and of simple and difficult samples, a scaling-factor hyperparameter is added and the loss function is improved.
(4) A Bayesian method is used to optimize the classifier, so as to improve the efficiency and accuracy of classification.
Journal of Electrical and Computer Engineering

Depthwise Separable Convolution.
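To simplify the network model, the width multiplier $\alpha$ is introduced to act on the number of channels and reduce the number of parameters and the amount of calculation. The number of channels $M$ in the input layer becomes $\alpha M$, and the number of channels $N$ in the output layer becomes $\alpha N$. With $D_K \times D_K$ denoting the convolution kernel size and $H \times W$ the feature size, as in the MobileNet formulation, the total calculation amount $N_{DWS\text{-}\alpha}$ and the parameter count $P_{DWS\text{-}\alpha}$ of the depthwise separable convolution are

$$
N_{DWS\text{-}\alpha} = D_K^2 \cdot \alpha M \cdot H \cdot W + \alpha M \cdot \alpha N \cdot H \cdot W, \tag{5}
$$

$$
P_{DWS\text{-}\alpha} = D_K^2 \cdot \alpha M + \alpha M \cdot \alpha N, \tag{6}
$$

where the value range of $\alpha$ is (0, 1], and the parameters and calculation amount of the model are reduced roughly by the order of $\alpha^2$. The resolution multiplier $\beta$ is introduced to act on the input features, thus reducing the amount of calculation. The input feature changes from $H \times W$ to $\beta H \times \beta W$. After introducing $\beta$, the total amount of computation of the depthwise separable convolution is $N_{DWS\text{-}\beta}$:

$$
N_{DWS\text{-}\beta} = D_K^2 \cdot \alpha M \cdot \beta H \cdot \beta W + \alpha M \cdot \alpha N \cdot \beta H \cdot \beta W, \tag{7}
$$

where the value range of $\beta$ is (0, 1], and $\beta$ is generally set implicitly through the input resolution. It can be seen from formula (7) that the calculation amount is reduced by a factor of $\beta^2$, while the parameter count is independent of this hyperparameter.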
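The effect of the two multipliers can be checked numerically. The sketch below assumes the MobileNet-style cost model for one depthwise separable layer (a k × k depthwise convolution followed by a 1 × 1 pointwise convolution); the kernel size `k`, the function names, and the layer sizes are illustrative assumptions rather than values from this article.

```python
def dsc_cost(k, m, n, h, w, alpha=1.0, beta=1.0):
    """Multiply-accumulate count and parameter count of one depthwise
    separable layer under the MobileNet-style cost model (an assumption)."""
    m_a, n_a = alpha * m, alpha * n      # thinned channel counts
    h_b, w_b = beta * h, beta * w        # reduced spatial size
    macs = k * k * m_a * h_b * w_b + m_a * n_a * h_b * w_b
    params = k * k * m_a + m_a * n_a
    return macs, params


def standard_cost(k, m, n, h, w):
    """Cost of a conventional k x k convolution with the same shapes."""
    return k * k * m * n * h * w, k * k * m * n


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 52x52 feature map.
std = standard_cost(3, 64, 128, 52, 52)
full = dsc_cost(3, 64, 128, 52, 52)
thin = dsc_cost(3, 64, 128, 52, 52, alpha=0.5)
small = dsc_cost(3, 64, 128, 52, 52, beta=0.5)

ratio = full[0] / std[0]   # equals 1/N + 1/k^2 for this cost model
```

For this layer the separable form needs roughly 12% of the multiply-accumulates of the conventional convolution. Halving β cuts the computation to one quarter and leaves the parameter count unchanged, while halving α lands between one quarter and one half because the depthwise term shrinks only linearly in α.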

Improved Loss Function.
In single-stage target detection, because the distribution of positive and negative samples is extremely unbalanced, the loss of target detection is easily swamped by the large number of negative samples. The key information provided by the small number of positive samples cannot play its normal role in the loss function, so the loss function cannot provide correct guidance for model training. Increasing the weight in the cross-entropy loss can effectively address the problem of sample imbalance. The typical cross-entropy loss, which is widely used in image classification, is given for $p \in [0, 1]$ as

$$
\text{CE}(p, y) =
\begin{cases}
-\log(p), & y = 1, \\
-\log(1 - p), & \text{otherwise},
\end{cases}
\tag{8}
$$

where $p$ represents the class probability output by the model and $y$ is the class label. Formula (8) is improved by supposing

$$
p_t =
\begin{cases}
p, & y = 1, \\
1 - p, & \text{otherwise},
\end{cases}
\tag{9}
$$

so that $\text{CE}(p, y) = -\log(p_t)$. Considering the imbalance of positive and negative samples in the dataset, the weight is increased by a coefficient that is inversely proportional to the probability of the target. To reduce the loss of simple samples automatically, a dynamic scaling-factor hyperparameter $\lambda$ is added, and the cross-entropy loss is corrected with the weight coefficient $c$, which is inversely proportional to the probability that a target exists:

$$
L(p_t) = -c \, (1 - p_t)^{\lambda} \log(p_t). \tag{10}
$$

For the Bayesian classifier, the key is first to calculate the conditional probabilities and establish the training sample set. The guiding principle of Bayesian classification is that if the probability of feature X belonging to pattern class $c_1$ is greater than that of feature X belonging to pattern class $c_2$, then the pattern is assigned to class $c_1$; otherwise, it is assigned to class $c_2$.
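To make the behavior of the improved loss concrete, the weighted and scaled cross-entropy loss described above (weight coefficient c, scaling hyperparameter λ) can be sketched for a single prediction. The default values c = 0.25 and λ = 2.0 are illustrative assumptions and are not taken from this article; c is also treated as a constant here for simplicity.

```python
import math


def cross_entropy(p, y):
    """Plain cross-entropy for one prediction p in (0, 1) with label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def scaled_cross_entropy(p, y, c=0.25, lam=2.0):
    """Cross-entropy weighted by c and down-scaled for easy samples.

    c   : weight coefficient (0.25 is illustrative, not from the article)
    lam : scaling hyperparameter lambda (2.0 is likewise illustrative)
    """
    p_t = p if y == 1 else 1.0 - p
    # (1 - p_t) ** lam is close to 0 for well-classified (easy) samples.
    return -c * (1.0 - p_t) ** lam * math.log(p_t)


easy = scaled_cross_entropy(0.95, 1)   # confident and correct
hard = scaled_cross_entropy(0.30, 1)   # positive sample scored low
easy_share = easy / cross_entropy(0.95, 1)
hard_share = hard / cross_entropy(0.30, 1)
```

With these settings the easy sample keeps only c · 0.05² of its plain cross-entropy loss, whereas the hard sample keeps c · 0.7², so training is driven mainly by difficult samples. Setting λ = 0 and c = 1 recovers the plain cross-entropy loss.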
If $P(X \mid c_1) > P(X \mid c_2)$, then $X \in c_1$, and if $P(X \mid c_1) < P(X \mid c_2)$, then $X \in c_2$. When there is an unknown data sample vector X, the Bayesian method finds the category with the maximum posterior probability. The Bayesian formula is as follows:

$$
P(c_i \mid X) = \frac{P(X \mid c_i) \, P(c_i)}{P(X)}. \tag{11}
$$

Since $P(X)$ is a constant, only $P(X \mid c_i) P(c_i)$ needs to be calculated when calculating the posterior probability. The prior is estimated as $P(c_i) = N_{c_i} / N$, where $N$ is the total number of training samples and $N_{c_i}$ is the number of training samples in category $c_i$. The conditional probability estimates of each feature attribute in each category are calculated and recorded. Then, the conditional probability estimates of each feature component under the two categories are calculated. The formula for calculating $P(X \mid c_i)$ is as follows:

$$
P(X \mid c_i) = \prod_{j=1}^{n} P(x_j \mid c_i). \tag{12}
$$

For an unknown sample X, $P(X \mid c_i) P(c_i)$ is calculated for each category. The category with the highest probability is the predicted category of sample X, that is,

$$
C = \arg\max_{c_i} \; P(c_i) \prod_{j=1}^{n} P(x_j \mid c_i). \tag{13}
$$
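The decision rule above (estimate a prior for each category, estimate the per-feature conditional probabilities, and pick the category with the largest product) can be sketched as a small counting-based classifier. This is a generic naive Bayes sketch and not the authors' classifier: the categorical toy features, the labels `c1` and `c2`, and the add-one (Laplace) smoothing are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def fit_naive_bayes(samples, labels):
    """Estimate priors P(c) and conditionals P(x_j | c) by counting,
    with add-one (Laplace) smoothing (an assumption, not from the article)."""
    n = len(samples)
    class_count = Counter(labels)
    prior = {c: class_count[c] / n for c in class_count}
    # value_count[c][j][v] = number of class-c samples whose j-th feature is v
    value_count = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)
    for x, c in zip(samples, labels):
        for j, v in enumerate(x):
            value_count[c][j][v] += 1
            values[j].add(v)

    def cond(v, j, c):
        return (value_count[c][j][v] + 1) / (class_count[c] + len(values[j]))

    return prior, cond


def predict(x, prior, cond):
    """Return the category c maximizing P(c) * prod_j P(x_j | c)."""
    def score(c):
        s = prior[c]
        for j, v in enumerate(x):
            s *= cond(v, j, c)
        return s
    return max(prior, key=score)


# Toy training set with two hypothetical categorical features per sample.
samples = [("small", "dark"), ("small", "dark"), ("large", "bright"),
           ("large", "bright"), ("small", "bright")]
labels = ["c2", "c2", "c1", "c1", "c1"]
prior, cond = fit_naive_bayes(samples, labels)
pred_a = predict(("large", "bright"), prior, cond)
pred_b = predict(("small", "dark"), prior, cond)
```

Smoothing keeps a single unseen feature value from forcing the whole product to zero, which matters when the training set for a category is small.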

Analysis of Experimental Results
To verify the feasibility of the algorithm, an experimental system is built. The operating system is Ubuntu 16.04 LTS (64 bit), the CPU is an Intel i7-9750, the graphics card is a GeForce GTX1060, and the memory is 8 GB. The Darknet learning framework is used, and the running environment of the program is Python 2.7. This article designs five groups of comparative experiments on the VOC2012 and COCO datasets and on self-built classroom datasets.

Comparative Experiments of Different Basic Networks.
For the basic network of the algorithm, the evaluation indexes are the Top-1 and Top-5 error rates. The lower the index value, the better the classification accuracy of the model; that is, the better the model.
The VOC2012 dataset is used as the test dataset, and the comparison of calculation results of different basic networks is shown in Table 1.
It can be seen from Table 1 that the error rate of the improved YOLOv3 is reduced, and the depthwise separable convolution operation greatly reduces the FLOPs and complexity without losing model capacity. Compared with GoogLeNet and Darknet-53, the Top-1 error rate of the model is reduced by 1.1% and 0.13%, respectively, the Top-5 error rate of the model is increased by 1.8% and 0.45%, respectively, and the parameters and FLOPs (floating-point operations) are 90.4% and 89.1% of those of GoogLeNet.
Taking the COCO dataset as an example, the comparison of calculation results of different methods is shown in Table 2.
It can be seen from Table 2 that, compared with other methods, the proposed method adds depthwise separable convolution and the attention guidance module, which slightly reduces the detection speed and the parameter count. However, under the same IoU, the missed detection rate of this method is the smallest, and its average accuracy is more than 1.2% higher than that of the other methods, which indicates that the method has certain advantages.
On the VOC2012 dataset, FPN comparative experiments are carried out to verify the improvement in detection brought by adding the FPN algorithm to YOLOv3. The mAP and FPS of the algorithm without FPN are 32.6 and 57.2, respectively, whereas the mAP and FPS of the algorithm with FPN are 34.8 and 55.4, respectively. When the FPN-improved YOLOv3 method is used for multi-scale detection, the addition of two scales and of the attention guidance module to assist feature enhancement improves the detection accuracy of the algorithm to a certain extent. Compared with the method without FPN, the mAP is improved by 2.2%, and the FPS is reduced by 1.8 because larger features are used, but the detection speed is still high.

Comparison Experiment of the Improved Loss Function.
The loss functions commonly used in the SSD, YOLOv1, and YOLOv2 models are used in the experiment. The improved loss function proposed in this article is tested on the VOC2012 dataset with an IoU threshold of 0.5 and a confidence of 0.5. The average accuracy of the test is shown in Table 3.
It can be seen from Table 3 that the detection accuracy index mAP of the improved loss function is higher than that of the conventional loss function and the IoU loss function used in YOLOv3. Compared with the original loss function, the mAP@0.5 and mAP of the improved algorithm are improved by 2.3% and 2.7%, respectively.
It can be seen from Figure 4 that the improved loss function tends to zero after 500 iterations, and its accuracy is higher than that of the conventional loss and the IoU loss.

Performance Comparison Experiment of Different Test Methods.
Each target detection algorithm is tested on the VOC2012 dataset and the COCO dataset, and the ROC curve and recall curve are used for comparative analysis.
It can be seen from Figure 5 that the recognition accuracy and success rate of this algorithm are higher than those of the other methods, owing to the attention guidance module and the Bayesian optimized classifier. The area under the ROC curve of this algorithm is 0.845, which is 0.015 more than that of the SSD algorithm and 0.01 more than that of the YOLOv3 algorithm. The recognition success rate is 83.7%, which is 1.5% and 0.6% higher than that of the SSD algorithm and the YOLOv3 algorithm, respectively. The recall rate is analyzed on the VOC2012 dataset, and the curve is shown in Figure 6. As can be seen from Figure 6, this method adopts the improved YOLOv3 algorithm to recognize multi-scale face targets and uses a Bayesian method to optimize the classification. As a result, its recall rate is higher than that of the other methods, by about 4% and 1.4% compared with SSD and YOLOv3, respectively. The test results on the COCO dataset are shown in Table 4.
From the experimental results in Table 5, it can be seen that the above algorithms can complete the task of student detection well. The Fast R-CNN algorithm is not accurate in positioning and misses detections. The accuracy of the YOLO algorithm is very high, but some detections are still missed. Compared with the traditional YOLO method, the Bayesian optimized YOLOv3 adds depthwise separable convolution and the feature pyramid and improves the recall rate by 1.8% and the accuracy by 2.21%, but the average detection speed is reduced by 0.4. This algorithm draws on the DeepID face detection model, which reduces the average detection time, while the accuracy and recall rate are improved, so the comprehensive performance of the improved YOLOv3 algorithm is enhanced.

Image Scaling Experiment.
First, the original image (1280 × 960) from the video sequence is compressed into a low-pixel image (320 × 240), and then the low-pixel image is reconstructed into images of size 640 × 480 and 960 × 720 by using the algorithm in this article. By comparing the number of predicted face boxes with the actual number, the face detection recall rate under different image scales is calculated.
As can be seen from Figure 7 and Table 6, the original image is 1280 × 960 pixels, and its face detection recall rate is 98.5%. The compressed low-pixel image is 320 × 240 pixels, and its face detection recall rate is only 25.6%. The experimental results show that as the image reconstruction scale increases, the face detection recall rate increases from 25.6% to 98.5%.
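The recall rate reported here is the fraction of actual faces that the detector finds. A minimal sketch of such a computation follows; it assumes a greedy one-to-one matching of predicted boxes to ground-truth boxes at an IoU threshold of 0.5, which is a common evaluation convention rather than a procedure stated in this article, and the box coordinates are toy values.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def recall(pred_boxes, gt_boxes, iou_thresh=0.5):
    """Fraction of ground-truth faces matched by a prediction (greedy, one-to-one)."""
    unmatched = list(range(len(pred_boxes)))
    hits = 0
    for g in gt_boxes:
        # Best still-unused prediction for this ground-truth face, if any.
        best = max(unmatched, key=lambda i: iou(pred_boxes[i], g), default=None)
        if best is not None and iou(pred_boxes[best], g) >= iou_thresh:
            hits += 1
            unmatched.remove(best)
    return hits / len(gt_boxes) if gt_boxes else 0.0


# Toy frame: four labeled faces, three predictions, one of them a false alarm.
gt = [(0, 0, 10, 10), (20, 20, 30, 30), (50, 50, 60, 60), (80, 80, 90, 90)]
pred = [(1, 1, 10, 10), (20, 20, 31, 31), (100, 100, 110, 110)]
r = recall(pred, gt)
```

In this toy frame two of the four faces are matched, so the recall is 0.5; the false alarm lowers precision but does not change recall.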

Experimental Analysis of Self-Built Data Set.
The test results on the self-built classroom dataset are shown in Figure 8. The experiment adopts a medium-sized classroom scene; the classroom is 18 × 9 meters and can accommodate up to 120 students. The shooting position is located at the top left of the platform, facing the students. Considering the angle, the distance between the students and the camera is about 3 m to 18 m.
During face recognition, the environment, lighting, and distance must be considered; for the students farthest from the camera, the face size is small. The size of the face has less impact on face detection but more impact on face recognition and the head-up rate, because face detection is a two-class problem, whereas face recognition is a multi-class problem. Affected by the environment, the algorithm is more complex, and the lack of information has a great impact on face feature extraction. First, face state analysis experiments were carried out in multiple groups of classrooms, with 110 people as the benchmark. The statistical results are shown in Figure 9. It can be seen from Figure 9 that only 50.91% of the total number have good detection and recognition conditions in a large classroom. Excessive illumination causes face features to be covered by strong light, which is more common in outdoor scenes; occlusion leads to incomplete face features, so that only part of the face features can be obtained. Bowed heads and small faces also account for 26%. In the five self-built classroom datasets, the classrooms hold 45 and 100 students. The actual number of students in the 45-student classroom is 37, and that in the 100-student classroom is 75. The test results are shown in Table 6.
It can be seen from Table 5 that the accuracy of this method is higher than that of the other methods. In the 45-student classroom, this method is 5.4% higher than the SSD method and 2.71% higher than the YOLOv3 method. In the 100-student classroom, this method is 5.33% higher than the SSD method and 1.33% higher than the YOLOv3 method. Although the detection speed of this method is slightly lower than that of Fast R-CNN, SSD, and YOLOv3, it is still faster than other methods.

Comparison with Other Neural Network Depth Quantization Methods.
Furthermore, to verify the training ability of the model, the image prediction method proposed in this article is compared with other model prediction methods. In automatic reinforcement learning, quantization is widely used as an important means of image compression, and the tension between bit width and accuracy always exists. Here, the Bayesian optimizer proposed in this article determines the bit width. Another problem is the selection of the quantization values, which are obtained by alternating training in LQ-Net. Deep reinforcement learning (RL) is good at mapping raw sensory input to actions [33], while AutoRL is an evolutionary automation layer around deep RL, which uses large-scale hyperparameter optimization to search for the reward and the neural network architecture. The AutoRL method has been used in long-distance robot navigation, multi-stage prediction of microgrids, and so on [34].
To verify the effectiveness of this method, the quantization bit width of automatic reinforcement learning is determined by using ReLeQ, the AutoRL framework, and this method separately. The results of the experiments on three experimental datasets are shown in Table 7.
It can be seen from Table 7 that the Bayesian optimized image-type prediction method proposed in this article can determine the quantization bit width, and its network recognition accuracy is slightly higher than that of the AutoRL method, indicating the superiority of this method in image classification prediction.
Furthermore, the extreme learning machine (ELM) is based on the feedforward neural network (FNN) [35], or is an improvement of the FNN and its backpropagation algorithm. Its characteristic is that the weights of the hidden layer nodes are random or given manually and do not need to be updated; the learning process only calculates the output weights. ELM has strong generalization ability and high accuracy in approximating datasets. It has been successfully used in the transformation from low-resolution to high-resolution images. On the COCO dataset, the model prediction accuracy of this method and of the methods described in [35] and [36] reaches 96.5%, 95.9%, and 96.3%, respectively, indicating that the image prediction method optimized by the Bayesian method can achieve high accuracy.

Conclusion
To address the low accuracy of the traditional object detection model, a new YOLOv3 model based on Bayesian optimization is proposed. Depthwise separable convolution is used to replace the standard convolution for information fusion, which reduces the number of network parameters and the amount of calculation. The feature pyramid is used to replace the fully connected layer of the network, and the attention guidance module is designed to enhance the multi-scale feature fusion ability and reduce the overfitting weight coefficient. A dynamic scaling-factor hyperparameter is added to the loss function to mitigate the imbalance between positive and negative samples, and the Bayesian method is used to optimize the classification. Compared with the traditional YOLOv3 algorithm, the recall rate of the optimized YOLOv3 model is improved from 94% to 95.8%, and the average mAP value is increased by more than 1.2%. This shows that the optimization of the training process in this article enhances the training effect of the model. The algorithm in this article improves the detection accuracy and, at the same time, the real-time detection of small multi-face targets in the classroom.

Future Prospects.
Recently, YOLOv4 and YOLOv5 have been released, and these algorithms will be the next research goal of our team. We will combine them with more efficient network structures and models for in-depth research, and the research results will be applied to classroom face recognition, driving safety behavior recognition, and other areas.

Data Availability
At present, the data are still in the experimental stage and cannot be disclosed for the time being.

Conflicts of Interest
The authors declare that they have no conflicts of interest.