Sharpness and Brightness Quality Assessment of Face Images for Recognition

Face image quality has an important effect on recognition performance. Recognition-oriented face image quality assessment is particularly necessary for screening face images of various qualities. In this work, sharpness and brightness were mainly assessed by a classification model. We selected very high-quality images of each subject and established nine kinds of quality labels related to recognition performance by utilizing a combination of face recognition algorithms, the human visual system, and a traditional brightness calculation method. Experiments were conducted on a custom dataset and the CMU Multi-PIE face database for training and testing and on Labeled Faces in the Wild for cross-validation. The experimental results show that the proposed method can effectively reduce the false nonmatch rate by removing the low-quality face images identified by the classification model and vice versa. This method is effective even for face recognition algorithms that were not involved in label creation and whose training data are nonhomologous to the training set of our quality assessment model. The results show that the proposed method can distinguish images of different qualities with reasonable accuracy and is consistent with subjective human evaluation. The quality labels established in this paper are closely related to recognition performance and exhibit good generalization to other recognition algorithms. Our method can be used to reject low-quality images to improve the recognition rate and to screen high-quality images for subsequent processing.


Introduction
Extensive research on face image quality (FIQ) has shown that the samples given as inputs to an automated recognition system influence recognition performance. Face recognition has been increasingly applied in uncontrolled environments (e.g., automated security checkpoints) where the acquired images may include blur, uneven illumination, and nonfrontal poses. Such nonideal factors can significantly decrease recognition accuracy. The most direct manifestation of this decreased accuracy is that the same recognition algorithm performs markedly differently on datasets of different qualities. For example, Aghdam et al. [1] used several models to show that the recognition performance of the same recognition model can differ by 70% or more on data of various qualities captured in the same scene. Some researchers have proposed effective methods to address the problems caused by nonideal factors in recognition. For example, Cao et al. [2] proposed a pose-robust recognition algorithm, and Fekri-Ershad [3] classified face gender to help improve the recognition rate. These methods have achieved some results. However, filtering low-quality images by face image quality assessment (FIQA) is also an important way to improve the performance of recognition systems. The quality of face images as biometric samples is closely related to the recognition result. Three characteristics of FIQ are described in standard ISO/IEC 29794 [4]: (1) the character, which indicates the attributes associated with an inherent characteristic; (2) the fidelity, which reflects the degree of similarity with the source biometric characteristic; and (3) the utility, which indicates the fitness for recognition and is influenced by the character and fidelity. FIQ is defined as a measure of the utility of a face image to face recognition systems [5][6][7]. This definition is consistent with the utility described above. 
An FIQ measure can essentially be considered a predictor of face recognition accuracy. In other words, a face image determined to be of high quality should enable recognition systems to succeed (and vice versa). The ultimate goal of FIQA is to exploit the relationship between image quality and the output of recognition algorithms. FIQA has great practical value because it can screen face images of various qualities, whether applied to real-time recognition systems online or to offline face image applications. Rejecting face images determined to be of poor quality for recognition can improve the recognition performance and simultaneously reduce the waste of face recognition system resources. Adjustment instructions can be provided to the persons being identified, or to staff, according to the quality of the image, which has guiding significance for effective dynamic adjustment of the face image acquisition environment. FIQA of images that have failed to be identified can provide feedback to recognition algorithm researchers for purposefully improving recognition performance. The development of multirecognition-algorithm systems can be promoted by selecting appropriate recognition algorithm configurations based on image quality so that the recognition system can utilize images of different qualities. Image enhancement can be promoted by selectively enhancing images or choosing different enhancement configurations for images of different qualities. In addition, FIQA can be applied to quality-based fusion, database maintenance [7], and dynamic recognition approaches [8,9].
One of the challenges of face image quality evaluation is that the FIQA output should be closely related to recognition. Recently, some studies have evaluated specific factors, such as clarity, and combined the evaluation results of each factor to obtain an overall quality (OQ) [10,11]; however, these methods are not closely related to recognition performance. Researchers have proposed deep learning methods to predict quality using the similarity score of two images of a given individual as labels [5]. Although these methods have achieved some breakthroughs, there is still a lack of identity-oriented methods that do not rely heavily on a single recognition algorithm.
In this work, experiments are conducted on a database (denoted the SC database) of images that were collected in identification channels. Because the people in this scene are ready for recognition, the captured pictures are mostly frontal portraits without occlusions but include distortion from uneven lighting and blur due to transitions between the identified persons. We mainly conduct a composite assessment of face image brightness and sharpness by supervised deep learning on these images. We also use the same method to perform experiments on the CMU Multi-PIE [12] face database (M-PIE) and cross-validation on Labeled Faces in the Wild (LFW) [13]. The main contributions of this article are as follows: (i) the effects of brightness and sharpness on recognition are verified on the M-PIE, and suitable very high-quality images (VHQIs) per subject are selected for identification using the International Standards [4,14] and human consensus [7] before image labeling. (ii) We establish brightness and sharpness labels associated with identification; as a result, the images are divided into nine categories that represent varying degrees of brightness and sharpness. (iii) A classification model is trained to predict quality based on the self-built SC database and the established quality labels, and the quality of the classified data is verified. In particular, the network structure is derived from the literature [15] and improved. Our method for establishing the labels differs from methods that use only the similarity score, which depend heavily on a single recognition algorithm, and from methods that use only subjective assessment, which deviate from recognition. The trained model can predict the class to which an image belongs, where the classes represent different levels of brightness and sharpness.
This paper is organized as follows: Section 2 surveys quality assessment methods for face images. Section 3 describes the materials and methods, including the face databases and preprocessing, the method of selecting the VHQIs and establishing quality labels, and the network structure. Experimental settings and results are provided in Section 4. Section 5 presents a concluding summary of this work and directions for future work.

Related Work
FIQA is a branch of image quality assessment (IQA) and an extension of it to face images. IQA can be subdivided into (i) full-reference (FR) [16,17], (ii) reduced-reference (RR) [18,19], and (iii) no-reference (NR) [20][21][22][23] categories according to the amount of information provided by the reference image. FIQAs also include FR-based approaches; for example, relevant literature [24][25][26] reports computing luminance distortion, structural similarity (SSIM), and probabilistic similarity to reference face images. However, FR and RR methods are not easy to apply because of the difficulty of obtaining undistorted reference images, so studies of NR-IQA are necessary. The FIQAs described below are all NR based. FIQAs can be categorized into non-deep-learning (non-DL FIQA) and deep-learning (DL FIQA) methods.
Non-DL FIQAs mostly assess specific factors, such as sharpness, occlusion, pose, symmetry, expression, illumination, and resolution by defined methods. One of the early studies proposed by Gao et al. [27] demonstrated the assessment of symmetry for light and pose, eye distance, illumination, contrast, and blur. Another method for evaluating symmetry proposed by Zhang and Wang [10] is based on local scale invariant feature transform (SIFT) features. Sang et al. [28] also evaluated symmetry through illumination and pose based on a Gabor filter and measured blur by a discrete cosine transform (DCT) and inverse DCT. In the literature [29], researchers have employed DCT to evaluate sharpness. Nasrollahi et al. [30] utilized the least out-of-plane rotated (LOPR) faces method to evaluate posture. Furthermore, overall quality (OQ) is always obtained by combining the evaluation results of each factor. Nasrollahi and Moeslund [30] also measured the illumination, blur, and resolution and performed weight fusion to obtain the OQ. A similar method exists in the literature [31]. Chen et al. [32] divided images into three categories: nonface images, unconstrained face images, and identification (ID) card face images, and assumed that these ranks gradually increase. Rank-based OQ normalized to [0, 100] is acquired by five feature fusions and is applied to learn rank weights.
DL FIQAs have emerged in recent years and are almost all supervised. Zhang et al. [33] created the Face Image Illumination Quality Database based on human assessments and trained a model based on ResNet-50 [34]. The experimental results show that the predicted illumination quality is closely related to the human-defined labels but lacks a relationship with recognition performance. Best-Rowden and Jain [5] established quality labels for the LFW training database through two methods: human assessments and matcher-dependent scores. Given the established target face quality values, a support vector model was trained on face features extracted by a convolutional neural network (CNN) to predict the quality of face images. Yu et al. [11] synthesized 5 degradations (nearest-neighbor downscaling, Gaussian blur, additive white Gaussian noise, salt-and-pepper noise, and Poisson noise) with 3 configurations on CASIA-WebFace [11] and trained a classification model on 16 classes of images, including the original unmodified images and 15 synthetic degradation classes. OQ scores were obtained by pooling the 16 products of the image degradation classification confidence and the face recognition accuracy under the corresponding degradation. In the literature [35], a two-stream CNN named "deep face quality assessment (DFQA)" was proposed. Yang et al. [35] divided the quality scores into 5 segments, categorized by angle, clarity, illumination, visibility, expression, etc., and established manual labels for 3000 images from ImageNet to train a pretrained SqueezeNet model. The DFQA was then trained to predict OQ scores on the MS-Celeb-1M [36] dataset with quality labels produced by the pretrained model. Hernandez-Ortega et al. [37] proposed FaceQnet, based on ResNet-50, for quality learning on a 300-subject subset of VGGFace2 [38]. 
The quality labels in that experiment are comparison scores derived from multiple feature extractors between the probe images and high-quality images selected by the BioLab-ICAO (International Civil Aviation Organization) framework. Zhang et al. [39] and Zhuang et al. [40] utilized a multitask structure with several factor labels and OQ labels that were established by humans and a related algorithm for 3000 images from the Intelligence Advanced Research Projects Activity (IARPA) Janus Benchmark-A [41] (IJB-A) dataset. The features extracted from the front shared layers serve as the subtask-layer inputs for predicting various quality factors, such as pose. The outputs of the subtasks are fused via fully connected layers to produce an OQ score. Unsupervised methods, including SER-FIQ [42] and MagFace [43], have emerged in the last two years. SER-FIQ uses stochastic embedding robustness to estimate face image quality. MagFace obtains quality scores by learning a universal representation for face recognition and quality assessment.
In this work, we combine the similarity scores of face recognition algorithms, the definition-grade classification method of the human visual system, and the traditional brightness classification method to establish brightness and sharpness labels associated with identification. Furthermore, given the FIQ labels established for a self-built database, we train a classification model based on MobileNetV3 [15], which can predict a face OQ that simultaneously represents the brightness and sharpness ranks. To our knowledge, this is the first attempt to combine recognition performance with human assessments for FIQ labels.

Face Databases and Preprocessing.
This work utilized three face databases: the M-PIE, a self-built SC database, and the LFW. The M-PIE was collected in an environment with strict lighting, posture, and expression control across four sessions over a five-month period; these data consist of 337 subjects and more than 750,000 high-resolution face images.
The SC database consists of approximately 5000 face images of 945 subjects selected from the identification channels of Wisesoft Co., Ltd. The subjects were employees of the company and agreed to the use of their images in this study. The specific screening methods are described later. The images in the LFW were derived from natural scenes, with a total of 13,233 images of 5,749 subjects, of which more than 70% of subjects have only one image. The M-PIE contains face images that were acquired under the condition that only one factor changes while the other factors remain optimal. For example, when capturing images under different lighting conditions, the face remained in a frontal posture with a neutral expression. We extended the M-PIE to 9 classes of data similar to the SC database. The models were trained using the SC database and the M-PIE and then evaluated on the LFW and on subsets of the SC database and M-PIE disjoint from the training set. The prediction results on the LFW were used to examine how the evaluation results correlate with the human visual system and with recognition performance.
In this work, all images were detected, and five key points (the two pupils, the nasal tip, and the two corners of the mouth) were marked by a model based on the multitask convolutional neural network (MTCNN) [44] that included only convolution layers in the first stage; thus, the input of the model was not limited to a fixed size. MTCNN adopts three cascaded networks: the proposal network (P-NET) for rapidly generating candidate windows, the refine network (R-NET) for high-precision candidate window filtering and selection, and the output network (O-NET) for generating the final bounding boxes and face key points. O-NET performs a regression task that minimizes the Euclidean loss between the predicted facial landmark coordinates ŷ_i^landmark obtained from the network and the ground-truth coordinates y_i^landmark for the i-th sample:

L_i^landmark = ‖ŷ_i^landmark − y_i^landmark‖₂². (1)
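As an illustration, the per-sample landmark loss can be sketched in a few lines of Python (a toy version with flat coordinate lists; the point values below are ours, not from the paper):

```python
def landmark_euclidean_loss(pred, truth):
    """Squared Euclidean (L2) loss between predicted and ground-truth
    landmark coordinates, each given as a flat [x1, y1, ..., x5, y5] list."""
    assert len(pred) == len(truth)
    return sum((p - t) ** 2 for p, t in zip(pred, truth))

# Five key points: two pupils, nasal tip, two mouth corners (illustrative values).
pred = [30.0, 40.0, 70.0, 40.0, 50.0, 60.0, 35.0, 80.0, 65.0, 80.0]
truth = [31.0, 40.0, 70.0, 41.0, 50.0, 60.0, 35.0, 80.0, 65.0, 80.0]
loss = landmark_euclidean_loss(pred, truth)
```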

In the literature [45], Best-Rowden divided FIQ into three scenarios: (i) whether an image contains a face, (ii) evaluation of the accuracy of face alignment, and (iii) the quality of an aligned face image. We discuss the third scenario. Face images need to be preprocessed to align the faces as much as possible in the process of face recognition. Evaluating the input images of face recognition systems has more practical significance; therefore, it is necessary to carry out the same image preprocessing as that used for recognition in FIQA. Based on the detected key points, the preprocessing was as follows: first, the midpoint P1 between the two pupils and the midpoint P2 between the two corners of the mouth were found. Then, we connected P1 and P2 to obtain line segment L, calculated the angle θ between L and the vertical line as the rotation angle, and rotated the face clockwise or counterclockwise by θ so that all faces had the same in-plane posture. Finally, the face image was magnified or reduced to the specified size; specifically, we scaled each image to 150 × 150 pixels.
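The rotation-angle step can be sketched as follows. This is a minimal geometric illustration (the function name and sample coordinates are ours); the actual rotation and resizing would typically be done with an image library such as OpenCV:

```python
import math

def alignment_angle(left_pupil, right_pupil, left_mouth, right_mouth):
    """Angle (degrees) by which to rotate the face so that segment L,
    from the pupil midpoint P1 to the mouth midpoint P2, becomes vertical."""
    p1 = ((left_pupil[0] + right_pupil[0]) / 2, (left_pupil[1] + right_pupil[1]) / 2)
    p2 = ((left_mouth[0] + right_mouth[0]) / 2, (left_mouth[1] + right_mouth[1]) / 2)
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    # atan2(dx, dy) measures the deviation of L from the vertical axis;
    # the sign decides clockwise vs. counterclockwise rotation.
    return math.degrees(math.atan2(dx, dy))

# Upright face: P1 and P2 share an x-coordinate, so no rotation is needed.
upright = alignment_angle((30, 40), (70, 40), (35, 80), (65, 80))
# Tilted face: the mouth midpoint is shifted to the right of the pupil midpoint.
tilted = alignment_angle((30, 40), (70, 40), (45, 80), (75, 80))
```

After rotating by this angle, the crop would be scaled to 150 × 150 pixels as described above.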

Brightness and Sharpness Factor Verification.
This paper focuses on the brightness and sharpness of the image, and we use specific data to illustrate the degree to which these two factors influence recognition before introducing the method for establishing quality labels. The M-PIE contains images taken under 19 lighting conditions in which the other quality factors are optimal (see Figure 1). This database is suitable for verifying the influence of individual factors on identification and was therefore chosen to verify the effect of sharpness and brightness on recognition. We tested the recognition of images in different lighting environments with a classical face recognition algorithm (FRA-A) based on a Light CNN-9 [46] with max-feature-map (MFM) units. The human visual system is very accurate at recognizing people, even better than current state-of-the-art recognition systems [47,48]. Similarly, some studies [5,31,49] have verified the usability of the human visual system in FIQA. In the following work, we used human recognition to assist in the selection of images and the establishment of labels. The M-PIE does not contain overexposed ("off-light") images. Therefore, we brightened some of the images by a power-exponential operation to verify the quality of such images. Gamma (G) parameters of 0.14 and 0.28 were selected to create augmented images called Bri0* and Bri1*, respectively (see Figure 2). To minimize the error caused by labeling single images based on the similarity scores (SS) determined for a pair of images, it is usually necessary to select suitable VHQIs per subject for identification or verification. It is also necessary to choose the images with the most appropriate brightnesses as the VHQIs and then to test the verification accuracy (VA) of images with different brightnesses.
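The power-exponential brightening can be sketched as follows (a minimal pure-Python version operating on a list of 8-bit gray values; real code would apply it per channel to image arrays):

```python
def gamma_brighten(gray_values, g):
    """Scale each 8-bit pixel R to [0, 1], apply E = R ** G, and map back
    to [0, 255]. G < 1 brightens the image; G = 0.14 and G = 0.28 were
    used to synthesize Bri0* and Bri1*."""
    return [round(((v / 255.0) ** g) * 255) for v in gray_values]

brightened = gamma_brighten([0, 64, 255], 0.5)  # mid-tones are lifted
```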
According to the given brightness indicator (the last two digits of the image file name) of the M-PIE and human visual perception, bright images (filenames "06∼08," named Bri2) with high VAs were selected. Samples of darker images were gradually added to the previously selected samples. After the addition of some dark images, the VA changed very little, so we added more dark samples, simultaneously carried out the FRA-A test, and obtained the results in Table 1. The composition of each type of test image is as follows: Bri2 (06∼08), Bri3 (05∼09), Bri4 (05∼09 and 15∼17), Bri5 (04∼11 and 14∼18), Bri6 (02∼18), and Bri7 (00∼19). Table 1 shows the VAs for these types of images, where Bri1 represents the two types of images Bri2 and Bri1*, and Bri0 represents the three types of images Bri2, Bri1*, and Bri0*. The VA of Bri2 was the peak and was clearly higher than the VA of Bri1. The VA of Bri3 was very close to that of Bri2, while the VA of Bri4 decreased significantly compared with Bri3.
Thus, Bri3 was chosen as the VHQI. Images of similar brightnesses, marked as Bri4* (15∼17), Bri5* (04, 10∼11, 14, and 18), Bri6* (02 and 03), and Bri7* (00, 01, and 19), were paired with the same person in the VHQI before testing. When comparing model performance, the larger the area under the receiver operating characteristic (ROC) curve (AUC) is, the better the model performs. With the recognition algorithm unchanged, the AUC was proportional to the quality of the images. Figure 3 shows the ROC curves for different brightnesses under FRA-A. Because luminance was the only distortion, several types of data had good and very similar recognition rates. To show the classification effect of each type of data more clearly, the vertical and horizontal coordinates were adjusted. Figure 3 shows that the recognition performance of the Bri3 images is the highest and that the recognition rate decreases gradually with brightening and dimming.
To study the influence of sharpness on the recognition rate, we synthesized four degrees of blurred images (Blu1∼Blu4) with motion blur and tested the images. Blur was added by convolving an image with a kernel, which was obtained by an affine transformation of the rotation matrix generated from the size of the kernel (K) and the rotation angle (45°). The original image and the composite images are shown in Figure 4. Table 2 and Figure 5 show the test results. With the reduction in clarity, the recognition rate of each type of data decreased significantly, and Blu4 was completely unsuitable for recognition.
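A simple stand-in for the 45° motion-blur kernel is sketched below. The paper builds its kernel via an affine transformation of a rotation matrix (e.g., OpenCV-style getRotationMatrix2D/warpAffine); this sketch instead writes the normalized 45° line directly:

```python
def motion_blur_kernel_45(k):
    """k x k motion-blur kernel whose weight lies on the 45-degree
    anti-diagonal, normalized so the weights sum to 1. Convolving an
    image with this kernel smears it along the diagonal; larger k
    (the K parameter) gives stronger blur."""
    kernel = [[0.0] * k for _ in range(k)]
    for i in range(k):
        kernel[i][k - 1 - i] = 1.0 / k
    return kernel

kernel = motion_blur_kernel_45(5)
```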

VHQI of Each Subject.
In this work, we selected the VHQIs of each subject in the SC database, with high definition, suitable brightness, no occlusions, frontal poses, and neutral expressions, using face recognition algorithms, the human visual system, and traditional brightness calculation methods. The specific process is as follows: low-quality face images with interference factors (a nonfrontal pose, an occlusion, or a nonneutral expression) were excluded from the original image set Q0 as much as possible through human visual inspection to obtain image set Q1.
High-definition images, denoted by Q2, were manually screened from image set Q1 by two persons. The specific screening principle was based on the absolute scale of the subjective evaluation method [50], as shown in Table 3.
Images rated 5 points were classified into Q2.
Figure 2: raw image and brightened images (G = 0.14, G = 0.28). Each pixel of the original image is scaled to [0, 1] and then transformed by E = R^G; the gamma parameters 0.28 and 0.14 were selected to produce brightness intervals that can be discerned by the human visual system.

Images of different brightnesses in Q2 were selected for testing to determine the appropriate brightness. We cropped the 96 × 96 face area from the center of each face image to reduce the background influence. The brightness of the face area was determined by the distribution of gray values. Assume that the number of pixels whose gray value was between v and v1 (where v < v1) was m and that the total number of pixels was n. Pi = m/n was the proportion of pixels in the defined brightness interval. The values of v, v1, and Pi were the brightness parameters to be determined. An initial brightness range [k0, k1] and Pi were chosen manually. Then, we continuously adjusted the brightnesses of the images, assuming that the adjusted brightness range was [k0−p, k1+q], where p and q were positive parameters. After each adjustment, a face recognition test was carried out on multiple images of the same person under a certain brightness, and the VAs va1, va2, va3, . . ., van (where n here denotes the number of adjustments) were obtained at an FAR of 0.01% for the different brightnesses. The appropriate brightness appeared when the VA began to change evidently; that is, it was the brightness for which the difference between the VA and all previous VAs was less than or equal to α and the difference between the VA and all subsequent VAs was greater than or equal to β (α < β).
The visual explanation is shown in Figure 6, in which the VA and its changing trend are hypothetical, for interpretation only. We obtained high-resolution images with good brightness (v is 90, v1 is 200, and Pi is 0.65) and called them Q3. The VHQIs of each subject were obtained by testing multiple images of the same subject in Q3 and selecting the top 80% of image pairs ranked by similarity score among those whose scores were higher than the corresponding threshold value at an FAR of 0.01%. As we expected, the SS of the image pairs were almost all higher than this threshold value. The flowchart for determining the VHQI(s) is shown in Figure 7.
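The gray-value-distribution check can be sketched as follows (pure Python on a flat list of gray values; the default parameter values are the ones reported above, and the function names are ours):

```python
def brightness_proportion(gray_values, v, v1):
    """P_i = m / n: the fraction of pixels whose gray value lies in [v, v1]."""
    m = sum(1 for g in gray_values if v <= g <= v1)
    return m / len(gray_values)

def has_good_brightness(gray_values, v=90, v1=200, p_i=0.65):
    """Accept a 96 x 96 face crop when the in-range pixel fraction reaches
    P_i (v = 90, v1 = 200, P_i = 0.65, as determined in the search above)."""
    return brightness_proportion(gray_values, v, v1) >= p_i
```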

Establishment of Quality Labels.
On the basis of establishing the VHQIs, we employed FRAs trained on a self-built database of four million images, together with manual evaluation, to establish quality labels for the SC database. The above experiments demonstrate that recognition is more sensitive to sharpness, so we used the recognition rate to assist in the classification of sharpness. Then, we classified the brightness of the data at each sharpness level. The dark range (v was 0, v1 was 80, and Pi was 0.75) and the bright range (v was 150, v1 was 255, and Pi was 0.75) were determined in a similar way. The differences from the above method were that the selected images of different brightnesses were formed into positive pairs with the standard images and that p and q were determined according to the recognition results.
Face image classification should have a corresponding significance in the categorization of biometric sample quality. The quality of biometric samples can be divided into three categories: (i) low-quality samples (LQS), which cannot be used for identification or may produce poor identification results; if possible, these samples should be replaced with high-quality samples; (ii) medium-quality samples (MQS), which may yield good certification results in most environments, although requirements-based applications still need high-quality samples; and (iii) high-quality samples (HQS), which can produce good certification results under any circumstances. The face images without VHQIs were divided into three categories according to the SS, where each category carries the corresponding significance of a biometric sample quality category. If a subject has multiple VHQIs, the similarity score of an image being labeled is the average similarity over all corresponding VHQIs. Images with SS below threshold 1 (T1) are defined as LQS. Images with SS above threshold 2 (T2) and below threshold 3 (T3) represent MQS. Images with SS above T3 are HQS. These three thresholds were obtained from FRA-A and FRA-B. FRA-B is a commercial face matcher used in the identification channels. We use A, B, and C to represent the thresholds for FRA-A at 1%, 0.1%, and 0.01% FAR, respectively, and L, M, and N as the corresponding FRA-B counterparts. To ensure that the established labels do not rely heavily on a single recognition algorithm, we used threshold values that combine the two algorithms to classify the images. Different SS may be obtained for the same pair of images when features are extracted by different recognition algorithms; therefore, the similarity scores of image pairs obtained from FRA-A were multiplied by N/C for normalization. In this work, T1, T2, and T3 are 0.4, 0.54, and 0.65, respectively, after this transformation. 
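The labeling rule can be sketched as follows; the scale argument stands in for the N/C normalization of FRA-A scores, and the values in the tests are illustrative only:

```python
def quality_category(scores_vs_vhqis, t1=0.4, t2=0.54, t3=0.65, scale=1.0):
    """Assign a biometric-quality category from the mean similarity score
    of an image against all VHQIs of its subject. T1, T2, and T3 follow
    the values in the text; scores in [T1, T2) fall into the deliberate
    separation gap and receive no label here."""
    ss = scale * sum(scores_vs_vhqis) / len(scores_vs_vhqis)
    if ss < t1:
        return "LQS"
    if t2 <= ss < t3:
        return "MQS"
    if ss >= t3:
        return "HQS"
    return "unlabeled"
```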
The boundary between the MQS and HQS was set with a certain interval to make the two types of samples more distinguishable. Each class of images was screened by the human visual system with the definition refinement criteria shown in Table 4.
Finally, each of the above three categories was divided into three subcategories based on the brightness ranges defined above. From the images remaining after the bright and dark categories were selected, we kept as the appropriate-brightness class only those images with a certain brightness margin from both the bright and dark ranges, which is consistent with the brightness levels of the previous selection criteria. On the basis of establishing the VHQIs, the flowchart for establishing these labels is shown in Figure 8. The face images were thus divided into nine categories. In the next section, M-PIE data are synthesized to simulate these nine types of data.

Network Structure.
Based on the quality labels established for the dataset above, we trained a classification model to predict image quality. Given the great achievements of deep learning in computer vision, we also adopted this approach. An important application of FIQA is embedding it into a real-time face recognition system to improve the recognition or verification performance. FIQA therefore has to be very efficient; otherwise, it would not make sense to use FIQA in real-time face recognition systems. This efficiency covers both model storage and prediction speed. The model storage problem is that a large number of weight parameters induce high requirements on device memory. The speed problem is mainly due to poor processor performance or high computational requirements. Lightweight classification networks thus became our primary choice for efficiency, and we adopted the lightweight MobileNetV3 to predict FIQ. MobileNetV3 is created through a combination of network design and automated search algorithms, including network architecture search (NAS) and the NetAdapt algorithm, and achieves higher accuracy while reducing latency for classification. MobileNetV3 is ameliorated from MobileNetV2 [51] and includes a resource-efficient block with inverted residuals and linear bottlenecks. These improvements were realized by redesigning expensive layers, introducing a new nonlinearity, and adding a squeeze-and-excite (SE) submodule [52]. The initial number of filters decreased from 32 to 16, the last few layers of the network were removed, and their positions were changed to maintain accuracy and reduce latency. The hard version of swish (h-swish) was proposed and used in the second half of the model to reduce computational cost. Swish and h-swish are defined as swish(x) = x · σ(x) and h-swish(x) = x · ReLU6(x + 3)/6, respectively.
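The swish and h-swish activations can be written directly in code (a standalone sketch; σ is the logistic sigmoid and ReLU6 clamps its argument to [0, 6]):

```python
import math

def swish(x):
    """swish(x) = x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def h_swish(x):
    """h-swish(x) = x * ReLU6(x + 3) / 6, a piecewise-linear
    approximation of swish that avoids the exponential."""
    return x * min(max(x + 3.0, 0.0), 6.0) / 6.0
```

For x ≥ 3, h-swish is exactly the identity, and for x ≤ −3, it is exactly zero, which is what makes it cheap on mobile hardware.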
The SE submodule, with its reduction fixed at 1/4 of the number of channels, was added after the depthwise (DW) convolution. Two MobileNetV3 models, MobileNetV3-Large and MobileNetV3-Small, were created for high- and low-resource use cases, respectively. FIQA preferably has a fast response time, so MobileNetV3-Small was chosen for this work. In the literature [53], inspired by network pruning, Xu et al. proposed the IdleBlock, which creates a larger receptive field, and introduced a hybrid composition of IdleBlocks with normal blocks that have constrained input and output dimensions. It was shown that hybrid composition networks with IdleBlocks are more efficient and able to both reduce computation and achieve real-world speed increases. The IdleBlock is implemented by a simple pruning method that passes a subspace (C · α channels, where α is between 0 and 1) of an input with C channels through untransformed and concatenates it with the remaining subspace (C · (1 − α) channels) after transformation. An illustration featuring an inverted residual block (MBBlock) is shown in Figure 9. In this work, we replaced two MBBlocks with IdleBlocks with half-pruned channels. The last two layers were replaced by fully connected layers. The architecture of MobileNetV3-Small with IdleBlock is shown in Figure 10.
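The channel-idling idea can be sketched with a toy list-of-channels version (a framework-free illustration of our understanding; in a real network, transform would be the block's convolutional path and the concatenation would operate on feature-map tensors):

```python
def idle_block(channels, transform, alpha=0.5):
    """Conceptual IdleBlock: a fraction alpha of the C input channels is
    passed through untouched (the 'idle' subspace, C * alpha channels);
    the remaining C * (1 - alpha) channels are transformed; and the two
    subsets are concatenated to restore the original width."""
    cut = int(len(channels) * alpha)
    idle, active = channels[:cut], channels[cut:]
    return idle + [transform(c) for c in active]

out = idle_block([1, 2, 3, 4], lambda c: c * 10)  # half-pruned: alpha = 0.5
```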

Results and Discussion
In this section, a series of experiments is conducted to verify the effectiveness of the proposed method. We demonstrate the performance of the classification model using various classification evaluation indicators, report the rationality of the established labels and the robustness of the proposed FIQA method for different FRAs, and finally conduct a cross-validation experiment on the LFW.

Synthesizing Data on the M-PIE.
To more transparently explain the feasibility of the label-establishment method and the performance of the assessment method, we synthesized images with varying degrees of brightness and sharpness by adjusting brightness and implementing the blur methods mentioned above. Similar to the SC database, the data selected from the M-PIE contain frontal poses and neutral expressions. Three types of blurred images (L_1blur, M_2blur, and H_3blur) were obtained by setting the K parameter to 12 and 20. The appropriate, bright, and dark brightness ranges correspond to the brightnesses in Bri3, Bri1*, and Bri7*, respectively. Specifically, fewer Bri7* images are available than images at the other two brightness levels, so the Bri6* and number "18" data were dimmed to Bri7* brightness. We applied the method described in the previous section to establish labels and screened a total of approximately 15,000 images, including approximately 3,000 "3nor" images and 1,500 images from each of the other 8 categories. The 9 types of synthesized and labeled M-PIE images are shown in Figure 11.
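The synthesis of quality variants can be sketched as follows (a hedged illustration: the paper's actual blur methods and the meaning of the K parameter are defined earlier in the paper, so a simple odd-sized box blur and a linear brightness gain stand in for them here, operating on a grayscale image stored as a list of rows):

```python
def box_blur(img, k):
    """Blur a 2-D grayscale image with a k x k mean filter
    (k odd), replicating edge pixels."""
    assert k % 2 == 1, "kernel size must be odd"
    h, w, r = len(img), len(img[0]), k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy = min(max(y + dy, 0), h - 1)  # clamp to edges
                    xx = min(max(x + dx, 0), w - 1)
                    acc += img[yy][xx]
            out[y][x] = acc / (k * k)
    return out

def adjust_brightness(img, gain):
    """Scale pixel intensities by `gain` and clip to [0, 255],
    producing brighter (gain > 1) or darker (gain < 1) variants."""
    return [[min(max(p * gain, 0.0), 255.0) for p in row] for row in img]
```

Larger kernels yield heavier blur (toward L_1blur), and gains far from 1 push images toward the Bri1*/Bri7* brightness extremes.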

Training Setup.
In our implementation, hardware with 4 GeForce RTX 2080 Ti GPUs was used for accelerated training, and the PyTorch deep learning framework was adopted under the Ubuntu 16.04 operating environment. Stochastic gradient descent with a momentum of 0.9 was chosen as the optimizer. The learning rate was initialized to 0.01, with a batch size of 256, and attenuated to 1e−5 according to the adjustment strategy. The input image size was fixed at 96 × 96, and the input images were preprocessed in the same way as the inputs of the chosen FRAs. Eighty percent of the labeled SC and CMU datasets were used as training sets, and the rest were used for testing. All models were trained for 100 epochs.
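The stated schedule (an initial rate of 0.01 attenuated to 1e−5 over 100 epochs) can be realized in several ways; since the exact adjustment strategy is not specified, the sketch below assumes a smooth exponential decay between the two stated endpoints (in PyTorch this would typically be wired up via torch.optim.lr_scheduler):

```python
def lr_at_epoch(epoch, total_epochs=100, lr_init=0.01, lr_final=1e-5):
    """Exponential decay from lr_init at epoch 0 to lr_final at the
    last epoch. The precise schedule used in the paper is an
    assumption here; only the endpoints are stated."""
    if total_epochs <= 1:
        return lr_final
    return lr_init * (lr_final / lr_init) ** (epoch / (total_epochs - 1))
```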

Classification Model Results on the SC Database.
The easiest way to evaluate a classification model is to calculate the accuracy, i.e., the percentage of correct predictions among all samples. The accuracy rates of the trained models, called FBSA_M (face brightness and sharpness model) and FBSA_M1, using MobileNetV3-Small on the SC database without and with IdleBlock, were 89.83% and 90.87%, respectively. Since accuracy is not a comprehensive evaluation index, we also calculated the precision and recall. The precision is the probability that samples predicted to be positive are actually positive. The recall is the proportion of true positive samples that are correctly predicted to be positive. The precision and recall of each class are shown in Table 5, which shows that the three classes of L_1blur images have the best classification results. The results for the three types of M_2blur images are relatively poor, possibly because the similarity score interval for the three types of H_3blur images is small, making classification difficult. Furthermore, we illustrate the confusion matrix of the classification results in Figure 12 to show the class to which each sample was assigned. Overall, most images were correctly classified. The misclassified samples were generally grouped into adjacent categories with the same degree of either blur or brightness. The number of samples incorrectly predicted to be L_1blur was extremely small, which indicates that the model remains effective when used to screen out low-quality images (L_1blur).
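These metrics follow directly from the confusion matrix; a minimal self-contained sketch of how per-class precision and recall are derived from predictions:

```python
def per_class_precision_recall(y_true, y_pred, n_classes):
    """Build a confusion matrix (rows: true class, columns:
    predicted class) and derive per-class precision and recall.
    Precision divides each diagonal entry by its column sum
    (predicted positives); recall divides it by its row sum
    (actual positives)."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    precision, recall = [], []
    for k in range(n_classes):
        tp = cm[k][k]
        predicted_pos = sum(cm[i][k] for i in range(n_classes))
        actual_pos = sum(cm[k])
        precision.append(tp / predicted_pos if predicted_pos else 0.0)
        recall.append(tp / actual_pos if actual_pos else 0.0)
    return cm, precision, recall
```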

Classification Model Results on the M-PIE.
On the M-PIE, the classification accuracy of the model was 99.00% without IdleBlock and 99.51% with IdleBlock (FBSA_M2). The classification performance is very good, so we do not show the corresponding precision, recall, and confusion matrix. The performance on the M-PIE is better than that on the SC database, probably because the M-PIE data were collected in a controlled environment and each kind of synthesized data has excellent consistency, resulting in easier classification.

FIQA Performance.
The error-versus-reject curve (ERC) [54] proposed by Grother and Tabassi is often used to evaluate FIQA performance. In this method, FRAs are used to determine whether image pairs match, and FIQA is used to predict the quality of the images for later filtering. First, an error rate is selected based on a fixed threshold on the similarity scores. The error rate is then recalculated after removing images whose quality scores, as predicted by an FIQA model, fall below an ever-increasing quality threshold. Finally, the quality threshold is taken as the abscissa and the recalculated error rate as the ordinate to draw the ERC. The quality labels established for images in this paper are category labels, with each category representing a different quality. Therefore, when drawing the ERC, we chose to remove a given type of image rather than selecting quality thresholds and removing images below those thresholds. This curve reveals the reasonability of our labels and the performance of model classification. To verify that the trained quality prediction model remains effective for other recognition algorithms, we chose four algorithms to verify the quality assessment effect: a LightCNN-9 plus residual layer network (FRA-C), LightCNN-29 (FRA-D), IR-50 [55] (FRA-E), and IR-152 [55] (FRA-F). FRA-C was trained on the same training set as FRA-A, while FRA-D, FRA-E, and FRA-F were trained on MS-Celeb-1M. We chose the false nonmatch rate (FNMR) as the error rate, with initial values of 0.20 and 0.35, to show the relationship between the predicted quality and recognition performance for all the FRAs mentioned in this paper.
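The category-removal variant of the ERC used here boils down to recomputing the FNMR on the genuine pairs that survive filtering; a minimal sketch (toy similarity scores and labels, not the paper's data):

```python
def fnmr(scores, threshold):
    """False nonmatch rate: the fraction of genuine-pair similarity
    scores that fall below the match threshold."""
    return sum(s < threshold for s in scores) / len(scores)

def fnmr_after_removal(scores, labels, removed_label, threshold):
    """Recompute the FNMR after discarding every genuine pair whose
    image was assigned `removed_label` by the quality model."""
    kept = [s for s, l in zip(scores, labels) if l != removed_label]
    return fnmr(kept, threshold) if kept else 0.0
```

Removing a low-quality category should drive the FNMR down, while removing the best-quality category should drive it up; the curves in the following subsections follow exactly this logic.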

Performance of FBSA_M1 on the SC Database.
The resulting ERCs upon removing each type of image according to the predicted labels on the SC database are shown in Figure 13. On the whole, the three categories of images (L_1blur, M_2blur, and H_3blur) exhibit certain degrees of discrimination. When the threshold was equal to 0.2 FNMR, the L_1blur images exhibited a good distinction from the other 6 categories of images for all FRAs. The 3 categories of M_2blur are similar to the results of "3dark" and "3bri." This similarity may be partly due to the mutual classification error between M_2blur, "3nor," and "3bri," resulting in some of the predicted M_2blur categories being greater than the required threshold, and likewise, the opposite for "3bri" and "3nor." It may also be partly because the numbers of these classes below the threshold are similar. However, the average score of M_2blur was lower than that of "3dark" or "3bri," so when the threshold was 0.35, M_2blur and H_3blur were clearly distinguished. For both thresholds, the FNMRs of all 6 FRAs decreased significantly after the removal of the three types of L_1blur images. This result indicates that these three types of images hinder recognition, even for FRA-C, FRA-D, FRA-E, and FRA-F, which did not participate in the establishment of the quality labels. After the three types of H_3blur images were removed, the FNMR values increased to different degrees, meaning that these images promote recognition. For all FRAs, the FNMRs reached their peaks after "3nor" was removed; this phenomenon is consistent with the definition of "3nor" images as VHQIs in this paper.
The above experiments show that the proposed method generalizes to other FRAs. Table 6 reports the results for FRAs with and without a quality assessment module; experiments without this module are called the baselines. The comparison methods include a general-purpose IQA method, the deep bilinear CNN (DBCNN) [23], and two FIQAs, FaceQnet [37] and MagFace [43]. We pretrained the synthetic distortion CNN (SCNN) in DBCNN on LFW and extended data (the cross-validation set in Section 4.5) and fine-tuned DBCNN. Meanwhile, to verify the effect of the added IdleBlock, we also conducted ablation experiments with FBSA_M. The three comparison methods each take an image as input and predict a corresponding quality score (QS). Our dataset can be roughly divided into three types of quality images. An FIQA that predicts a QS requires a threshold to classify an image, and this threshold is not easy to determine. Therefore, we divided the quality scores predicted by each comparison method into three categories, with the lowest-quality category representing the images filtered out before matching by the FRAs.
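One way to realize the three-way split of the comparison methods' scalar quality scores is by rank tertiles (a reasonable reading; the exact split rule is not pinned down here):

```python
def tertile_bins(quality_scores):
    """Assign each scalar quality score to one of three bins
    (0 = lowest quality, 2 = highest) by splitting the sorted
    scores into thirds; bin 0 holds the images that would be
    filtered out before matching."""
    qs = sorted(quality_scores)
    n = len(qs)
    lo, hi = qs[n // 3], qs[(2 * n) // 3]
    return [0 if q < lo else (1 if q < hi else 2) for q in quality_scores]
```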
At fixed FARs, all FRA performances with FaceQnet, FBSA_M, and FBSA_M1 improve when poor-quality images are rejected. With FBSA_M1, the TAR increases by up to 17.11%, 16.91%, and 11.37% for FRA-A at 1% FAR, 0.1% FAR, and 0.01% FAR, respectively. For most FRAs, the results of FBSA_M1 are better than those of FBSA_M. For the other FRAs that were not involved in the creation of the labels, the recognition rates improved by at least 7.61%, 15.45%, and 7.95% after the FBSA_M1 module was used to filter the low-quality images. Our FBSA_M1 is at least 2.8% better than FaceQnet at 1% FAR and is slightly worse at 0.1% FAR. DBCNN is an excellent general-purpose IQA, but it may not learn the quality characteristics related to recognition in the SC database due to the difference between the pretraining data and the real data. MagFace has little effect, probably because our training set is small relative to the 5.8 M images used in the original paper. These experiments show that the proposed FBSA_M1 can reject low-quality images to improve recognition performance.

Performance of FBSA_M2 on the M-PIE.
The resulting ERCs upon removing each type of image according to the predicted labels on the M-PIE are shown in Figure 14 and are similar to the results for the SC database. Under the two thresholds, the FNMRs of all FRAs decreased significantly after L_1blur was removed; as expected, the results were the opposite after H_3blur was removed. (Note that our ERCs differ slightly from the curve as originally proposed: we recalculate the FNMR after removing a class of images rather than removing images below a quality threshold, but the curves still show the effect of removing each type of image on recognition performance.) In addition, without "3nor," the FNMRs reached their maxima. M_2blur, "3dark," and "3bri" had a modest effect at the 0.2 FNMR threshold, but the effect improved at the 0.35 FNMR threshold. Similarly, "3nor" was predicted to have the best quality. L_1blur, M_2blur, and H_3blur exhibited obvious differences in recognition performance, indicating that the results of the quality evaluation model in this paper are strongly correlated with recognition performance. Table 7 summarizes the verification performance with and without different FIQAs on the M-PIE. Similar to the results on the SC database, with FBSA_M2 our performance was significantly higher at 0.1% FAR, except for FRA-D, and was slightly lower than that of FaceQnet, except for FRA-A. With the increase in training set data, DBCNN and MagFace had a significant effect, but not as good as that of our method for most FRAs. FaceQnet is slightly higher than our method for FRA-C and FRA-D, but overall, our quality prediction model was effective for filtering out low-quality images to improve the recognition rate.

Cross-Database Performance.
We have verified that our method generalizes well across different FRAs. We also conducted experiments to test the generalization capability on another dataset, LFW. Because we used the human visual system to assist in establishing the labels, we also verified whether the evaluation results are consistent with human visual assessment. We evaluate different degrees of brightness and sharpness degradation in face images, and the degradation span of these two factors in LFW is small. Therefore, we adjusted the brightness and sharpness of the images by a method similar to that used to extend the M-PIE.
Most of the subjects in this dataset have only one image, and most of the images include nonfrontal poses and nonneutral expressions, so the VHQIs required for image matching are lacking. This experiment was simply a test of whether our model works when images include multiple distortion factors. We selected subjects with more than two images for testing. Because the sharpness and brightness distortions were extended in the same way as for the M-PIE, we used the quality model trained on the M-PIE for prediction. The ERCs of all FRAs are plotted in Figure 15. Figures 15(a) and 15(b) show that under the two initial thresholds, the FNMR can be greatly reduced by removing images such as L_1blur; likewise, the FNMR can be significantly increased by removing images such as H_3blur. These two initial thresholds could not separate "3dark" from M_2blur, so we set the initial threshold to 0.6 FNMR and drew the corresponding ERCs (see Figure 15(c)) to verify the effectiveness of our model. In this way, the quality differences between the three types of images can be visualized more clearly. In summary, our model can distinguish the three types of data (L_1blur, M_2blur, and H_3blur), and these three types of data are strongly correlated with recognition performance.
The results for FRAs with and without the state-of-the-art methods and the proposed method are shown in Table 8.

After the FIQAs were used to filter out low-quality images, the TARs improved for most FRAs. Our method exhibited better performance, with the highest improvements of 25.69% at 1% FAR, 13.34% at 0.1% FAR, and 5.53% at 0.01% FAR for FBSA_M2. FBSA_M2 performed better than FBSA_M1, which may be because the training data of FBSA_M2 were synthesized in the same way as this validation dataset. FBSA_M1, trained with data from practical application scenarios, was still better than FaceQnet. Our method also produced accurate predictions of sharpness and brightness. DBCNN has some effect on all FRAs except FRA-A, and MagFace has little effect on the cross-validation sets. Because IQA partially differs from FIQA, a good general-purpose IQA may not be suitable for FIQA tasks. At 0.01% FAR, the TARs indicate a low recognition rate because the images in the LFW contain various distortion factors and are of worse quality after brightness and sharpness degradation, so very few samples exceed the 0.01% FAR threshold. Through our quality evaluation model, recognition performance can be effectively improved by rejecting low-quality images for different FRAs and datasets; the proposed method thus exhibits good robustness.
Based on the assessment results of the model, we randomly selected images from each predicted class and display them in Figure 16. This visualization makes clear that the model accurately judges extreme brightness and sharpness. We arranged images whose brightness and sharpness fall between those of adjacent classes in the training data, such as between "1bri" and "2bri" or between "1bri" and "1nor." Therefore, some of the data lie completely outside the training data for our model, which may lead to ambiguity in classification. Rowden and Jain [5] and Khodabakhsh et al. [56] concluded that human assessment strongly correlates with FRA performance. It can also be concluded that each type of predicted image is correlated with the recognition performance defined in this paper.

Conclusions
We proposed a method of establishing FIQ labels based on brightness and sharpness that are strongly correlated with recognition and trained a model to predict quality. Overall, our model can accurately classify and distinguish images of different qualities, even for FRAs that were not involved in the label creation and model training processes. The model also accurately evaluates quality for the FRAs mentioned in this paper on the cross-validation set. Note that an improvement in the classification accuracy of the model is needed to make further progress. In addition, more factors affecting recognition could be considered for adaptation to more varied application scenarios. In the future, the use of FIQA to improve the performance of other image-based research is worth discussing.

Data Availability
Three datasets, M-PIE, LFW, and a custom dataset, are used in this paper; the first two are public datasets, and the last is owned by a technology company (Wisesoft Co., Ltd., Chengdu, Sichuan, China). The custom dataset cannot be made publicly available because public availability would compromise privacy and we do not have permission to share the data. To allow other researchers to replicate our method, we also performed the same experiments on a publicly available dataset (M-PIE), which interested readers can readily access. The M-PIE and LFW datasets can be found at http://multipie.org and http://vis-www.cs.umass.edu/lfw/, respectively.
Figure 16: Classification results for LFW. The brightness dims from left to right (e.g., the first row corresponds to "1bri," "1nor," and "1dark"), and the sharpness increases from top to bottom (e.g., the first column corresponds to "1bri," "2bri," and "3bri").