An Efficient Face Detection and Recognition System Using RVJA and SCNN

Facial recognition is the basic process for an extensive range of security systems functioning in real-time applications. Owing to several factors such as low resolution, occlusion, illumination, noise, and pose variation, many models developed for face recognition (FR) have not achieved satisfactory outcomes. Therefore, an effectual face detection and recognition (FDR) system that accounts for these factors is proposed here, built on the reconstruction scheme-centric Viola–Jones algorithm (RVJA) and the shallowest sketch-centered convolutional neural network (SCNN). First, the algorithm identifies faces in a provided image by determining its global facial model in various positions and poses; then, it enhances the recognition outcome utilizing the SCNN. Initially, face detection (FD) is performed by employing the RVJA. The proposed RVJA handles unconstrained face images owing to efficient properties such as boundedness and invariance, together with the ability to rebuild the actual image. After that, the SCNN methodology is utilized for FR, thus learning the complicated features of the face-detected images. Next, the proposed methodology's experimental outcomes are compared with other prevailing methodologies regarding metrics such as area under curve (AUC), recognition accuracy (RA), and average precision (AP). The experimental outcomes show that the proposed model recognizes facial images with higher accuracy than the other conventional methodologies.


Introduction
In both the military and commercial fields, FR has been deemed a popular research area [1]. FR is the technique of verifying or identifying a person's identity by evaluating and relating patterns grounded in the person's facial features [2, 3]. In real-world scenarios, FR algorithms face limitations such as proxy via pictures, lighting conditions, and low-quality image processing [4]. Moreover, FR becomes a complicated task under factors such as partial or total occlusion by other objects, the view angle owing to the camera position, or the low resolution of the capturing sensor [5, 6]. For humans, FR is extremely effortless; for a machine, however, it is a distinctly harder problem [7]. The development of FR methodologies has shown extremely swift progress over the last two decades [8, 9]. With the continuous enhancement of science and technology, FDR is applied in numerous fields such as the monitoring systems of bank self-service cash machines, Alipay's new face-brushing technology, identity verification via application face scanning, and the face unlocking of mobile phones [10]. FDR methodologies, which have been well studied in the computer vision domain, have been amalgamated with these systems in an attempt to handle certain external issues such as computational cost, face capture angle, facial expression, the existence of hair, facial alteration depending on luminosity, time, and the usage of accessories or ornaments, classifier performance, ethnic variations, and long distances from the camera [11, 12].
Deep learning can achieve a good approximation of a complex function by adding hidden layers; hence, it is capable of achieving better results in face recognition [13]. With recent enhancements in deep learning (DL), automatic FR systems grounded in deep convolutional neural networks (DCNNs) have outshined human performance when recognizing faces under well-constrained conditions such as standard illumination and frontal pose [14–16]. Therefore, an effectual FDR model utilizing RVJA and SCNN is proposed here to trounce the aforementioned issues. The proposed technique's major contributions are listed below.
(i) For detecting the face, an RVJA that handles unconstrained face images is proposed.
(ii) For face recognition, an SCNN approach that learns the complicated features of the face-detected images is proposed.
The paper's remaining parts are structured as follows: the conventional research models pertinent to facial image recognition are surveyed in Section 2; the proposed framework is explicated in Section 3; the proposed model's performance is assessed in Section 4; lastly, the paper is wound up with future advancements in Section 5.

Literature Survey
Cheng et al. [17] developed a two-layer CNN to learn higher-level features for FR by means of a sparse representation that sparsely specifies the face image by a subset of training data. The FR system's performance was enhanced considerably by this description of the provided input face image. The experimental outcomes displayed that, on the given dataset, the presented system achieved better performance than other systems. Nonetheless, the CNN required a larger dataset.
Iqbal et al. [18] examined hybrid angularly discriminative features by amalgamating a multiplicative angular margin with an additive cosine margin to enhance the efficacy of angular SoftMax loss and large-margin cosine loss. The model was trained on the CASIA-WebFace dataset; subsequently, testing was conducted on YouTube Faces (YTF), Labeled Faces in the Wild (LFW), VGGFace1, and VGGFace2. The experimental outcomes displayed that the model attained higher accuracy than the prevailing methodologies. However, this model consumed more time.
Zhao et al. [19] presented data augmentation via image brightness alterations, geometric transformations, and the application of varied filter operations. Furthermore, the finest data augmentation methodology was determined via orthogonal experimentation. Eventually, the system's FR performance was illustrated in a real class. With data augmentation, the developed system attained better accuracy than the PCA and LBPH methodologies. Nevertheless, the VGG-16 network consumed more time to train its parameters.
Zhao et al. [20] constructed a deep neural network (DNN) to deeply encode the face regions; here, a face alignment algorithm was deployed for localizing the key points inside the faces. Then, PCA was employed to abate the deep features' dimensionality; similarly, a joint Bayesian model was utilized for evaluating the similarity of feature vectors, thus attaining highly competitive face classification accuracy. In addition, the FR system handled several FR attacks under different contexts. However, when compared with conventional machine learning algorithms, the neural network needed more data.
Alghaili et al. [21] suggested a system that could directly detect an individual under all criteria by extracting the most significant features and utilizing them to recognize a person. A DCNN was trained to extract the most significant features. Then, the significant features were selected utilizing a filter. After that, to detect the minimum number that denotes the identity, the selected features of every single identity in the dataset were subtracted from the actual image's features. The outcomes displayed that the presented model recognized faces effectively in varied poses. However, owing to the max-pool operation, the DNN was slow.
Lin et al. [22] presented a feature extraction methodology that transmuted thermal images into features. In addition, the authors utilized DL, Random Forest, and ensemble learning to construct an FR model. In the feature extraction methodology, the facial image was cut into blocks; subsequently, the feature image and the feature matrix were regenerated. The empirical outcomes demonstrated that the feature extraction technique achieved higher prediction performance. Nevertheless, this model required more time for prediction.
Lei et al. [23] constructed a hybrid model grounded in DL and visual tracking, RFR-DLVT, to obtain efficient FR. Initially, video sequences were separated into reference frames (RFs) and nonreference frames (NRFs). Next, in RFs, the target face was recognized by means of the DL-centric FR methodology. Meanwhile, in NRFs, a Kernelized Correlation Filters-centered visual tracking model was employed to speed up FR. The model was tested on common datasets and attained better performance. Nevertheless, this model required a larger amount of data to attain that performance.

Tabassum et al. [24] amalgamated the coherence of the discrete wavelet transform (DWT) with four varied algorithms, (i) the eigenvector of linear discriminant analysis (LDA), (ii) the error vector of principal component analysis (PCA), (iii) the eigenvector of PCA, and (iv) a CNN, to enhance FR accuracy; subsequently, the four outcomes were amalgamated utilizing the entropy of detection probability along with a fuzzy system. The recognition accuracy was established depending on the image and the diversity of the database. However, an overfitting problem eventuated owing to this CNN.
Teoh et al. [25] structured an FR and identification system utilizing a DL methodology. Primarily, it spotted the faces in images or videos. Next, recognition was performed after training the classifier. A Haar feature-centric cascade classifier was utilized for FD. In the system's classifier section, a TensorFlow model was employed. In the recognition process, the classifier is utilized after being trained. Experimental outcomes were given to demonstrate the system's accuracy. Nevertheless, false-positive detections occurred whilst utilizing the Haar feature-centric cascade classifier.

Proposed Facial Image Recognition System
The input images are first fed to the RVJA for FD. Then, the face-detected images are rescaled and normalized before being passed to the SCNN model for accurate FR. Figure 1 exhibits the block diagram of the proposed framework.

Face Area Detection.
This is the initial step. Here, the face part is segmented from the input images W_n. The RVJA is utilized for segmenting the face part. The traditional VJA detects only frontal faces; hence, it is ineffective when detecting faces turned sideways, upward, or downward. Therefore, iterative closest normalized pixel difference (ICNPD) features are computed for the input images to enhance the detection efficiency; in addition, the face models are reconstructed in various poses and varied directions. Consequently, owing to properties such as boundedness, invariance, and the ability to reconstruct the actual image, FD is enabled by the reconstruction strategy under unconstrained situations. The VJA encloses four phases: selecting features, creating an integral image, AdaBoost training, and cascading.

Selecting Features.
This is the major process in the generation of a face model with infinite novel poses. Here, significant properties such as boundedness and scale invariance are gauged utilizing the NPD features' optimal subset.
After that, the optimal transformation is detected by the iterative closest point (ICP) methodology; here, the approximate feature location is detected, and face cropping and rough alignment are performed until the detected point is close enough to the true location.
At first, the ICNPD feature vector is calculated; then, the algorithm outlines a box with this feature vector. After that, by scanning every single subregion of the image from top to bottom, the outlined box searches for a face in the provided image. The NPD feature betwixt two pixels W_n(i) and W_n(j) in the image is measured as follows:

NPD(i, j) = (W_n(i) − W_n(j)) / (W_n(i) + W_n(j)),

with NPD(i, j) defined as 0 when both pixels are zero. Next, the iterative closest points betwixt the closest-point queries in the target set are computed; in addition, the distance betwixt respective points is minimized to reconstruct the face model. The value of these features is computed by subtracting the sum of the pixels in rotation from the sum of the pixels in translation:

F_HLF = Σ (x, y)_T − Σ (x, y)_R,

where the feature value is specified as F_HLF, and the relative rotation and translation pixels computed in the closest form are signified as (x, y)_R and (x, y)_T. Various parts of the face can be interpreted utilizing such features.
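As an illustration, the boundedness and scale invariance of the NPD measure described above follow directly from its definition; a minimal pure-Python sketch (the function name is illustrative, not from the paper):

```python
def npd(p1, p2):
    """Normalized pixel difference between two pixel intensities.

    Bounded in [-1, 1], and invariant to multiplying both pixels by the
    same positive factor, which is what makes the feature robust to
    global illumination changes.
    """
    if p1 == 0 and p2 == 0:
        return 0.0  # conventional value when both pixels are zero
    return (p1 - p2) / (p1 + p2)


# Scale invariance: halving or doubling both intensities leaves NPD unchanged.
assert npd(120, 40) == npd(60, 20)
```

Because the value never leaves [−1, 1], thresholds learned on one lighting condition remain meaningful on another.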

Integral Image Creation.
Whilst calculating the feature values, the computation spans all the pixels in a specific feature. The number of pixels in large features is high; thus, for larger features, the computation is highly challenging. To make the computations effective, the integral image, an intermediate representation of the image that permits the fast computation of a rectangular region, is generated; here, every pixel value holds the sum of the pixels to the left of and above the specific pixel. The integral image is expressed as follows:

igl(x, y) = igl(x − 1, y) + cs(x, y),

where the integral image is notated as igl(x, y) and the original image is symbolized as W_n(x, y), n = 1, 2, ..., N. The recursion formula utilized in the integral computation is expressed as follows:

cs(x, y) = cs(x, y − 1) + W_n(x, y),

where the cumulative row sum is indicated as cs(x, y). The value of rectangle-like features is computed utilizing the integral image with the four values present at the corners of the provided rectangle rather than computing the sum of all pixels.
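The integral-image construction and the four-corner rectangle sum can be sketched as follows (a generic illustration; function names are not from the paper):

```python
def integral_image(img):
    """Integral image: igl[y][x] = sum of img over rows 0..y, cols 0..x.

    Built with the recursion described above: a running row sum cs is
    accumulated, and each entry adds the integral value of the row above.
    """
    h, w = len(img), len(img[0])
    igl = [[0] * w for _ in range(h)]
    for y in range(h):
        cs = 0  # cumulative sum of the current row
        for x in range(w):
            cs += img[y][x]
            igl[y][x] = cs + (igl[y - 1][x] if y > 0 else 0)
    return igl


def rect_sum(igl, top, left, bottom, right):
    """Sum of pixels in the inclusive rectangle using only 4 corner lookups."""
    total = igl[bottom][right]
    if top > 0:
        total -= igl[top - 1][right]
    if left > 0:
        total -= igl[bottom][left - 1]
    if top > 0 and left > 0:
        total += igl[top - 1][left - 1]
    return total
```

After the one-off O(hw) construction, any rectangle sum costs constant time, which is what makes evaluating thousands of rectangle features per window affordable.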

AdaBoost Training.
Several thousand features may be computed when analyzing a base window; however, only a few of them are utilized for detecting the face. Therefore, the AdaBoost algorithm is utilized to select the best features. By amalgamating weighted weak classifiers, this algorithm forms a strong classifier, expressed as follows:

C(G) = Σ_i θ_i F_HLF(i)(G),

where the strong classifier is specified as C(G), the features known as weak classifiers are signified as F_HLF(i)(G), and the classifiers' respective weights are notated as θ_i. The amalgamation of these features is utilized to decide whether the image's subregion contains a face or not.
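A minimal sketch of the weighted weak-classifier vote, assuming each weak classifier outputs ±1 as in standard AdaBoost (the names and the zero threshold are illustrative assumptions):

```python
def strong_classifier(weak_outputs, weights, threshold=0.0):
    """Weighted vote of weak classifiers, as in AdaBoost.

    weak_outputs: list of +1/-1 decisions F_HLF(i)(G) from each weak
    classifier; weights: their learned weights theta_i.  Returns True
    (face) when the weighted sum clears the threshold.
    """
    score = sum(w * f for w, f in zip(weights, weak_outputs))
    return score >= threshold
```

Individually, each weak classifier is only slightly better than chance; the learned weights let a small committee of them form a reliable face/non-face decision.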

Cascading.
Here, a cascaded system, that is, a series of classifiers comprising numerous stages, is espoused to detect faces in the provided subregion. After a subregion enters the cascaded system, regions with faces are forwarded through all the stages; conversely, regions devoid of faces are rejected at the particular stage itself. By doing so, the system saves time by avoiding the image's non-face regions; in addition, it detects faces under varied expressions, poses, illumination, and disguise. The segmented face images are represented as W_n^face. The VJA's pseudocode is illustrated in Algorithm 1.
Algorithm 1 describes the fundamental steps of the VJA segmentation algorithm: (i) feature computation, (ii) feature selection, and (iii) object detection, which the algorithm undergoes to detect and segment the face from the input frames.
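The early-rejection behaviour of the cascade described above can be sketched as follows (the stage classifiers are stand-in callables, not the paper's trained stages):

```python
def cascade_detect(region, stages):
    """Pass a candidate sub-region through a cascade of stage classifiers.

    Each stage is a callable returning True (possible face) or False.
    A region is rejected at the first failing stage, so the many
    non-face regions cost almost no computation; only regions that
    survive every stage are reported as faces.
    """
    for stage in stages:
        if not stage(region):
            return False  # rejected early; remaining stages are skipped
    return True


# Illustrative stages operating on a toy "region" (a list of values).
stages = [lambda r: sum(r) > 0, lambda r: max(r) > 5]
```

The first stages are deliberately cheap and permissive; later stages are stricter, so total cost is dominated by the easy rejections.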

Rescaling and Normalization.
The face-detected images are passed through preprocessing steps, namely rescaling and normalization, to enhance the overall quality of the feature extraction stage. A face image at a varied scale generates a nonconforming feature representation that hurts the model's generalization; thus, rescaling is performed. Conversely, normalization is performed to normalize the range of the input images' pixel values with the intention of abating the high variation in those values. The normalization is formulated as follows: where the current pixel is specified as W_n(i)^nor, all the other pixel values are notated as W_n(−i)^nor, and the rescaled images are signified as W_n^rs.
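A generic stand-in for the two preprocessing steps is sketched below. The paper's exact normalization formula relates each pixel to the remaining pixel values, so the nearest-neighbour rescaling and min-max normalization here are assumptions for illustration, not the authors' exact operations:

```python
def rescale(img, new_h, new_w):
    """Nearest-neighbour rescaling of a 2-D image to a fixed input size."""
    h, w = len(img), len(img[0])
    return [[img[y * h // new_h][x * w // new_w] for x in range(new_w)]
            for y in range(new_h)]


def normalize(img):
    """Min-max normalize pixel values into [0, 1] to damp intensity variation."""
    flat = [p for row in img for p in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in img]
    return [[(p - lo) / (hi - lo) for p in row] for row in img]
```

Fixing the spatial size first and the value range second means every face the SCNN sees occupies the same input geometry and dynamic range.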

Face Recognition.
Here, the normalized images are utilized by the SCNN model to recognize the person. A CNN is a DL algorithm comprising two sorts of hidden layers, the convolution layer (CL) and the pooling layer (PL), arranged alternately in the neural network. The system provides output by conducting FR to match the user's face. Here, to perform a novel feature-sharing technique, the shallowest layers are incorporated with the hidden layers, thus achieving higher runtime efficiency. In the shallow layer, all the facial landmarks are first estimated holistically to preserve the facial structure; in addition, localization-sensitive information is utilized to construct the sketch feature vector. Therefore, the CL and the PL are fed with the sketch feature built from the landmark-extracted images of the shallow layer. Thus, the recognition accuracy is enhanced when an unknown facial query has to be identified. In this manner, the FR network, which subsumes three phases, namely, (i) shape prediction, (ii) feature extraction, and (iii) classification, enhances the recognition accuracy considerably. Figure 2 exhibits the architecture of the proposed SCNN.

Shape Prediction Stage.
Here, to estimate the face contour, termed the sketch, the shallowest layer localizes a set of facial landmarks for the provided input images. The facial region is depicted by the localized landmarks; in addition, they are linked in a fixed order to build the sketch feature vector. The extracted landmark features are modeled as follows: where the shape-indexed feature map respective to the shallowest layer K_Sha^l(·) is represented as k_n, and the set of localized landmarks is indicated as l. Along with the extracted landmarks, the sketch feature vector is engendered with the predicted shape. Consequently, the feature vectors are jointly utilized to train the network, which utilizes the facial attributes together with the geometric relationships betwixt the sketches and the images.
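The "linked in a fixed order" construction can be sketched as a flat coordinate vector (an illustrative simplification of the paper's sketch feature; the function name is not from the paper):

```python
def sketch_vector(landmarks):
    """Link localized landmarks into a flat 'sketch' feature vector.

    landmarks: list of (x, y) points given in a fixed, consistent order
    (e.g. eyes, nose, mouth corners), so that the same vector position
    always refers to the same facial structure across images.
    """
    vec = []
    for x, y in landmarks:
        vec.extend([x, y])
    return vec
```

Keeping the order fixed is what lets the downstream layers compare geometric relationships between different faces position by position.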

Convolution Layer.
This is the first layer in the CNN, having a set of feature detectors named kernels; every single kernel has its corresponding bias value. To verify whether a feature is present or not, the kernels execute convolution by moving across the image's receptive fields under the guidance of the shape from the preceding stage. The convolution operation betwixt the input's connected region and the weights is formulated as follows: where the nonlinear activation function is specified as z_n, the input nodes' weight vector is depicted as θ(q), and the deep feature map acquired from the CL, denoting the relationship betwixt the sketches and the respective images, is symbolized as K_Con^l.
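A naive valid-mode convolution of one channel with one kernel and bias can be sketched as follows; the text only specifies a generic nonlinearity z_n, so the ReLU used here is an assumed choice:

```python
def conv2d(img, kernel, bias=0.0):
    """Valid-mode 2-D convolution (CNN convention, i.e. cross-correlation)
    of a single-channel image with one kernel, followed by ReLU."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(img) - kh + 1
    out_w = len(img[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = bias
            for a in range(kh):          # slide the kernel over the
                for b in range(kw):      # receptive field at (i, j)
                    s += img[i + a][j + b] * kernel[a][b]
            row.append(max(0.0, s))      # ReLU nonlinearity z_n (assumed)
        out.append(row)
    return out
```

Each output position answers "how strongly is this kernel's pattern present here?", which is the feature-detector behaviour the text describes.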

Pooling Layer.
In this layer, downsampling is performed to mitigate the convolved feature map's size, thus lessening the number of computations needed. Utilizing the max-pooling function, the PL scales down the dimensionality of the input. It is formulated as follows: where the max function is represented as G_max(n) and the pooled feature map is notated as K_pool^l(q). A new vector is formed from the feature maps of the CL and the PL; subsequently, they are flattened to obtain the column vector V_col(n).
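The max-pooling and flattening steps can be sketched as follows (the 2 × 2 window with stride 2 is an assumed configuration, not stated in the paper):

```python
def max_pool(fmap, size=2, stride=2):
    """Non-overlapping max pooling; with the defaults, each spatial
    dimension is halved and only the strongest activation per window
    survives."""
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            window = [fmap[i + a][j + b]
                      for a in range(size) for b in range(size)]
            row.append(max(window))
        out.append(row)
    return out


def flatten(fmap):
    """Flatten a pooled feature map into the column vector V_col."""
    return [p for row in fmap for p in row]
```

Pooling buys a small amount of translation tolerance in addition to the computational saving: shifting a feature by one pixel within a window leaves the pooled value unchanged.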

Fully Connected Layer.
The flattened matrix V_col(n) is given to the fully connected layer. This layer provides the input to the SoftMax layer (SL). The SL offers the probabilities of the input being in a specific class. The SL's output is illustrated as follows: where the layer's bias value is defined as z and the classification output is symbolized as K_sfm^l. The SL encompasses two classes of outputs: (i) known person and (ii) unknown person.
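A minimal sketch of the fully connected layer followed by the two-class SoftMax (the weights and biases are illustrative placeholders, not learned parameters from the paper):

```python
import math


def softmax_layer(v_col, weights, bias):
    """Fully connected layer followed by SoftMax over the classes
    (here: known person vs. unknown person).

    weights: one weight row per class; bias: one bias per class.
    Returns class probabilities that sum to 1.
    """
    logits = [sum(w * v for w, v in zip(row, v_col)) + b
              for row, b in zip(weights, bias)]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

Subtracting the maximum logit before exponentiating changes nothing mathematically but prevents overflow for large activations.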

Result and Discussion
The proposed facial image recognition system is experimentally analyzed in this section. The model was executed in MATLAB. In the evaluation process, to verify the system's efficacy, the proposed system's outcomes are compared with the prevailing DL methodologies.

Database Description.
The LFW database was utilized in the proposed work to study the problem of unconstrained FR. The dataset includes more than 13,000 face images gathered from the web and comprises a large range of poses, illumination, and expressions. It encompasses 5,749 identities, of which 1,680 people have two or more images. In the standard LFW evaluation protocol, verification accuracies are reported on every face pair. Every face has been labeled with the pictured person's name. As shown in Table 1, 80% of the available data are wielded for training and 20% for testing.
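The 80/20 split can be sketched as a simple shuffled partition (the seed and the shuffling strategy are assumptions; the paper does not specify how the split was drawn):

```python
import random


def split_dataset(samples, train_frac=0.8, seed=0):
    """Shuffle and split samples into training and testing portions,
    matching the 80%/20% split used in the experiments."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]
```

For identity datasets such as LFW, one would normally also ensure a person's images do not straddle both portions; that refinement is omitted here for brevity.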

Performance Analysis.
Here, the proposed system is analyzed against the prevailing DNN, Elman neural network (ENN), artificial neural network (ANN), and CNN models regarding AUC, AP, and RA, along with training time. In addition, the proposed system's detection rate is compared with several FD algorithms such as the Viola–Jones algorithm (VJA), Joint Cascade (JCascade), and aggregate channel features (ACF).
The proposed and prevailing methodologies' RA is exhibited in Table 2. As per the table, the proposed model attained an accuracy of 97.14%, whereas the accuracy values attained by the prevailing systems were lower. Thus, it is established that the proposed work attained better performance than the prevailing DL models.
The AUC graph for the proposed and prevailing methodologies is shown in Figure 3. The proposed model obtained an AUC of 94.65%, whereas the conventional DNN, ENN, ANN, and CNN models attained lower AUC values of 87.56%, 88.27%, 89.92%, and 92.39%, respectively. Compared with the other classifiers, the proposed one achieved the best performance, followed by the CNN, with the DNN last. Thus, it is evident that the proposed model is highly effective for facial image recognition.
The proposed and prevailing models' AP is demonstrated in Figure 4. The proposed SCNN attained an AP of 97.29%. Conversely, the prevailing DNN, ANN, ENN, and CNN systems attained lower values of 90.56%, 91.31%, 93.34%, and 95.83%, correspondingly. The evaluation outcomes proved that the proposed model achieved effectual FR performance even with unconstrained images.
The computational complexity of the proposed and conventional FR methods is assessed in Figure 5 regarding training time; a model with a lower training time is considered the better model. The training times taken by the traditional models were 10.13 ms (CNN), 11.24 ms (ANN), 12.04 ms (ENN), and 14.28 ms (DNN), all larger than the 8.42 ms taken by the proposed system. Therefore, it is clear that the proposed work achieved better performance than the prevailing works.
Several FD algorithms are evaluated in terms of detection rate in Table 3. The proposed model attained a higher detection rate than the prevailing methodologies; compared with the prevailing detection methodologies, the detection rate attained by the proposed model improved by up to 13.16%. Consequently, the proposed framework's detection efficacy was enhanced by the effectual features being selected. Thus, from the assessment, it is evident that the proposed methodology performed FD more effectively even for unconstrained images.
The comparison of the proposed model with prevailing CNN-based frameworks regarding recognition accuracy is exhibited in Table 4. The recognition accuracy attained by the proposed SCNN is 97.14%, while the prevailing CNN, MT-CNN, and DWT-CNN attain 86.3%, 96.4%, and 89.56%, correspondingly. On comparing these outcomes, the proposed SCNN achieves higher recognition accuracy than the conventional techniques. Thus, it is concluded that the proposed system is more efficient in facial image recognition.

Conclusion
An effectual FDR system utilizing RVJA and SCNN has been proposed here. To detect the face under varied poses, different levels of illumination, and occlusion, the most significant features are selected from the input images utilizing the RVJA technique. Next, the SCNN with well-trained mathematical operations was employed for recognition. In the performance evaluation, the proposed SCNN is compared with the prevailing DNN, CNN, ANN, and ENN methodologies regarding performance metrics such as RA, AUC, and AP. The evaluation outcomes displayed that the proposed model achieved a higher accuracy of 97.14% than the conventional systems. Therefore, it is concluded that the proposed framework is better and highly effective for facial image recognition. In the future, the work could be enhanced with advanced models to recognize images with imperfect facial data.

Data Availability
The data used to support the findings of this study are available from the first author upon request at any time.

Conflicts of Interest
The authors declare that they have no conflicts of interest.