FaceFilter: Face Identification with Deep Learning and Filter Algorithm

Although significant advances have been made recently in the field of face recognition, these have some limitations, especially when faces are in different poses or have different levels of illumination, or when the face is blurred. In this study, we present a system that can directly identify an individual under all conditions by extracting the most important features and using them to identify a person. Our method uses a deep convolutional network that is trained to extract the most important features. A filter is then used to select the most significant of these features by finding features greater than zero, storing their indices, and comparing the features of other identities with the same indices as the original image. Finally, the selected features of each identity in the dataset are subtracted from features of the original image to find the minimum number that refers to that identity. This method gives good results, as we only extract the most important features using the filter to recognize the face in different poses. We achieve state-of-the-art face recognition performance using only half of the 128 bytes per face. The system has an accuracy of 99.7% on the Labeled Faces in the Wild dataset and 94.02% on YouTube Faces DB.


Introduction
Recently, deep neural networks and especially convolutional neural networks (CNNs) have become the most commonly used method for feature representation and have achieved good results in face recognition problems. Face recognition can be divided into two categories: face verification, where two faces are presented and the system needs to verify whether these two faces belong to the same person, and face identification, where a face image is presented with an unknown identity and the system needs to determine this identity.
Previous approaches to face recognition that are based on the discriminative classification model (face identification) are trained on a dataset of known identities, and an intermediate bottleneck layer is used as a representation for recognition. This approach produces a very large representation for each face, but some works have tried to reduce this dimensionality using PCA [10].
Another approach, used in FaceNet [14], directly trained its output to obtain a 128-D embedding using a triplet-based loss function based on LMNN [9]. These triplets comprise two matching faces and a nonmatching face. The aim of the triplet loss function is to separate positive results from negative ones by a certain distance margin.
In contrast, our approach uses an unsupervised learning technique to obtain 128 bytes per face and then passes these bytes to a filter in order to find the most suitable representation for each face. We then reduce the dimensionality of the representation to half of the 128 bytes, to match the original face with other faces to find the identity. This approach can identify a given face in different poses and can identify other faces that are most similar to the original identity.
As an illustration, Figure 1 shows a picture of a single individual at different angles and in different poses. The remainder of this paper is organized as follows: Section 2 discusses the most important related work in face recognition. Our method is presented in Section 3, including a description of deep neural networks and our algorithm for handling the features. Sections 4 and 5 present some quantitative results and an evaluation of these.

Related Works
Our approach is similar to other recent works [3,10,14] in that it learns its representation directly from the face. However, instead of using a vector of features for reidentification, we reduce the vector representation to half of the features extracted for each face. We use a deep convolutional neural network architecture inspired by the NN4 FaceNet [14] and OpenFace [15] networks, but we remove the L2 normalization layer and instead use another fully connected layer.
There are an enormous number of studies of face recognition, and we will briefly discuss the most relevant works.
Huang et al. [16] proposed a convolutional deep belief network based on local convolutional restricted Boltzmann machines to learn a face representation. The learning method was unsupervised, and the training was on an unlabeled natural image dataset. They then transferred the learned representation to face identification through a classification method such as SVM.
Another approach to face recognition was proposed by Taigman et al. [17]. This approach, called DeepFace, is one of the earlier large-scale applications of a 3D model for face recognition. They extracted the face representation using a nine-layer DeepFace model, which mainly consists of two convolutional layers, three locally connected layers, and two fully connected (FC) layers. The model has more than 120 million parameters and uses several locally connected layers without weight sharing. Their system was trained on 4.4 M 2D facial images of 4,030 identities, and they achieved an accuracy of 97.35% on the benchmark LFW [18] dataset.
Schroff et al. proposed a CNN-based approach for face recognition and clustering. This approach, called FaceNet [14], is based on eleven convolutional and three FC layers. They trained a deep convolutional network on a dataset of 200 M faces and 8 M identities, using a triplet loss function to directly optimize the embedding instead of an intermediate bottleneck layer as in previous works. They used triplets of roughly aligned matching/nonmatching face patches obtained with an online triplet mining method, and they achieved state-of-the-art face recognition performance with 128 bytes for each face.
Sun et al. proposed another framework called DeepID [5,6,10] for face identification and verification. Their approach utilized an ensemble of shallower and smaller deep convolutional networks than DeepFace, i.e., every DCNN has four convolutional layers and uses 39, 31, and 1 patches, respectively, as an input. Their framework was trained on 202,599 images of 10,177 subjects. Their approach is considered the first to achieve results that surpass human performance for face verification on the LFW dataset.
Parkhi et al. [19] collected a face dataset of 2.6 M 2D faces from 2,622 identities by proposing a new method for crawling faces from the web. They presented a VGG-Face model consisting of 16 convolutional layers and three fully connected (FC) layers. The authors claimed that they achieved 98.95% accuracy on the LFW [18] dataset.
Deep 3D face recognition results have been presented by Kim et al. [20]. They fine-tuned the VGG-Face network [19] on 3D depth images, using an augmented dataset of 123,325 depth images. They then tested the model and reported their results on three public datasets: Bosphorus [21], BU3DFE [5], and 3D-TEC (twins) [22]. But their

Deep Convolutional Networks.
We used a deep neural network structure called an NN4 neural network. Before they were input to the network, we resized all images to a size of 96 × 96 × 3. These were used as input to the first convolutional layer, which has 64 kernels of size 7 × 7 × 3 with stride 2. The second convolutional layer has 64 kernels of size 1 × 1 × 3 with stride 2, and in the third convolutional layer, 192 kernels are used with size 3 × 3 × 3 and stride 2. After these layers, an inception architecture was used in which there were six blocks labeled inception 3a, inception 3b, inception 3c, inception 4a, inception 4e, and inception 5a [23].
Since the input of the network was 96 × 96 × 3 and the receptive field was small, the computational requirement was drastically reduced. The total number of parameters was 3,743,925, and the number of trainable parameters was 3,734,613, with 9,312 nontrainable parameters. We trained the network using a stochastic gradient descent (SGD) algorithm with a learning rate starting from 0.05 on a GPU. The model was trained on 202,599 face images of 10,177 subjects. Table 1 shows the network structure. Figure 2 depicts the model diagram, while Figure 3 illustrates in detail the structure of the inceptions used in this study.
Before training, we used the FaceNet [14] weights, which were obtained by training with the triplet loss function, as a baseline for our network. Then, we used the Kullback-Leibler (KL) divergence loss function to train our model, as in the Variational Feature Learning (VFL) [24] loss function. The difference between our loss function and the VFL loss function is that VFL uses two fully connected layers with the same input and output to predict the mean μ and standard deviation σ of a Gaussian distribution, which are then used to calculate the KL divergence loss. In our training, since the input and output of these two fully connected layers are the same, we used a single fully connected layer, "fc1", to predict both the mean μ and the standard deviation σ of a Gaussian distribution. The mean μ and standard deviation σ are used to calculate the loss function, in which n denotes the output vector size, i.e., 128 in our training. The network is trained with a softmax classifier for 200 epochs using an Adam optimizer [25] and a learning rate starting from 0.05. The training dataset was divided into 70% for the training set and 30% for the validation set.
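For reference, a KL divergence loss of this kind is commonly computed in closed form. The sketch below assumes the predicted Gaussian has a diagonal covariance and is compared against a standard normal distribution; both choices, the absence of any averaging over n, and the function name `kl_to_standard_normal` are assumptions for illustration rather than details taken from this paper.

```python
import math


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over the n output components.

    mu and sigma are equal-length sequences (n = 128 in the paper's setting).
    Comparing against a standard normal is an assumption of this sketch.
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    total = 0.0
    for m, s in zip(mu, sigma):
        if s <= 0:
            raise ValueError("standard deviations must be positive")
        # Per-component closed form: 0.5 * (mu^2 + sigma^2 - log(sigma^2) - 1)
        total += 0.5 * (m * m + s * s - math.log(s * s) - 1.0)
    return total
```

The loss is zero exactly when every μ is 0 and every σ is 1, and it grows as the predicted distribution moves away from that target.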

Face Reidentification Equations.
Each original image x that we want to predict is represented by f(x) ∈ R^128, a vector of 128 bytes indexed from 1 to 128. This can be expressed as in (1), where x_o is the original image that we want to predict and n is the number of features v in that vector. The vectors of the identities in the dataset are also extracted, as expressed in (2), and kept in a separate model file; here n = 128, x_id^i is the image of a particular identity id, and i refers to the number of that identity in the dataset.
After extracting the vectors, we pass the vector of the original image to a filter to extract the most important values that can represent the original image. The filter works as a net to select the highest values among the features in the vector of the original image. It takes the values greater than zero together with their corresponding positions, i.e., the index of each value. In the filter, n is the number of features v_o^i in the vector of the original image, i.e., 128, and i is the index of each feature in the vector. The selected features v_o^i that have values greater than zero are stored in val, while their corresponding indices are stored in ind.
We can then select, from each image of the identities in the dataset, the features at the same indices as the selected features of the original image. In this selection, id refers to a particular identity in the set of identities i, and n_id is the number of features in each image of the identities. A feature v_id^i of the identity id is chosen if its index index_id^i is equal to the index index_o^i of a selected feature of the original image, and it is stored in val_id^i. Here, we do not need to select the values greater than zero for each identity in the dataset; rather, we just take the values corresponding to the indices of the largest values in the original image. This step is very important because the feature of an eye, for example, may be stored at a particular index; consequently, we need to take the feature at that index from each image in the dataset.
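One way to write the feature vectors, the filter, and the index-based selection described above is given below. This is a hedged reconstruction from the prose definitions only: the equation numbering is omitted, and the separate feature index j is introduced here (the text uses i both for the identity number and for the feature index), so the notation may differ from the published equations.

```latex
\begin{align*}
f(x_o) &= \left(v_o^{1}, v_o^{2}, \dots, v_o^{n}\right), \\
f\!\left(x_{id}^{i}\right) &= \left(v_{id}^{1}, v_{id}^{2}, \dots, v_{id}^{n}\right), \qquad n = 128, \\
(\mathrm{val}, \mathrm{ind}) &= \left\{ \left(v_o^{j},\, j\right) : v_o^{j} > 0,\ 1 \le j \le n \right\}, \\
\mathrm{val}_{id}^{i} &= \left\{ v_{id}^{j} : j \in \mathrm{ind} \right\}.
\end{align*}
```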
To recognize the identity, we calculate the distance between the filtered values of the original image and the corresponding values of each identity image in the dataset. The identity image with the lowest distance to the filtered values of the original image is taken to have the same identity as the original image, where i refers to the number of identities. Note that we keep only the weights of the images of all identities obtained by the model; these weights are kept in another model file called the acknowledge base.
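The filter and matching steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the distance is taken to be the sum of absolute differences (the text only says the features are subtracted and the minimum is taken), each identity is represented by a single feature vector, indices are 0-based, and the names `filter_positive`, `select_by_indices`, and `identify` are hypothetical.

```python
def filter_positive(features):
    """Return (val, ind): the strictly positive features and their indices."""
    ind = [i for i, v in enumerate(features) if v > 0]
    val = [features[i] for i in ind]
    return val, ind


def select_by_indices(features, ind):
    """Pick the features of a stored identity at the query's filtered indices."""
    return [features[i] for i in ind]


def identify(query, gallery):
    """Return (label, distance) of the stored identity closest to the query.

    gallery maps a label to one feature vector of the same length as query.
    The sum of absolute differences is an assumed distance, for illustration.
    """
    val, ind = filter_positive(query)
    best_label, best_distance = None, float("inf")
    for label, features in gallery.items():
        candidate = select_by_indices(features, ind)
        distance = sum(abs(q - c) for q, c in zip(val, candidate))
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label, best_distance


# Toy 4-D example instead of 128-D: only indices 0 and 2 pass the filter.
query = [0.5, -0.2, 0.9, -0.1]
gallery = {
    "alice": [0.4, 0.8, 1.0, -0.9],
    "bob": [-0.5, -0.2, 0.1, -0.1],
}
label, distance = identify(query, gallery)
```

In the toy example, "bob" agrees with the query exactly at the two indices that were filtered out, but those positions are ignored, so the comparison is decided only by the query's positive features.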

Image Aligning with Face Reidentification.
Face detection and recognition still have difficulty identifying a face, especially when the face is turned downward or to some other angle in the image. This problem can be addressed by searching for a face in the image. If no face is found, we rotate the image step by step from 0 to 360°, by 14° at each step, until we find a face in the image, and then pass the rotated image on as a new image. Therefore, the total number of steps is 25. If we cannot find a face after rotating the image, we pass the image on without rotation, because there may still be a face in the image in a pose that cannot be detected. Figure 4 shows an image with a face that the face detector cannot detect at first but does detect after rotation, while Figure 5 shows a face that cannot be detected even after a full 360° of rotation, so the original image is left unchanged.
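The rotation search can be sketched as a simple loop. The sketch below is illustrative only: `detect_face` and `rotate` are hypothetical callables supplied by the caller (any face detector and any image-rotation routine), and only the 14° step and the 25-step limit come from the text.

```python
def find_face_by_rotation(image, detect_face, rotate, step_degrees=14, max_steps=25):
    """Rotate the image in fixed steps until a face is detected.

    detect_face(image) -> bool and rotate(image, degrees) -> image are
    hypothetical callables. Returns (image_to_use, angle), where angle is
    None if no face was found and the original image is passed on unchanged.
    """
    if detect_face(image):
        return image, 0
    for step in range(1, max_steps + 1):
        angle = step * step_degrees
        rotated = rotate(image, angle)
        if detect_face(rotated):
            return rotated, angle
    # No face at any angle: keep the original image, as described above.
    return image, None


# Toy stand-ins: an "image" is just its orientation in degrees, and the
# fake detector only fires when the orientation is 224 (16 steps of 14).
def fake_rotate(orientation, degrees):
    return (orientation + degrees) % 360


found = find_face_by_rotation(0, lambda o: o == 224, fake_rotate)
missed = find_face_by_rotation(0, lambda o: False, fake_rotate)
```

With 25 steps of 14° the search covers angles up to 350°, which matches the step count given in the text.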

Evaluation
We used a neural network to extract the features of the faces. Feature extraction produces 128 bytes for each face; we then find the weights greater than zero in the original image, together with their corresponding indices, and take the weights of the identities at those same indices. The process of selecting weights larger than zero with their corresponding indices is called the filter process, in which the dimensionality of the vector is reduced to half of 128 bytes. After that, the distance between the filtered bytes of the original image and the bytes of each identity at the same indices is calculated to find the minimum number. The minimum number refers to the identity of the original image. We evaluated the network on the Labeled Faces in the Wild and YTF [26] datasets. These two datasets have been used in most previous works that obtained state-of-the-art results in their evaluation. We achieved good results on these two datasets.
In the evaluation process, we extracted the features of each image in the dataset, where each image has 128 features, and stored them in a separate file. Then, we divided the weights into blocks by dividing the total number of weights by 128 to find the number of identities, as in (7). Each block contains 128 weights and is treated as a single block for a single identity. For the original image that we want to identify, we extracted its 128 features using our model and passed these features to the filter to find the most important features for representation and to reduce the dimensionality by half. After choosing the positive values among the features of the original image and taking their corresponding indices, we extract the features of each block from the features of the dataset according to the indices of the positive features of the original image. In this extraction, o_i2 denotes the indices of the filtered weights of the original image, with i2 indexed from 0 to half of 128, and W is the weight of the identity id_n. Finally, we applied (6) to identify the image.
Figure 4: The first image is the original image. The detector could not detect the face, so we rotated it. After two steps of rotation (rotate 1 and rotate 2), the detector still could not detect the face. After 16 steps of rotation, the detector detected the face successfully; therefore, the image in (rotate 16) is used in place of the original image. The identity of the face in rotate 16 has been recognized effectively by the system.
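The block division described above amounts to simple slicing of a flat list of stored weights. The following sketch assumes the weights are kept as one flat Python list with exactly 128 values per identity; the helper names `split_into_blocks` and `select_from_blocks` are hypothetical.

```python
def split_into_blocks(weights, block_size=128):
    """Split a flat list of weights into one block per identity.

    The number of identities is len(weights) / block_size.
    """
    if len(weights) % block_size != 0:
        raise ValueError("total number of weights must be a multiple of block_size")
    count = len(weights) // block_size
    return [weights[k * block_size:(k + 1) * block_size] for k in range(count)]


def select_from_blocks(blocks, indices):
    """For every block, keep only the weights at the query's filtered indices."""
    return [[block[i] for i in indices] for block in blocks]


# Toy example: 256 stored weights give two identity blocks of 128 weights each.
flat_weights = [float(i) for i in range(256)]
blocks = split_into_blocks(flat_weights)
selected = select_from_blocks(blocks, [0, 5, 127])
```

Each row of `selected` can then be compared with the filtered features of the original image to find the closest identity.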

Various Dimensionalities.
Various embedding dimensionalities were explored in previous studies [14], and accordingly, the dimension 128 has been selected as it gives the best accuracy. The comparison between four embedding dimensionalities, 64, 128, 256, and 512, shows that the difference in performance is small. In this study, we explored the best dimensionality, i.e., 128, before and after applying the filter. After applying the filter to the 128-dimensional vector with our new algorithm, the dimensionality is reduced to half of 128, with higher accuracy than the unfiltered 128-dimensional vector.

Acknowledge Base Identities.
In order to increase the number of identities without looking again at the picture of any identity in the dataset, an acknowledge base model has been created to save the features of each identity. The features of any new identity are saved in the acknowledge base model. This acknowledge base model is then used to predict the identity of any other unseen face picture.
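A store of this kind can be sketched as an append-only container of labeled feature vectors. The class below is a hypothetical illustration, not the authors' model file format: the class name, its methods, and the in-memory flat-list layout are all assumptions.

```python
class AcknowledgeBase:
    """Append-only store of feature vectors with their identity labels.

    Hypothetical sketch: features are kept in one flat list, one
    fixed-size block per stored image, so new identities can be added
    without revisiting any earlier picture.
    """

    def __init__(self, feature_size=128):
        self.feature_size = feature_size
        self.labels = []
        self.weights = []

    def add(self, label, features):
        """Store the features of one image under the given identity label."""
        if len(features) != self.feature_size:
            raise ValueError("unexpected feature vector length")
        self.labels.append(label)
        self.weights.extend(features)

    def block(self, k):
        """Return the stored feature vector of the k-th entry."""
        start = k * self.feature_size
        return self.weights[start:start + self.feature_size]

    def __len__(self):
        return len(self.labels)


# Toy example with 4-D features instead of 128-D.
base = AcknowledgeBase(feature_size=4)
base.add("alice", [0.4, 0.8, 1.0, -0.9])
base.add("bob", [-0.5, -0.2, 0.1, -0.1])
```

Enrolling a new identity only requires one more `add` call with features extracted from the new picture.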

Performance on LFW and YTF.
During the evaluation, the features of every identity are extracted and kept in the acknowledge base. Any further extraction for any identity is added to the acknowledge base with the corresponding label of that identity. At each evaluation step, we took 200k images for testing and kept their features, with their corresponding labels, in the acknowledge base together with any features previously extracted for any identity. That means the acknowledge base model can store the features of all images in the dataset and can find the single identity of any face among all these identities. We achieved a classification accuracy of 99.70% on the LFW dataset and 94.02% on the YTF dataset. Table 2 and Table 3 compare our classification accuracy on LFW and YTF with that of some other methods. Figures 6 and 7 show comparison charts against previous studies on LFW and YTF.

Conclusions
A deep neural network is used in this paper for face reidentification. The filter technique is used to select the most important features from the features extracted by the model. This method can identify the face in different poses and under different levels of illumination. The 360° rotation technique is used for images in which the face appears at different angles; this kind of rotation cannot be done with the augmentation methods used in deep learning.
We noticed that deep learning is very important for extracting the features, and that well-prepared mathematical operations on the features extracted by deep learning can further increase the accuracy of the model.

Data Availability
The model, the weight extraction code, the features saved in the acknowledge base, and the code for the evaluation equations are available at the following URL: https://drive.google.com/open?id=1pXMkhAOx9zV4n8ynmer2xlF5lLeQZ3Rz.

Disclosure
The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.