Identifying Ethnics of People through Face Recognition: A Deep CNN Approach

Interest in face recognition has grown rapidly over the last decade. One of the most important problems in face recognition is identifying the ethnicity of people. In this study, a new deep convolutional neural network is designed to build a model that can recognize the ethnicity of people from their facial features. The new dataset consists of 3141 images collected from three different nationalities. To the best of our knowledge, this is the first image dataset collected for this purpose, and it will be made available to the research community. The new model was compared with two state-of-the-art models, VGG and Inception V3, and the validation accuracy was calculated for each convolutional neural network. The generated models were tested on several images of people, and the results show that the best performance was achieved by our model, with a verification accuracy of 96.9%.


Introduction
The scope of the face recognition field has expanded recently. Face recognition refers to the ability to identify a person from an image or a video frame. Many techniques have been used in face recognition. One of the earliest treats the task as a 2D pattern recognition problem in which distances between important points in an image, such as the distance between the eyes, are used to recognize the face [1].
Another technique is holistic matching, in which the complete face region is taken as the input to the face capture system. The most important studies that used this technique are eigenfaces [2], principal component analysis, and linear discriminant analysis [3].
The feature-based structural technique is another approach to face recognition, in which the local features of the face are extracted first and their locations and local statistics are fed into a structural classifier. The holistic and feature-extraction techniques are combined in a hybrid technique that uses 3D images. The person's face is captured in 3D, and the system then notes important features such as the curves and shapes of the face. The system detects the face, whether in a photograph or in real time, determines its location, measures the curves and shapes of its important features, converts the face into a numerical representation, and matches this representation against a dataset of faces. The most important face recognition technique to emerge recently is the convolutional neural network (CNN) [4]. Although many studies have used CNNs for face recognition, none has proposed a robust model that identifies the ethnicity of people from their faces with high classification accuracy when people of different ethnicities share similar features.
Motivated by this, we propose two new face recognition models, one with regularization and one without, that can recognize the ethnicity and origin of people from their facial features. Specifically, the primary contribution of this paper is a face recognition model that can detect the detailed features of faces and differentiate between them using RGB images or real-time face recognition. The ethnicity of different people can be recognized with this model by extracting the most detailed features of their faces. A new high-resolution dataset was collected for this purpose from three different regions in Asia. The images were collected from social media such as Facebook and VK (a Russian social media website). Finally, we achieved promising performance on another dataset collected for testing. The remainder of this paper is organized as follows. Section 2 presents the related work. Section 3 describes the network designed for face recognition. The experiments and results of the new models are given in Section 4. Section 5 concludes the paper.

Related Work
A face recognition method based on dense grid histograms of oriented gradients (HOG) has been presented [5]. In that study, the face image is divided into many dense grids from which the HOG features are extracted. The HOG feature vectors of all grids are then concatenated to form the feature representation of the whole face, and a k-nearest neighbor classifier is used for recognition. In the training stage, the authors used a face dataset with complex changes in illumination, time, and environment to test the gamma illumination correction, the spatial gradient direction, the block size, the standardization, and the face image resolution, in order to find and analyze the optimal HOG parameters for face recognition. The FERET database is a dataset used for facial recognition system evaluation.
There are many deep-learning-based face recognition methods with high recognition accuracy. One of these methods performs well in restricted environments as well as in natural environments [6]. The authors improved the multipatch method by using patches from four areas of the face. To achieve higher performance, they also used a Joint Bayesian (JB) measure for face verification. The model was trained on the CASIA WebFace set and tested on Labeled Faces in the Wild (LFW).
Learning for face recognition has been proposed in another study [7]. The authors argued that DeepID can be effectively learned through challenging multiclass face identification tasks. Furthermore, the generalization capability of DeepID increases as more face classes are predicted during training. They used about 10,000 face identities in the training set. The generated model achieved 97.45% verification accuracy on the LFW dataset. The deep ConvNet contains four convolutional layers with Maxpooling to extract features hierarchically, followed by the fully connected DeepID layer and the softmax output layer indicating identity classes. The problem of developing effective feature representations that reduce intrapersonal variations while enlarging interpersonal differences in face recognition has been addressed in another study [8], which uses deep learning with both face identification and verification signals as supervision.
The Deep IDentification-verification features (DeepID2) are learned by a deep convolutional network. The face identification task increases the interpersonal variations by drawing DeepID2 features extracted from different identities apart, and the face verification task reduces the intrapersonal variations by pulling DeepID2 features extracted from the same identity together. The face verification accuracy achieved by testing the method on the LFW dataset [9] was 99.15%; this accuracy is different from the validation accuracy. The error rate was reduced by 67% compared with the best previous deep learning result [7]. Another approach for face recognition was presented in which a convolutional neural network (CNN) and a logistic regression classifier (LRC) are combined [4]. The CNN is used to extract features in order to detect and recognize face images, and the LRC [10,11] is used to classify the features learned by the CNN. The structure of the CNN used in that study is composed of four layers: an input layer, two convolutional layers, and one subsampling layer.
The first layer has a size of 64 × 64; the dataset was therefore resized to that size to be compatible with the proposed structure. The output layer is a fully connected layer with 15 feature maps of size 1 × 1.
In our study, we build two models, one with dropout layers and one without, to determine the effect of these layers on training. This study is concerned with recognizing the ethnicity of people from their facial features using these two models. We used a new CNN with regularization (dropout layers) and without regularization to find the most accurate configuration. During training, we used the Adam optimizer [12] with a learning rate of 0.001 and the categorical cross-entropy loss function. The generated models can detect the detailed features of faces from RGB images or through a camera.
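The training setup named above (categorical cross-entropy loss, Adam with a learning rate of 0.001) can be illustrated with a minimal sketch. Only the learning rate comes from the text; the values of `beta1`, `beta2`, and `eps` are the commonly used Adam defaults and are assumptions here, as are the example probabilities and gradient.

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot label and a predicted class distribution."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter (t is the 1-based step count).

    lr=0.001 is the value stated in the paper; beta1, beta2 and eps are assumed defaults.
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Hypothetical 3-class example: the true class is the second one.
label = [0.0, 1.0, 0.0]
loss_confident = categorical_cross_entropy(label, [0.05, 0.90, 0.05])
loss_uncertain = categorical_cross_entropy(label, [0.30, 0.40, 0.30])

# Hypothetical first update of a parameter with value 1.0 and gradient 2.0.
theta, m, v = adam_step(theta=1.0, grad=2.0, m=0.0, v=0.0, t=1)
```

The loss is lower when the model assigns more probability to the correct class, and on the first step the bias-corrected update has a magnitude close to the learning rate regardless of the gradient scale.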

Ethnics Identification Using Deep Learning
Our deep learning network consists of twelve layers. Four of these layers are Conv layers, each followed by a Maxpooling layer, and some of the Conv layers are additionally followed by a dropout layer after the Maxpooling layer to extract the facial features. A drop connect layer is placed after the four Conv layers as a separator between them and the two fully connected layers. The output of the drop connect layer is passed to a flatten layer, which flattens it before it is passed to the first fully connected layer. Another dropout layer is used between the two fully connected layers. The softmax output layer is used to identify the classes. The purpose of the dropout layers is to prevent overfitting during training. Figure 1 shows the whole structure of the network, which predicts n classes (here, n is 3). The number of predicted classes n can be extended to cover as many nationalities as needed.
The input to this network is an image of size 128 × 128 × 3 (i.e., 3 feature maps). The patch size is 3 × 3 with same padding in every Conv layer and a stride of 1, which keeps the spatial size of the Conv layer output the same as that of its input.
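The effect of these settings on the feature map size can be traced with simple arithmetic. The sketch below assumes a pooling window of 2 after each of the four Conv layers; the paper only states an sz × sz window, so that value is hypothetical.

```python
import math

def conv_output_size(size, kernel, stride=1, padding="same"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return math.ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid" padding

def pool_output_size(size, pool):
    """Spatial output size of non-overlapping max pooling along one dimension."""
    return size // pool

size = 128   # input height and width stated in the paper
trace = []   # (size after Conv, size after Maxpooling) for each block
for _ in range(4):  # four Conv + Maxpooling blocks
    after_conv = conv_output_size(size, kernel=3, stride=1, padding="same")
    size = pool_output_size(after_conv, pool=2)  # pool=2 is an assumption
    trace.append((after_conv, size))
```

Under these assumptions each Conv layer preserves the spatial size and each pooling layer halves it, so the 128 × 128 input shrinks to 8 × 8 before the flatten layer.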
The output of each Conv layer is passed to a Maxpooling layer to reduce its size. After that, the output of each Maxpooling layer is fed to the ReLU activation function. The feature map equation of the Conv layer is

$$f(x)^{j(r)} = b^{j(r)} + \sum_{i} k^{ij(r)} * x^{i(r)},$$

where $f(x)^{j(r)}$ is the $j$th output patch of the convolutional layer in a particular region $r$ and $x^{i(r)}$ is the $i$th input patch of the convolutional layer in region $r$. The input of the first convolutional layer is an image of size 128 × 128 divided into regions according to the size of the window patch, which is 3 × 3, as shown in Figure 1. $b^{j(r)}$ is the bias of the $j$th output patch in the same region $r$. $k^{ij(r)}$ is the convolution kernel between the $i$th input patch and the $j$th output patch, and $*$ denotes the convolution. The output of each convolutional layer is passed to the Maxpooling layer, whose formula is as follows:

$$f(x)^{i}_{j,k} = \max_{0 \le m,\, n < sz} x^{i}_{j \cdot sz + m,\; k \cdot sz + n}.$$

Each neuron in the $i$th output patch $f(x)^{i}$ pools over an $sz \times sz$ local region in the $i$th input patch $x^{i}$. The output of the Maxpooling layer in each Conv block is passed to the ReLU nonlinearity $f(x) = \max(0, x)$. ReLU sets all negative values in the input $x$ to zero and leaves all other values unchanged, and it shows better fitting abilities than the sigmoid function [13].
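The order of operations described here (convolution, then max pooling, then ReLU) can be sketched in NumPy. This is an illustrative sketch rather than the authors' implementation: the number of filters (8), the random weights, and the pooling window `sz=2` are assumptions, and, as in most deep learning libraries, the "convolution" is computed without flipping the kernel.

```python
import numpy as np

def conv2d_same(x, kernels, bias):
    """Stride-1 convolution with zero 'same' padding.

    x: (H, W, C_in), kernels: (kh, kw, C_in, C_out), bias: (C_out,).
    """
    kh, kw, _, c_out = kernels.shape
    h, w, _ = x.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))  # zero padding on H and W only
    out = np.empty((h, w, c_out))
    for r in range(h):
        for c in range(w):
            window = padded[r:r + kh, c:c + kw, :]  # one 3 x 3 region of the input
            # Sum over the window and input channels, one value per output patch.
            out[r, c, :] = np.tensordot(window, kernels, axes=3) + bias
    return out

def max_pool(x, sz):
    """Non-overlapping sz x sz max pooling over the spatial dimensions."""
    h, w, c = x.shape
    h2, w2 = h // sz, w // sz
    x = x[:h2 * sz, :w2 * sz, :]
    return x.reshape(h2, sz, w2, sz, c).max(axis=(1, 3))

def relu(x):
    """ReLU nonlinearity: max(0, x) element-wise."""
    return np.maximum(0, x)

rng = np.random.default_rng(0)
image = rng.standard_normal((128, 128, 3))          # stand-in for an RGB face image
kernels = rng.standard_normal((3, 3, 3, 8)) * 0.1   # 8 filters is a hypothetical choice
bias = np.zeros(8)

features = relu(max_pool(conv2d_same(image, kernels, bias), sz=2))
```

Because max pooling and ReLU are both monotone, applying ReLU after pooling gives the same values as applying it before, while evaluating the nonlinearity on fewer elements.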
Some of the Conv outputs are passed to a dropout layer to prevent overfitting in the network. Three dropout layers are used: two of them after the second and the third Conv layers, and the third between the last two fully connected layers. The last layers are the two fully connected layers with a dropout layer between them. In the fully connected computation, $x_{i-1}$ and $w_{i-1,j-1}$ denote the neurons and the weights of the previous layer, respectively, and $x_i$ and $w_{i,j}$ denote the neurons and the weights of the first fully connected layer before they are passed to the $\mathrm{DOut}_{rate}$ layer. The output of the first fully connected layer is passed to $\mathrm{DOut}_{rate}$, where the rate is 0.5, and the output of $\mathrm{DOut}_{rate}$ is passed to the last fully connected layer. The output of the ConvNet is an $n$-way softmax that predicts the ethnicity of the face among $n$ different ethnic groups. The softmax works as follows:

$$y_j = \frac{\exp(x_j)}{\sum_{i=1}^{n} \exp(x_i)},$$

where $x$ is the vector of inputs to the output layer, which encodes the most important features used to recognize the face, and $y_j$ is the output for class $j$, with $j$ indexing the $n$ classes.
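As a small illustration of the n-way softmax output, the sketch below converts a vector of output-layer inputs into class probabilities for the three classes of the dataset. The input values are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math

def softmax(inputs):
    """Turn the inputs of the output layer into probabilities that sum to 1."""
    shift = max(inputs)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(value - shift) for value in inputs]
    total = sum(exps)
    return [e / total for e in exps]

CLASSES = ["Chinese", "Pakistani", "Russian"]  # the three classes in the collected dataset
inputs = [2.0, 0.5, -1.0]                       # hypothetical output-layer inputs

probs = softmax(inputs)
predicted = CLASSES[probs.index(max(probs))]
```

The predicted class is the one with the largest probability, which is also the one with the largest input to the softmax.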

Dropout Layers in the Network.
Sometimes in the testing phase, the results are not accurate because of the training error. Researchers attribute this to overfitting [14], and strong regularization such as dropout [15] is used to overcome the problem. The idea of dropout is to drop out some neurons in a neural network, where the neurons are chosen randomly with probability $q = 1 - p$. When a neuron is dropped out, its input and output connections are ignored, which allows each neuron to learn something useful on its own without relying too much on other neurons to correct its shortcomings [16,17]. Figure 2 illustrates the idea of dropout. Before dropout is applied, the input and output of each patch are computed as follows:

$$x^{l+1} = w^{l+1} y^{l} + b^{l+1}, \qquad y^{l+1} = AF\left(x^{l+1}\right),$$

where $l$ denotes the index of the network layer. $x^{l+1}$ is the input patch and $y^{l+1}$ is the output patch at a hidden layer $l = 1, \ldots, L - 2$, with $L$ being the number of layers. $w^{l+1}$ is the weight and $b^{l+1}$ is the bias. $AF$ denotes the activation function. The following operations occur when dropout is performed:

$$\sigma^{l}_{i} \sim \mathrm{Bernoulli}(p), \qquad \hat{y}^{l} = \sigma^{l} \oplus y^{l}, \qquad x^{l+1} = w^{l+1} \hat{y}^{l} + b^{l+1}, \qquad y^{l+1} = AF\left(x^{l+1}\right),$$

where $\oplus$ is element-wise multiplication and $\sigma^{l}_{i}$ is a Bernoulli random variable of the $i$th neuron at layer $l$ with probability $p$ of being 1.
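The masking step of dropout can be sketched as follows: each neuron output is multiplied element by element with a Bernoulli variable that is 1 with keep probability p, so a neuron is dropped with probability q = 1 − p. The function name, the seed, and the all-ones activations are hypothetical choices for illustration.

```python
import random

def dropout_forward(y, p, rng):
    """Training-time dropout: draw sigma_i ~ Bernoulli(p) and return sigma * y element-wise."""
    sigma = [1 if rng.random() < p else 0 for _ in y]   # 1 keeps a neuron, 0 drops it
    y_hat = [s * value for s, value in zip(sigma, y)]
    return y_hat, sigma

rng = random.Random(0)   # fixed seed so the sketch is reproducible
y = [1.0] * 10000        # hypothetical layer outputs
y_hat, sigma = dropout_forward(y, p=0.5, rng=rng)
kept_fraction = sum(sigma) / len(sigma)
```

With p = 0.5, about half of the neurons are zeroed on each forward pass. At test time no neurons are dropped; a common convention, not spelled out in the text, is to rescale the activations so that their expected value matches the one seen during training.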

Training Two Networks.
The first network consists of twelve layers, including the dropout layers. The training accuracy of this network is 96.9% and the validation accuracy is 96.9%, with a validation loss of 0.221, which means that overfitting has been largely eliminated, as shown in Figure 3. In the second network, all the dropout layers are omitted and the training accuracy is checked again. The training accuracy of that network is 100%, the validation accuracy is 96.9%, and the lowest validation loss is 0.525. This indicates substantial overfitting, and accordingly the error rate of the model created from that network is higher than that of the first network. Figures 3 and 4 show the training accuracy and validation accuracy of each network. In Figure 4, the training accuracy reaches 100% at epoch 18 and does not change until the end of training, which confirms the overfitting of the second network.

Experimental Training Dataset.
Although many large-scale facial image databases are available online, none of them is appropriate for the objective of this study. Therefore, we manually collected 3141 photos from different sources: 1081 Chinese facial images, 1021 Pakistani facial images, and 1039 Russian facial images. After collection, the images were processed to extract the faces from the whole images. The images were then divided into two sets: 70% of the images were used for the training stage, and the remaining 30% were used for the validation stage. Figure 5 shows a subset of the new dataset.
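A 70/30 split of this dataset can be sketched as below. The per-class counts are those reported above; the file names, the random shuffle, and the seed are hypothetical, since the paper does not describe how images were assigned to the two sets.

```python
import random

# Per-class image counts reported in the paper.
counts = {"Chinese": 1081, "Pakistani": 1021, "Russian": 1039}

# Hypothetical file names standing in for the collected face images.
samples = [(f"{label}_{i:04d}.jpg", label)
           for label, n in counts.items()
           for i in range(n)]

def train_val_split(items, train_fraction=0.7, seed=0):
    """Shuffle the items and split them into training and validation sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

train, val = train_val_split(samples)
```

With 3141 images, this yields 2198 training images and 943 validation images.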

Comparison with the State-of-the-Art Approaches.
Two state-of-the-art approaches, VGG [18] and Inception V3 [19], were selected. The last four layers of each approach were frozen, and our fully connected layers were used to set the number of outputs according to the number of classes in the dataset. Training was performed on a Tesla K80 GPU, which is freely provided by Google Colaboratory. Table 1 shows the training results of our network and the two state-of-the-art approaches. Our approach with regularization has the highest validation accuracy (96.6%) and the lowest validation loss (0.22), as shown in Figure 3. Figure 4 shows that our approach without regularization has the same validation accuracy (96.6%) but a higher validation loss (0.525), which indicates an overfitting problem. The validation accuracies of VGG and Inception V3 are 91.48% and 61.92%, with validation losses of 0.23 and 0.81, respectively, as shown in Figures 6 and 7. Tables 2 and 3 summarize, for the two models, the total number of images in each category and the numbers of images predicted correctly and incorrectly. The confusion matrix of each model is calculated to visualize its performance. Precision and recall, two widely used performance metrics, were used to evaluate the predictions of the models. The results are summarized in Table 4.
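Precision and recall follow directly from a confusion matrix, as the sketch below shows. The matrix values are hypothetical and are not the results reported in Tables 2 to 4.

```python
def precision_recall(confusion, labels):
    """Per-class precision and recall.

    confusion[i][j] is the number of images of true class i predicted as class j.
    """
    metrics = {}
    for k, label in enumerate(labels):
        true_positive = confusion[k][k]
        predicted_as_k = sum(row[k] for row in confusion)  # column sum
        actually_k = sum(confusion[k])                      # row sum
        precision = true_positive / predicted_as_k if predicted_as_k else 0.0
        recall = true_positive / actually_k if actually_k else 0.0
        metrics[label] = (precision, recall)
    return metrics

labels = ["Chinese", "Pakistani", "Russian"]
confusion = [          # hypothetical counts, for illustration only
    [90, 5, 5],
    [4, 92, 4],
    [6, 3, 91],
]
metrics = precision_recall(confusion, labels)
```

Precision answers how many of the images assigned to a class truly belong to it, while recall answers how many of the images of a class were found.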
Furthermore, a statistical significance test was conducted to compare the results of the two models. In this evaluation, the first model, with the dropout layers, has the higher accuracy (90.65%), while the second model, without the dropout layers, has the lower accuracy (76.70%).
In this study, we needed to insert dropout layers at specific places in our CNN to overcome overfitting and obtain high accuracy. It is difficult to use CNN architectures such as ResNet or SENet because they are heavy and take a long time to train, and overfitting is hard to control in them because their architectures are difficult to change. VGG and Inception V3 are also very heavy to train, and their architectures are likewise difficult to modify to control overfitting. This paper is based on Cohen's methods [20]. Cohen's methods measure the degree of agreement among the assigned labels, correcting for agreement by chance. In the evaluation, 1764 unseen images, which are not included in the training dataset, are used to evaluate the performance of each model. We found that the number of prediction errors of the second model, without dropout layers, is larger than that of the first model, with dropout layers.
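The chance-corrected agreement measure mentioned above is commonly computed as Cohen's kappa, (observed agreement − expected agreement) / (1 − expected agreement). The sketch below is a minimal version under that assumption; the label lists are hypothetical, and the paper does not specify exactly which of Cohen's statistics was used.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two label assignments, corrected for chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both assignments were independent with these frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both assignments use one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical true labels and model predictions (C, P, R stand for the three classes).
truth = ["C", "C", "P", "P", "R", "R"]
predicted = ["C", "C", "P", "R", "R", "R"]
kappa = cohen_kappa(truth, predicted)
```

In this example the observed agreement is 5/6 and the chance agreement is 1/3, which gives a kappa of 0.75.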

Conclusions
In this paper, we propose a new deep convolutional neural network designed to build a model that can recognize the ethnicity of people from their facial features. The new model is compared with two state-of-the-art models, VGG and Inception V3, and the validation accuracy is calculated for each convolutional neural network. Two models are created from the proposed convolutional neural network, one with dropout layers and one without, to determine the effect of regularization on the performance of the models.
A new dataset is collected for the training phase to identify the ethnicity of people from images from three different regions. This dataset is considered the first dataset collected for this purpose, and it will be available to the research community. Another dataset of unseen images is collected to evaluate the two models, and a statistical significance test is conducted to compare their performance.

Data Availability
The collected dataset has been uploaded to the following URL: https://drive.google.com/file/d/1brRMSh7XDR7h5awgXudQXBqxAIiYSHy_/view?usp=sharing.

Disclosure
The funding sponsors had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; and in the decision to publish the results.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.