Deep Convolutional Neural Network Used in Single Sample per Person Face Recognition

Face recognition (FR) with a single sample per person (SSPP) is a challenging problem in computer vision. Because only one sample per person is available for training, facial variations such as pose, illumination, and disguise are difficult to predict. To overcome this problem, this paper proposes a scheme that combines traditional and deep learning methods (TDL). First, it proposes an expanding sample method based on a traditional approach. Compared with other expanding sample methods, it is simple and convenient to use, and it can generate samples with disguise, expression, and mixed variations. Second, it uses transfer learning to introduce a well-trained deep convolutional neural network (DCNN) model and then selects some of the expanded samples to fine-tune that model. Third, the fine-tuned model is used in the experiments. Experimental results on the AR face database, Extend Yale B face database, FERET face database, and LFW database demonstrate that TDL achieves state-of-the-art performance in SSPP FR.


Introduction
As artificial intelligence (AI) grows in popularity, computer vision (CV) has become a very active research area, covering topics such as face recognition [1], facial expression recognition [2], and object recognition [3]. Most CV methods rest on the assumption that a large number of training samples is available. In practical scenarios such as immigration management, fugitive tracing, and video surveillance, however, there may be only one sample per person, which leads to the single sample per person (SSPP) problem in tasks such as gait recognition [4], face recognition (FR) [5,6], and low-resolution face recognition [7]. With the wide use of second-generation ID cards, from which a single face image is easy to collect, SSPP FR has become one of the most popular topics in both academia and industry.
Beymer and Poggio [8] posed the one-example-view problem in 1996, studying how to perform face recognition (FR) from a single example view. First, they exploited prior knowledge to generate multiple virtual views. Then, the example view and these virtual views were used as example views in a view-based, pose-invariant face recognizer. SSPP FR subsequently became a popular research topic at the beginning of the 21st century.
Recently, many methods have been proposed. Generally speaking, they fall into five basic categories: direct methods, generic learning methods, patch-based methods, expanding sample methods, and deep learning (DL) methods. A direct method applies an algorithm to the single samples directly. A generic learning method uses an auxiliary dataset to build a generic dataset from which variation information can be learned for the single sample. A patch-based method first partitions the single sample into several patches, then extracts features from each patch, and finally performs classification. An expanding sample method uses special means such as perturbation [9,10], photometric transforms, and geometric distortion [11] to increase the number of samples so that abundant training samples are available for the task. A DL method uses a DL model to perform the task.
Attracted by the good performance of DCNNs, inspired by [12], and driven by AI, this paper proposes a scheme that combines traditional and DL methods (TDL). The framework of TDL is illustrated in Figure 1. First, an expanding sample method is proposed to increase the number of samples and so overcome the shortage of samples in SSPP FR. Second, a learned DCNN model is brought in, and some of the expanded samples are selected to fine-tune it. Finally, the fine-tuned model is used in the experiments.
This is an extended version of our conference papers [13,14]. The contributions of this paper are as follows: (i) We propose a novel expanding sample method. Compared with other expanding sample methods, it is easier and more convenient to use.
The remaining parts of the paper are structured as follows. Section 2 introduces related work. Section 3 presents the expanding sample method. Section 4 presents the deep learning method. Section 5 describes the experiments. Section 6 concludes the paper and indicates future work.

Related Works
In recent years, many researchers have worked on SSPP FR, and good results have been obtained. Deng et al. [15] proposed the extended sparse representation-based classifier (ESRC) to classify query samples against gallery samples. With the help of an auxiliary training set, it used the variations in that set to represent the variations that the gallery set lacks. Lu et al. [16] proposed a discriminative multimanifold analysis (DMMA) method. It obtained patches of each training sample by segmenting the image and then used these patches to learn discriminative features. Mohammadzade and Hatzinakos [17] learned expression subspaces to achieve expression invariance. They pointed out that the same expression shares the same expression subspace, so a new image can be generated by projecting an expression image onto an expression subspace. Yang et al. [18] proposed the sparse variation dictionary learning (SVDL) method. It adaptively connected the generic set and the gallery set by jointly learning a projection, built a sparse dictionary with adequate variations, and performed SSPP FR by projecting the variation dictionary onto the gallery set space. Li et al. [19] extended linear discriminant analysis (LDA) to the SSPP FR problem and produced extra useful training samples in a low-dimensional subspace by random projection. Zhu et al. [6] proposed a framework based on local generic representation to solve the SSPP FR problem. It built the intraclass variation dictionary in the same way as ESRC and partitioned the face image into several patches to extract local information. Liu et al. [20] proposed a fast FR method based on DMMA. First, it clustered persons into two groups using a rectified K-means method. Second, it partitioned the face image into several nonoverlapping patches and applied DMMA to these patches. Third, fast DMMA was obtained by repeating the former two steps.
Liu et al. [21] solved the SSPP FR problem by using the sparse representation-based classifier (SRC) and local structure, which alleviated the difficulty of high-dimensional data with few samples. Mokhayeri et al. [22] expanded the training set by using an auxiliary set. Gao et al. [23] presented a regularized patch-based representation method: each image is represented by a collection of patches, and their sparse representations are sought under the gallery image patches and intraclass variance dictionaries. Song et al. [5] proposed a triple local feature-based collaborative representation method to make full use of the training sample. First, it extracted Gabor features at different scales and directions. Second, it partitioned each Gabor feature into several local patches to obtain triple local features covering local scale, local direction, and local space. Third, it performed local collaborative representation and classification based on these triple local features. Zhang and Peng [24] used a deep autoencoder to generalise intraclass variations, which were then used to reconstruct new samples. First, images in the gallery are used to train a generalised deep autoencoder. Second, each person's single sample is used to fine-tune a class-specific deep autoencoder (CDA). Third, the corresponding CDA is used to reconstruct new samples. Finally, the reconstructed samples are used for classification. Gu et al. [25] proposed the local robust sparse representation (LRSR) method, which combined a local sparse representation model with a patch-based generic variation dictionary learning model to predict the possible facial intraclass variations of the query images.
Ding et al. [26] partitioned the aligned face image into several nonoverlapping patches to form the training set, then used a kernel principal component analysis network to obtain filters and feature banks, and finally used a weighted voting method to identify the unlabeled probe. Based on a robust representation and a probabilistic graph model, Ji et al. [27] proposed an algorithm to address this problem. They used label propagation to construct probabilistic labels for the samples in the generic training set corresponding to those in the gallery set. At the classification stage, a reconstruction-based classifier is used. Inspired by discriminant manifold learning and binary encoding, Zhang et al. [28] constructed local histogram-based facial image descriptors. They partitioned every image into several nonoverlapping patches, found a matrix that projects these patches onto an optimal subspace maximizing the manifold margins between different people, reshaped each column of the matrix into an image filter for processing facial images, and binarized the filter responses by thresholding. For classification, they computed region-wise histograms of the pixels' binary codes and concatenated them to form the representation of the test image. Dong et al. [29] proposed a k-nearest-neighbor virtual image set-based multimanifold discriminant learning method. Inspired by the fact that similar faces have similar intraclass variations, they put forward a virtual sample generation algorithm to enrich the intraclass variation information of the training samples. In addition, they introduced an image set-based multimanifold discriminant learning algorithm to exploit this intraclass variation information.
However, most of these methods are traditional methods, and few are DL methods, even though DL has recently been very active in CV and performs well on CV tasks. Gao et al. [12] proposed a DL method to solve the SSPP FR problem by learning deep supervised autoencoders. First, a supervised autoencoder enforced faces with variations to be mapped to the canonical face of the same person and enforced the features of the same person to be similar. Then, such supervised autoencoders were stacked to obtain a deep architecture. Finally, the deep supervised autoencoder was used to extract features. To date, no DCNN method has been applied to this task, but given the good performance of DCNNs in CV, it is a promising approach.

Expanding Sample Method
To overcome the lack of training samples in SSPP FR, we propose an expanding sample method. It first learns an intraclass variation set and then adds this set to the single sample to expand it. The principle is illustrated in Figure 2. The details of generating the intraclass variation set are as follows.
First, generate intraclass variation images from the images of an extra frontal face dataset. Suppose that the dataset contains $m$ subjects and that each subject has $(n - 1)$ variation images and one neutral image. We denote the dataset by $X$ and let $X_{ij}$ represent the $i$th person's $j$th image, where $i \in [1, m]$, $j \in [1, n]$, and $j = 1$ denotes the neutral face. Subtracting the corresponding neutral image $X_{i1}$ from each variation image $X_{ij}$ ($j \neq 1$) gives the variation of that image relative to its neutral image, which we write as

$$V_{ij} = X_{ij} - X_{i1}, \quad j \neq 1,$$

which represents the $i$th subject's $j$th intraclass variation image relative to its neutral image. Then, to reduce the error of individual intraclass variation images, average the intraclass variation images that share the same variation:

$$\bar{V}_j = \frac{1}{m} \sum_{i=1}^{m} V_{ij}.$$

Finally, construct the intraclass variation set from the average intraclass variation images learned in the previous step:

$$V = \{\bar{V}_2, \bar{V}_3, \ldots, \bar{V}_n\}.$$

The specific steps of generating the intraclass variation set are summarized in Table 1, and the framework is illustrated in Figure 3.
Later, with the help of C++ and MATLAB, the face is detected in and cropped from a new input image, and the cropped face image is resized to the same size as the images in the intraclass variation set. Finally, the intraclass variation set is added to the aligned face image to expand it:

$$D_{ek} = \{X_{k1} + \bar{V}_j \mid j = 2, \ldots, n\},$$

where $X_{k1}$ represents the neutral face image of person $k$ and $D_{ek}$ represents the expanded samples of person $k$.
With this method, a single sample is expanded into many samples. The framework of expanding samples is shown in Figure 4.
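The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' C++/MATLAB implementation: the function names `build_variation_set` and `expand_sample`, the array layout `(m, n, h, w)` with the neutral image at index 0, the clipping to the 0-255 range, and the random toy data are all hypothetical choices made for the example.

```python
import numpy as np


def build_variation_set(faces):
    """Build the averaged intraclass variation set.

    faces: array of shape (m, n, h, w); faces[i, 0] is subject i's neutral image
    and faces[i, 1:] are that subject's variation images (assumed layout).
    Returns an array of shape (n - 1, h, w), one averaged variation per type.
    """
    faces = np.asarray(faces, dtype=np.float64)
    neutral = faces[:, :1]                # (m, 1, h, w), broadcast over variations
    variations = faces[:, 1:] - neutral   # variation image minus neutral image
    return variations.mean(axis=0)        # average over the m subjects


def expand_sample(neutral_face, variation_set, clip=(0.0, 255.0)):
    """Add every averaged variation to one neutral face of matching size."""
    neutral_face = np.asarray(neutral_face, dtype=np.float64)
    expanded = neutral_face[None, :, :] + variation_set
    # Clipping to the valid intensity range is an assumption, not from the paper.
    return np.clip(expanded, *clip)


# Toy data only: m = 5 subjects, n = 4 images each, 8 x 8 pixels.
rng = np.random.default_rng(0)
aux = rng.uniform(0, 255, size=(5, 4, 8, 8))
variation_set = build_variation_set(aux)
new_face = rng.uniform(0, 255, size=(8, 8))
samples = expand_sample(new_face, variation_set)
print(samples.shape)  # (3, 8, 8)
```

Averaging over subjects before adding the variation to a new face is what lets a variation learned from an auxiliary dataset be reused for a person who has only one image.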

Deep Learning Method
Because a DCNN needs a large number of training samples, it is difficult to use in SSPP FR. To solve this problem, we first use transfer learning to introduce a well-trained DCNN. Then, we select some of the expanded samples to fine-tune the learned DCNN. Finally, we use the fine-tuned DCNN in the experiments.

Transfer Learning.
Transfer learning uses knowledge learned from one specific scene to help another application scenario. In other words, it uses auxiliary data to learn a model or mapping and then uses the model or mapping to do a new task.
Since there is only one training sample per person in SSPP FR, a DCNN, which needs abundant training data, is difficult to use. Therefore, we use transfer learning to introduce a well-trained DCNN model. Here, we rely on the lightened CNN [30], which learns a compact embedding for face recognition.
Different from other DCNN models, the lightened CNN introduces a new activation function named Max-Feature-Map, which brings the maxout operation used in fully connected layers into the convolution layers. Given an input convolution layer $C \in \mathbb{R}^{h \times w \times 2n}$, the Max-Feature-Map activation function can be written as follows:

$$\hat{c}^{k}_{ij} = \max\left(c^{k}_{ij},\, c^{k+n}_{ij}\right),$$

where the number of channels of the input convolution layer is $2n$, $1 \le k \le n$, $1 \le i \le h$, and $1 \le j \le w$. The architecture of the lightened CNN is illustrated in Figure 5.
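As a rough illustration of the Max-Feature-Map activation, the following NumPy sketch takes an input with $2n$ channels in the last axis and returns $n$ channels, each the element-wise maximum of a pair of input channels. The function name `max_feature_map` and the channel-last layout are assumptions made for this example; the lightened CNN [30] itself is not implemented here.

```python
import numpy as np


def max_feature_map(c):
    """Max-Feature-Map over the channel axis.

    c: array of shape (h, w, 2n). Returns an array of shape (h, w, n) where
    output channel k is the element-wise max of input channels k and k + n.
    """
    c = np.asarray(c)
    if c.shape[-1] % 2 != 0:
        raise ValueError("channel count must be even")
    n = c.shape[-1] // 2
    return np.maximum(c[..., :n], c[..., n:])


# Toy input: h = 2, w = 2, 2n = 4 channels.
x = np.arange(2 * 2 * 4, dtype=float).reshape(2, 2, 4)
x[..., 0] += 100.0  # make channel 0 win against channel 2
y = max_feature_map(x)
print(y.shape)  # (2, 2, 2)
```

Unlike ReLU, which thresholds each channel independently, this activation halves the number of channels by keeping the stronger response in each pair.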

Fine-Tuning.
The lightened CNN is trained on the CASIA-WebFace database, which contains 10,575 persons and a total of 493,456 face images. Before training, the database is preprocessed: the images are converted to grayscale and normalized to 144 × 144. Training on the preprocessed database yields a well-trained model, which we use for fine-tuning. Some of the expanded samples are selected and fed to the well-trained model for fine-tuning, and the fine-tuned model is then used in the experiments.
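The preprocessing step (grayscale conversion followed by normalization to 144 × 144) might look like the sketch below. The text does not say how the conversion or resizing was done, so the ITU-R BT.601 luma weights, the nearest-neighbour interpolation, and the helper names `to_grayscale` and `resize_nearest` are assumptions chosen to keep the example dependency-free apart from NumPy.

```python
import numpy as np


def to_grayscale(rgb):
    """Convert an (h, w, 3) RGB image to (h, w) using ITU-R BT.601 luma weights."""
    weights = np.array([0.299, 0.587, 0.114])  # assumed weighting
    return np.asarray(rgb, dtype=np.float64) @ weights


def resize_nearest(img, size=144):
    """Nearest-neighbour resize of a 2-D image to (size, size)."""
    h, w = img.shape
    rows = np.arange(size) * h // size  # source row for each output row
    cols = np.arange(size) * w // size  # source column for each output column
    return img[rows[:, None], cols[None, :]]


# Toy input standing in for a cropped colour face image.
rgb = np.random.default_rng(0).integers(0, 256, size=(200, 180, 3))
face = resize_nearest(to_grayscale(rgb))
print(face.shape)  # (144, 144)
```

In practice an image library with proper interpolation would normally be used; the point here is only that every image reaches the network as a single-channel 144 × 144 array.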
Because the expanding sample method is proposed as part of TDL, the methods used for comparison do not use the generated training images. The expanding sample method itself has nevertheless been shown to perform well compared with the direct method [50].

Similarity.
Here, we use the AR face database to produce the intraclass variation set. For brevity, the expanded images are numbered 1, 2, 3, ..., 26 according to their types of variation. Their meanings are as follows: 1: neutral expression, 2: smile, 3: anger, 4: scream, 5: left light on, 6: right light on, 7: all side lights on, 8: wearing sunglasses, 9: wearing sunglasses and left light on, 10: wearing sunglasses and right light on, 11: wearing scarf, 12: wearing scarf and left light on, 13: wearing scarf and right light on, and 14 to 26: the same conditions as 1 to 13 but captured in a different period. We divide these images into two sessions: session 1 includes 1 to 13, and session 2 includes 14 to 26. To evaluate the similarities between expanded samples and actual images, an algorithm is proposed. The details of measuring these similarities are as follows.
First, calculate the Euclidean distances $E_d$ between expanded samples and actual images. Suppose that there are $m$ persons and $n$ variations; we label the expanded samples as $D_e$ and the actual samples as $D_a$. Each pixel of the $i$th person's image with the $j$th variation in the actual images, $D_{aij}$, is subtracted from the corresponding pixel of the $i$th person's image with the $j$th variation in the expanded samples, $D_{eij}$. This gives the Euclidean distance $E_{dij}$ between the expanded sample and the actual image of the $i$th person with the $j$th variation:

$$E_{dij} = \sqrt{\sum_{p} \left(D_{eij}(p) - D_{aij}(p)\right)^2},$$

where $p$ runs over the pixels, $i \in [1, m]$, and $j \in [1, n]$. Second, calculate the average Euclidean distance of the $j$th variation, $E_{dj}$, which is used as the threshold of the $j$th intraclass variation:

$$E_{dj} = \frac{1}{m} \sum_{i=1}^{m} E_{dij}.$$

Third, count the number of similar images. Let $N_j$ represent the number of similar images for the $j$th variation. When the Euclidean distance $E_{dij}$ is larger than the threshold $E_{dj}$, the expanded sample is regarded as not similar to the actual image; otherwise, it is regarded as similar:

$$N_j = \left|\{\, i \in [1, m] : E_{dij} \le E_{dj} \,\}\right|.$$

Finally, calculate the similarity $\eta_j$ of the $j$th variation between expanded samples and actual samples:

$$\eta_j = \frac{N_j}{m}.$$

The specific steps are shown in Table 2. The thresholds of intraclass variation and the similarities are shown in Tables 3 and 4, respectively. From Table 4, we can see that several similarities are very low, which may be detrimental to the experimental results, so it is necessary to select the best intraclass variation set.
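The four steps of the similarity measure can be sketched as follows. This is an illustrative reading of the description rather than the authors' code: the function name `variation_similarity`, the `(m, n, h, w)` array layout, and the tiny one-pixel toy images are hypothetical, and the similarity is returned as a fraction in [0, 1] because the text does not state whether it is reported as a percentage.

```python
import numpy as np


def variation_similarity(expanded, actual):
    """Compare expanded samples with actual images, per variation type.

    expanded, actual: arrays of shape (m, n, h, w) for m persons, n variations.
    Returns (distances, thresholds, counts, similarity):
      distances[i, j]  Euclidean distance E_dij between the two images
      thresholds[j]    mean distance E_dj over the m persons
      counts[j]        N_j, images whose distance does not exceed E_dj
      similarity[j]    eta_j = N_j / m
    """
    expanded = np.asarray(expanded, dtype=np.float64)
    actual = np.asarray(actual, dtype=np.float64)
    diff = expanded - actual
    distances = np.sqrt((diff ** 2).sum(axis=(2, 3)))  # sum over pixels
    thresholds = distances.mean(axis=0)                 # average over persons
    counts = (distances <= thresholds).sum(axis=0)      # "bigger" means not similar
    similarity = counts / distances.shape[0]
    return distances, thresholds, counts, similarity


# Toy data: m = 4 persons, n = 2 variations, 1 x 1 images.
actual = np.zeros((4, 2, 1, 1))
expanded = np.array(
    [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0], [5.0, 2.0]]
).reshape(4, 2, 1, 1)
d, t, n_sim, eta = variation_similarity(expanded, actual)
print(t, n_sim, eta)
```

In the toy data the first variation has one outlier, so three of four images fall within the mean-distance threshold, while for the second variation every distance equals the threshold and all four images count as similar.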
To test the influence of these expanded samples, we measure the accuracies and losses in session 1 and session 2 when Part I, Part II, Part III, Part IV, Part V, and Part VI are each used to fine-tune the lightened CNN model. Each fine-tuned model is then evaluated on the AR face database. The accuracies and losses are shown in Figures 6-9.
According to Figures 6-9, the accuracies in Figure 6 are highest when the fine-tuning number is 1800, and the same holds in Figure 7. Likewise, the errors in Figure 8 are lowest when the fine-tuning number is 1800, and the same holds in Figure 9. All in all, Part I is selected for the experiments. Correspondingly, the models used to produce Part I are selected as the final version of the intraclass variation set.
We can see from Table 5 that the direct method has the poorest performance among these methods and that the patch-based method is better than the generic learning method. The patch-based method TLC outperforms the generic learning method LGR by 0.4%, 0.6%, and 1.8% under the expression, disguise, and illumination-with-disguise conditions, respectively. Under the same conditions, TDL outperforms TLC by 1.7%, 0.1%, and 1.2%, respectively. In addition, the accuracies under the expression and illumination conditions reach 100%.
In Table 6, the patch-based method TLC is very competitive: it outperforms the generic learning method LGR by 1.7%, 2.1%, 2.5%, and 3.1% under different conditions. The proposed TDL, however, outperforms TLC by 0.8%, 12.9%, 3.7%, and 7.4%, respectively. In particular, the accuracies obtained by TDL reach 100% under the illumination, expression, and disguise conditions. The accuracies in Tables 5 and 6 are very high. On the one hand, the images in the AR face database were taken under strictly controlled conditions. On the other hand, the intraclass variation set has the same variations as the images of the AR face database.

Extend Yale B Face Database.
The Extend Yale B face database contains 38 subjects, and each subject has 64 images under different pose and illumination conditions. Unlike other experiments, which use one part of the database as testing samples and another as generic and training samples, in this experiment the intraclass variation set is added to the neutral, normally illuminated image of each subject to obtain adequate training samples, and the rest of the database is used as testing samples. These expanded samples are used to fine-tune the well-trained DCNN model, and the fine-tuned model is then used in the experiment. The accuracies obtained by different methods are shown in Table 7.
The direct method still has the lowest recognition rate, and the DL method SSAE is better than the direct method; however, the generic learning methods SVDL and LGR outperform SSAE by 2.8% and 4.4%, respectively. TDL in turn outperforms SVDL and LGR by 3.3% and 1.7%, respectively. We also find that the accuracy on the Extend Yale B face database is lower than that on the AR face database. For one thing, the expanded samples do not share the same variations as the testing samples. For another, the images in the Extend Yale B face database deviate more from their neutral images than those in the AR face database.

FERET Face Database.
The FERET face database contains 200 subjects with 1400 images under different pose, expression, and illumination conditions. The neutral, normal image of each person is used as the single sample, which is expanded by adding the intraclass variation set to it. The rest is used as testing samples. These expanded samples are used to fine-tune the DCNN model. Then, the fine-tuned DCNN model is used in the experiment. Table 8 lists the accuracies of different methods.
We can see from Table 8 that the direct method consistently performs worse than other methods, and the expanding sample method also gives poor results. The expanding sample method SVD-LDA outperforms the direct method PCA by 1.5%; however, the best direct method, SOM, outperforms SVD-LDA by 5.5%, and the patch-based method DMMA outperforms SOM by 2%. The proposed method TDL achieves the best performance and outperforms the second-best method, DMMA, by 0.9%.

LFW Database.
The LFW database contains 1680 subjects with more than 13000 images, which were collected from the Web under many unconstrained conditions. Following [6], LFW-a is used in the experiment. We select 50 persons from LFW-a who have more than 10 images each. These images are preprocessed before being used. First, the face images are cropped. Second, the cropped face images are resized to 144 × 144. Third, the intraclass variation set is added to one image of each person to get more training samples. Finally, these expanded samples are used to fine-tune the DCNN model, and the remaining images of the database are tested on the fine-tuned model. Table 9 presents the accuracies obtained by different methods.
The accuracies of the competing methods are all very low, none exceeding 31%; the proposed method TDL achieves the best accuracy, 74%, outperforming the second-best method, LGR, by 43.6% and reaching more than twice its accuracy. Notably, the images in the LFW database were taken under unconstrained conditions. The experimental result shows that although the intraclass variation set is obtained from constrained images, it can also be used under unconstrained conditions. From Tables 7-9, TDL has the best performance compared with the other methods, even though the intraclass variation set is obtained from another database. On the one hand, this demonstrates that the intraclass variation set is widely applicable. On the other hand, it shows that TDL generalizes well.
From Tables 5-9, we find that the direct method is the poorest method, the expanding sample method is the second poorest, the generic learning method is better than the expanding sample method, and the patch-based method is the best among these existing methods; the DL method SSAE performs worse than the generic learning method, but the proposed method TDL is better than the patch-based method. This indicates that TDL not only outperforms the expanding sample method but also performs better than the direct, generic learning, and patch-based methods and another DL method. In addition, recognition rates on the AR face database are very high because the intraclass variation set is learned from the same database. The recognition rate on the LFW database is the lowest among these databases because the model assumes frontal faces, so the final system works only with frontal faces; when it is tested on the LFW database, which includes nonfrontal faces, the recognition rate drops sharply.

Conclusion and Future Work
In this paper, we propose a scheme that combines traditional and DL methods (TDL) for single sample per person (SSPP) face recognition (FR). First, a novel expanding sample method is proposed to increase the number of training samples. Second, the similarities between expanded samples and actual samples are validated, and the best intraclass variation set is selected as the expanding sample model based on the similarity and the performance on these actual samples. Third, the selected intraclass variation set is used to expand the training sample, and the DCNN model is fine-tuned. Finally, experiments are carried out with the fine-tuned DCNN model. Extensive experimental results on several databases, including the AR face database, Extend Yale B face database, FERET face database, and LFW database, demonstrate that TDL achieves state-of-the-art performance among these methods in SSPP FR. In addition, this paper is a pioneering use of a DCNN in SSPP FR, which shows that a DCNN can be used with a single sample or few samples. In future work, we will continue to study how to improve the accuracy and practicability of the method and how to strictly align the new image with the reference images.

Conflicts of Interest
The authors declare that they have no conflicts of interest.
Table 9: Accuracy on LFW database.