Many types of deep neural networks have been proposed for human biometric identification, especially in the areas of face detection and recognition. Local deep neural networks have recently been used in face-based age and gender classification; despite their improved performance, their model training cost remains rather expensive. In this paper, we propose to construct a local deep neural network for age and gender classification. In our proposed model, local image patches are selected based on the detected facial landmarks; the selected patches are then used for network training. A holistic edge map of the entire image is also used to train a "global" network. The age and gender classification results are obtained by combining the outputs of both the "global" and the local networks. Our proposed model is tested on two face image benchmark datasets, on which it achieves competitive performance compared to state-of-the-art methods.
Age estimation and gender classification from face images play important roles in many computer vision-based applications, such as visual surveillance, security control, and human-computer interaction. Over the past few decades, many methods have been proposed to tackle the age and gender classification task.
In early works, pixel intensity values are used directly as input to train a classifier such as neural network [
In recent years, deep learning, especially convolutional neural networks (CNN) [
In order to reduce the cost of CNN model training, a local deep neural network (LDNN) was proposed [
The success of LDNNs in age and gender classification and the related findings from previous LDNN works inspire us to propose an LDNN model for age and gender estimation. In our proposed model, local image patch selection is based on the detected facial landmarks; that is, the image patches used for network learning are generated dynamically. Therefore, the number of image patches can be greatly reduced while all the important information in a face image is kept.
In [
The remainder of the paper is listed as follows. In Section
The successful applications of CNN on many computer vision tasks have revealed that CNN is a powerful tool in image learning. If enough training data are given, CNN is able to learn a compact and discriminative image feature representation. Therefore, many researchers propose to use CNN in age and gender classification from face images. In this section, the related work on age and gender classification using CNN is briefly reviewed. The previous research on local deep neural networks for age and gender estimation is also introduced.
An early CNN model used for age and gender estimation can be seen in [
Some researchers suggest using deeper networks for age and gender estimation. Yang et al. introduce deep label distribution learning for apparent age estimation, where distribution-based loss functions are used for training; these can exploit the uncertainty induced by manual labeling to learn a better model than using exact ages as the target [
Compared with CNNs, LDNNs use a different training strategy: small image patches around important regions of faces are extracted and used for network learning. An LDNN model for gender recognition is proposed in [
Another LDNN model was proposed recently, which aims to further reduce the number of image patches used for training [
Illustration of the 9 image patches for local neural network training in [
In this section, we describe the proposed architecture for age and gender classification. Our methodology consists of three steps: (1) face detection and facial landmark localization, (2) image patch selection based on the obtained facial landmarks, and (3) construction of the LDNN model. In the following, the three parts are described in detail.
The first step of our proposed model is to detect a face in an image and to obtain the facial landmarks on the face, both of which are widely investigated areas [
The model is based on a mixture of trees with a shared pool of parts
Each tree-structured pictorial structure [
In (
Equation (
The model is trained in a fully supervised scheme, where positive images with landmarks and mixture labels as well as negative images without faces are provided. The shape and appearance models are learned using a structured prediction framework. The Chow-Liu algorithm [
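The tree-structured pictorial-structure scoring described above can be sketched as a sum of per-part appearance scores minus a quadratic "spring" deformation cost on each tree edge. The following is an illustrative NumPy sketch, not the paper's trained model; the function names, spring constant, and rest offsets are assumptions.

```python
import numpy as np

def shape_cost(parent_xy, child_xy, rest_offset, spring=1.0):
    """Quadratic deformation cost for one tree edge (a hypothetical spring model)."""
    dx, dy = np.asarray(child_xy) - np.asarray(parent_xy) - np.asarray(rest_offset)
    return spring * (dx * dx + dy * dy)

def structure_score(appearance, landmarks, edges, rest_offsets):
    """Total score: summed appearance scores minus deformation costs.

    appearance: per-part appearance scores; edges: (parent, child) index pairs;
    rest_offsets: the preferred displacement for each edge.
    """
    score = float(np.sum(appearance))
    for (p, c), off in zip(edges, rest_offsets):
        score -= shape_cost(landmarks[p], landmarks[c], off)
    return score
```

A landmark configuration that matches the rest offsets incurs no deformation penalty, so its score is just the appearance sum; displaced parts are penalized quadratically.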
Landmark localization of a sample face image; the red points indicate the center of the detected landmark regions.
Once the facial landmarks are obtained, the local image patches can be determined. As shown in Figure
The image patch selection used here differs from that of the two previous LDNN-based methods in [
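Landmark-centred patch extraction can be sketched as follows; this is a minimal illustration assuming landmark coordinates are (row, col) pixel positions, with 13 × 13 used as one of the patch sizes evaluated in the experiments.

```python
import numpy as np

def extract_patches(image, landmarks, size=13):
    """Crop a size x size patch centred on each detected landmark."""
    half = size // 2
    h, w = image.shape[:2]
    patches = []
    for r, c in landmarks:
        # clamp the window so patches near the border keep the full size
        r0 = min(max(r - half, 0), h - size)
        c0 = min(max(c - half, 0), w - size)
        patches.append(image[r0:r0 + size, c0:c0 + size])
    return patches
```

Because only patches around landmarks are kept, the number of training samples per image is bounded by the number of detected landmarks rather than by a dense sliding-window grid.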
Illustration of the five rows split of a face image in [
LDNNs are trained using the image patches extracted from the landmark regions of face images. As most of the redundant information has been discarded in our patch selection process, the remaining training patches are unlikely to cause overfitting. Therefore, it is reasonable to use a simple feed-forward neural network. The network architecture used in [
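A feed-forward network of the kind described above (three hidden layers of 512 ReLU units, as in the parameter table later, with a softmax output over the classes) can be sketched in NumPy as follows. The random initialization scale is an illustrative assumption; in the paper the weights are learned with SGD + momentum and dropout.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def init_mlp(n_in, n_hidden=512, n_layers=3, n_out=2):
    """Weights and biases for a simple MLP: n_in -> 512 -> 512 -> 512 -> n_out."""
    sizes = [n_in] + [n_hidden] * n_layers + [n_out]
    return [(rng.normal(0, 0.01, (a, b)), np.zeros(b))
            for a, b in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    """Forward pass: ReLU hidden layers, softmax class probabilities."""
    for w, b in params[:-1]:
        x = relu(x @ w + b)
    w, b = params[-1]
    return softmax(x @ w + b)
```

For a 13 × 13 grayscale patch the input dimension is 169, and the output is a probability vector over the two gender classes (or the eight age groups).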
LDNN model used in [
The whole procedure of our method is shown in Figure
The main architecture of the proposed classification model.
HED is a deep learning-based edge detection method; it aims to obtain a network that learns features from which edge maps approaching the ground truth can be produced. HED uses a multiscale and multilevel structure to generate five side outputs, which are fused to improve the final result. The architecture of HED can be seen in Figure
The HED architecture [
In Figure
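The fusion of the five HED side outputs can be sketched as a weighted sum of same-sized edge-probability maps. The equal default weights below are illustrative placeholders, not HED's learned fusion weights.

```python
import numpy as np

def fuse_side_outputs(side_outputs, weights=None):
    """Weighted fusion of same-shape edge-probability maps into one edge map."""
    side_outputs = np.stack(side_outputs)           # (n_sides, H, W)
    if weights is None:
        weights = np.full(len(side_outputs), 1.0 / len(side_outputs))
    fused = np.tensordot(weights, side_outputs, axes=1)  # weighted sum over sides
    return np.clip(fused, 0.0, 1.0)                 # keep values in [0, 1]
```

In HED itself the side outputs come from different depths of the network and are resized to a common resolution before fusion; here they are simply assumed to already share one shape.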
A series of experiments has been conducted on two popularly used face image datasets, the LFW database and the Adience database. In this section, the datasets used in our experiments are introduced first; then the parameter settings of the experiments are described. Finally, the experimental results of gender and age estimation are given.
The labeled faces in the wild (LFW) database contains 13,233 face photographs labeled with the name and gender of the person pictured. Images of faces were collected from the web with the only constraint that they were detected by the Viola-Jones face detector [
Sample images from the LFW database.
There are four versions of LFW—the original version, funneled version, deep funneled version, and frontalized version (3D version). LFW is an imbalanced database including 10,256 images of men and 2977 images of women from 5749 subjects, 1680 of whom have two or more images [
There are 26,580 face images from 2284 persons in the Adience dataset [
There are three versions of the Adience database, including the original version, aligned version, and frontalized version (3D version) with 26,580, 19,487, and 13,044 images, respectively. The 3D version is used in this work since most images are already frontalized and aligned to the centre of the image. However, images in the Adience database 3D version may be extremely blurry or frontalized incorrectly as shown in Figure
Sample images from Adience database.
There are three subsets of the Adience dataset 3D version; this is because it is not necessary to label gender with age groups or vice versa. The first subset contains 12,194 images labeled with gender. The second subset comprises 12,991 images labeled with age. 12,141 images are included in the third subset, which is labeled with both gender and age. Our experiments are run on the third subset. The label information can be seen in Table
The label information of the Adience subset used in our experiments.
Age group | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Total |
---|---|---|---|---|---|---|---|---|---|
Male | 533 | 693 | 736 | 508 | 1635 | 1011 | 333 | 291 | 5740 |
Female | 494 | 910 | 952 | 699 | 1867 | 875 | 296 | 308 | 6401 |
Total | 1027 | 1603 | 1688 | 1207 | 3502 | 1886 | 629 | 599 | 12,141 |
In order to find appropriate parameters for the proposed method, a series of experiments has been conducted. The parameters listed in Table
The parameter settings in our experiments.
Learning algorithm | SGD + momentum |
---|---|
Dropout probability for input/hidden units | 0.75/0.5 |
Initial learning rate | 3 |
Learning rate update rule | |
Initial/final momentum | 0.5/0.99 |
Number of hidden units | 512 |
Number of hidden layers | 3 |
Activation function | ReLU |
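The SGD + momentum update with the momentum ramp listed above (initial 0.5, final 0.99) can be sketched as follows. The linear per-step ramp schedule is an assumption for illustration; the table does not specify the exact schedule.

```python
import numpy as np

def momentum_schedule(step, total_steps, m0=0.5, m1=0.99):
    """Linearly ramp the momentum from its initial to its final value."""
    t = min(step / max(total_steps - 1, 1), 1.0)
    return m0 + (m1 - m0) * t

def sgd_momentum_step(param, grad, velocity, lr, momentum):
    """One classical-momentum SGD update; returns new parameter and velocity."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

Each training step would call `momentum_schedule` for the current momentum and then apply `sgd_momentum_step` to every parameter tensor, with dropout applied to the inputs and hidden units during the forward pass.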
The experiments were run on a PC with an Intel i7 4-core CPU, 16 GB of memory, and an NVIDIA GeForce GTX 1080 GPU (8 GB of memory); training the proposed model takes around 10 hours.
For comparison, we follow the routines in [
The gender classification results on the LFW dataset using different numbers of hidden layers under the patch size 13 × 13.
Methods compared | LDNN [ | LDNN + locations [ | LDNN-F [ | Proposed | Proposed + locations |
---|---|---|---|---|---|
Accuracy (1 hidden layer) | 91.66 | 92.64 | 94.03 | 94.26 | 94.32 |
Accuracy (2 hidden layers) | 95.35 | 95.98 | 94.22 | 94.85 | 94.82 |
Accuracy (3 hidden layers) | 95.81 | 96.04 | 95.64 | 95.53 | 96.02 |
Accuracy (4 hidden layers) | 95.79 | 95.29 | 95.47 | 95.88 | |
In Table
In Table
The effect of using different sizes of image patches is also evaluated. Three hidden layers are used in the network. The compared results can be seen in Table
Performance evaluation of different sizes of image patches.
Patch sizes | | | | |
---|---|---|---|---|
Accuracy (3 hidden layers) | 95.78 | 95.63 | 95.86 | 95.54 |
In our proposed model, besides the landmark-based image patches, the entire image and the holistic feature map extracted from the entire image are also used to further improve model performance. Table
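Combining the "global" networks (entire image and its edge map) with the local patch networks can be sketched as averaging their softmax outputs into one class-probability vector. The equal weighting below is an assumption for illustration; the paper's exact combination rule may differ.

```python
import numpy as np

def combine_outputs(global_probs, patch_probs):
    """Average class probabilities from global networks and local patch networks.

    global_probs: list of (C,) probability vectors (e.g. image net, edge-map net);
    patch_probs:  (P, C) array, one probability vector per landmark patch.
    """
    local = np.mean(np.asarray(patch_probs), axis=0)      # average the patch votes
    stacked = np.vstack(list(global_probs) + [local])     # global votes + local vote
    combined = stacked.mean(axis=0)
    return combined / combined.sum()                      # renormalize to sum to 1
```

The final predicted class is then the arg-max of the combined probability vector.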
Classification results from different network combinations.
Network combinations | Entire image | Landmark patches | Entire image + landmark patches | Combined |
---|---|---|---|---|
Classification accuracy | 92.86 | 95.06 | 95.87 | |
Some state-of-the-art methods that work on the LFW dataset for gender classification are also compared in our experiments. The results are listed in Table
Gender classification results on LFW dataset from different methods.
Methods compared | Accuracy (%) |
---|---|
LDNN | 96.25 |
LDNN-F | 95.64 |
Compact CNN [ | |
LBP + SVM [ | 95.6 |
Gabor + PCA + SVM [ | 94.01 |
Proposed | 96.02 |
The age and gender classification experiments are run on the 3D version of the Adience dataset. We use the same routine as in [
The same parameters listed in Table
Gender classification results on the Adience dataset.
Compared methods | Entire image | LDNN-F [ | Proposed |
---|---|---|---|
Classification accuracy | 77.84 | 78.63 | |
The Adience dataset contains 8 age groups and another 20 different age labels. Some folds even lack images for some age groups; therefore, the age labels must be merged. We use the same merging scheme as used in [
In the same way as in [
For the age estimation, three sets of neural networks are constructed; each contains the model shown in Figure
The age estimation scheme.
It should be noted that the neural network 2 in Figure
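The gender-gated age estimation scheme described here can be sketched as routing each image through the gender network first and then through a gender-specific age network. The three networks are passed in as callables whose internals follow the model described earlier; the function and argument names are illustrative.

```python
import numpy as np

def estimate_age(image, gender_net, male_age_net, female_age_net):
    """Route the image to a gender-specific age network.

    gender_net(image) returns (2,) probabilities [p_male, p_female];
    the age networks return (8,) probabilities over the age groups.
    """
    gender_probs = gender_net(image)
    if np.argmax(gender_probs) == 0:          # predicted male
        age_probs = male_age_net(image)
    else:                                     # predicted female
        age_probs = female_age_net(image)
    return int(np.argmax(age_probs))          # predicted age-group index
```

A misclassified gender thus sends the image to the wrong age network, which is why the gender classifier's accuracy bounds the overall age estimation quality in this scheme.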
Age estimation results from the two neural networks for men and women respectively.
Rate method | Neural network 1 (male's age) | | Neural network 2 (female's age) | |
---|---|---|---|---|
 | Exact | One-off | Exact | One-off |
Entire image | 38.94 | 77.76 | 36.68 | 75.12 |
LDNN-F [ | 39.90 | 80.32 | 41.27 | 77.14 |
Proposed | 41.86 | 81.87 | 42.79 | 78.65 |
Age estimation results from the proposed age estimation model compared with other CNN-based methods.
Methods | LDNN-F | | CNN [ | | Proposed | |
---|---|---|---|---|---|---|
 | Exact | One-off | Exact | One-off | Exact | One-off |
Accuracy | 41.82 | 77.98 | 45.1 ± 2.6 | 79.5 ± 1.4 | | |
A modified version of local deep neural networks is proposed in this paper. Instead of using location-fixed patches, the facial landmarks are detected in advance, and the image patches around the landmarks are then selected for network training, which greatly reduces the training cost. Moreover, the experimental results show that the proposed method achieves competitive performance on the two tested datasets. The performance of the proposed model can still be improved by incorporating other schemes into the current architecture, for example, using a more efficient facial landmark detection method or further optimizing the network structure; these will be investigated in our future work.
The data used to support the findings of this study are available from the corresponding author upon request.
The ownership of Figures
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The project is funded by the Natural Science Foundation of China under Grant no. 61462097 and a Yunnan Provincial Education Department research grant, no. 2018JS143.