Analysis of Facial Images across Age Progression by Humans

The appearance of human faces can vary greatly as people age. The analysis of facial images taken across age progression has recently attracted increasing attention in the computer vision community. Human ability in such analysis is, however, less studied. In this paper, we conduct a thorough study of human ability on two tasks, face verification and age estimation, for facial images taken at different ages. We provide a detailed and rigorous experimental analysis, which helps clarify the roles of different factors, including age group, age gap, race, and gender. In addition, our study leads to an interesting observation: for age estimation, photos of adults are more challenging than those of young people. We expect the study to provide a reference for machine-based solutions.


Introduction
Human faces are important in revealing personal characteristics and in understanding visual data. Face analysis has been studied for several decades in the computer vision community [1,2]. The analysis of facial images across age progression has recently attracted increasing research attention [3] because of its important real-life applications. For example, predicting the facial appearance of missing people and automatically updating ID photos both rely on simulating human face aging. Age estimation can also be applied to age-restricted vending machines [4]. Most recent studies (see Section 2) of age-related facial image analysis focus on three tasks: face verification, age estimation, and age effect simulation. In comparison, it remains unclear how humans perform on these tasks.
In this paper, we study human ability on face verification and age estimation for face photos taken across age progression. Such a study is important in that it not only provides a reference for future machine-based solutions but also provides insight into how different factors (e.g., age gap and gender) affect facial analysis algorithms. There are previous works on human performance in face recognition and age estimation; however, most of them either focus on non-age-related issues such as lighting [5] or are limited by the scale of their image datasets (e.g., [6]). Taking advantage of the recently available MORPH dataset [7], which to the best of our knowledge is the largest publicly available face aging dataset, we are able to conduct thorough human studies on facial analysis tasks.
For face verification, the task is to let a human subject decide whether two photos come from the same person (at different ages). In addition to reporting the overall performance of our human subjects, we analyze the effects of different factors, including age group, age gap, race, and gender. We also compare human performance with a previously reported baseline algorithm. For age estimation, similarly, we report and analyze human performance for the general case as well as for different factors. Compared with a previous study on the FGNet database [8], our study suggests that age estimation is harder for photos of adults than for photos of young people.
The rest of the paper is organized as follows. Section 2 reviews related work on different databases. Section 3 describes the details of the human experiments on face recognition and age estimation. In Section 4, the results are compared with existing results from human experiments and computer algorithms. The conclusion is given in Section 5.

Related Works
Face recognition across age progression and age estimation have been studied widely in recent years. A large number of algorithms have been developed and evaluated on different databases [9]. One of the earliest works, by Lanitis et al. [3], uses a statistical model to capture the variation of facial shapes over age progression. The model is then used for age estimation and face recognition on a database containing 500 face images of 60 subjects. In [10], Ramanathan and Chellappa use a probabilistic eigenspace framework for face recognition. Ling et al. [11] propose gradient orientation pyramid operators derived from multiple resolutions and then use an SVM to perform face recognition experiments. These two algorithms are evaluated on a private passport database. A recent work by Biswas et al. [12] studies feature drifting in face images at different ages and applies it to face-verification tasks. Other studies using age transformation for recognition include [13][14][15][16].
For the age-estimation problem, Fu and Huang [17] construct a low-dimensional manifold from a set of age-separated face images to estimate the ages of faces. The manifold learning approach adopted by Guo et al. [18] estimates age from a low-dimensional representation of faces. Hybrid features have recently been included to further improve estimation accuracy [19]. Other related research on age estimation can be found in [20][21][22][23].
A major issue in research on age-related facial image analysis is the database. For a long time, the FGNet face aging database [8], collected by Lanitis and colleagues, was the only publicly available database dedicated to face aging studies (some other public databases, e.g., the FERET database [24], also contain a limited number of facial photos of the same person taken at different ages, but they are designed for more general facial analysis tasks rather than aging). The FGNet dataset contains about 1000 facial photos of 82 subjects taken at different ages. Since its introduction, the FGNet database has been widely used in face aging analysis [25][26][27]. Recently, Ricanek and Tesafaye introduced the MORPH database [7], which contains cross-age photos of a large number of subjects (see Section 3). Albert and Ricanek [28] implemented a baseline face verification method on the MORPH database using the eigenface algorithm. In our experiment, the MORPH database is used because it involves more subjects and larger age ranges than other public and private databases.
The works most closely related to ours are previous studies of human ability in face recognition (with age progression) and age estimation. In [25], 30 subjects participated with 100 pairs of face images randomly selected from the FGNet database. For each test pair, subjects were asked to tell whether the two images were from the same person and whether the images belonged to the same age group. Human performance was compared with Support Vector Machine classifiers using Mahalanobis distance. The experiments demonstrated that the SVM classifiers could outperform humans. Geng et al. [6] collected human performance data in an age-estimation experiment, in which 29 subjects were asked to estimate the ages of images from the FGNet database. They then proposed a method that learns a representative subspace to model the aging pattern. Their experiments showed promising performance compared with the human results. Compared with previous work, our study is more thorough in several aspects: (1) a much larger database is used, (2) we conduct experiments on both face-verification and age-estimation tasks, and (3) a rigorous statistical analysis of the experimental results is provided.

Data and Subjects.
To conduct a thorough study, we use facial images from the recently available MORPH face aging dataset [7]. The dataset contains two albums; we use the first one (i.e., MORPH Album 1), since a baseline face verification result is provided on it [28]. MORPH Album 1 contains 1690 facial photos of 515 individuals with ages in the range of 15-68 years (see Figure 1 for typical examples). All images are in grayscale. We removed 92 blurred or noisy photos (see, e.g., Figure 2). In addition to ages, the metadata in the MORPH dataset provides detailed information about the subject (e.g., gender and race) and the photographic conditions (e.g., pose, lighting conditions, and image quality).
One interesting difference between the MORPH dataset and the FGNet dataset is that photos in MORPH are mainly of adults, while FGNet is dominated by photos of children. Thus, by comparison with previous work on FGNet, our study on MORPH allows a comparison of performance on photos of children versus photos of adults.

Face-Verification Experiment.
In the face-verification experiment, each participant completes 300 trials; with 31 participants, the total number of trials is 300 × 31 = 9300. In each trial, a pair of face images is randomly selected from the database and presented to the participant. The participant is asked to decide whether the two photos come from the same person. Among the 300 trials, about 30% show pairs from the same person. The user interface of the experiment is shown in Figure 3(a). When a pair of photos is shown, the participant is required to click either the "Same" button, if they think the two photos are of the same person, or the "Diff" button, if they think otherwise. The choice and reaction time of each trial are recorded.
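The trial design described above can be illustrated with a minimal sketch. It is not the software used in the experiment; the function name `make_trials`, the dictionary layout `photos_by_person`, and the fixed random seed are assumptions introduced only for illustration, with the 300 trials and the roughly 30% same-person rate taken from the text.

```python
import random


def make_trials(photos_by_person, n_trials=300, p_same=0.3, seed=0):
    """Draw verification trials as (photo_a, photo_b, is_same) tuples.

    photos_by_person: hypothetical mapping from a person ID to the list
    of that person's photo IDs. About p_same of the trials pair two
    different photos of one person; the rest pair photos of two people.
    """
    rng = random.Random(seed)
    people = list(photos_by_person)
    # Only people with at least two photos can form a same-person pair.
    multi = [p for p in people if len(photos_by_person[p]) >= 2]
    trials = []
    for _ in range(n_trials):
        if rng.random() < p_same:
            person = rng.choice(multi)
            a, b = rng.sample(photos_by_person[person], 2)
            trials.append((a, b, True))
        else:
            p1, p2 = rng.sample(people, 2)
            a = rng.choice(photos_by_person[p1])
            b = rng.choice(photos_by_person[p2])
            trials.append((a, b, False))
    return trials
```

Calling `make_trials` once per participant (with a different seed each time) would reproduce the 300 × 31 = 9300 trial layout; how the original experiment actually balanced the pairs is not specified beyond the 30% figure.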

Age-Estimation Experiment.
In the age-estimation experiment, as in the verification experiment, each participant completes 300 trials, and there are 9300 trials in total. In each trial, a participant is asked to estimate the age of the person in a photo randomly selected from the whole database. The user interface of the experiment is presented in Figure 3(b). Given a photo, the participant is requested to choose among buttons labeled "1" to "80", corresponding to ages 1 to 80, respectively. Note that the actual age range of the MORPH dataset is 15-68. We purposely allow a participant to choose from a larger range to avoid bias.

Face-Verification Performance.
To measure face-verification performance, we report the average face-verification accuracy over all participants. Accuracy is the fraction of trials in which the participant made the correct selection: "Same" for photos from the same person or "Diff" for photos from different persons.
In addition to the performance on the whole dataset, we also report the performance on subgroups of the dataset to study how gender and race affect human accuracy. These subgroups are African American versus European American and male versus female. The results are summarized in Table 1. Overall, from the table, we see that neither gender nor race has a significant effect on the verification task. Several examples that were incorrectly verified during the test are given in Figure 4 (false positives) and Figure 5 (false negatives). A similar human study was reported by Lanitis [25] on the FGNet database, where an overall accuracy of 66.9% was achieved for images in conditions similar to ours (i.e., cropped grayscale images). Given that FGNet contains many more child photos than MORPH does, this observation suggests that verifying a person across child and adult photos is more challenging than verifying across adult photos only.
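As an illustration of how the accuracy and the subgroup accuracies can be computed from recorded trials, consider the following minimal sketch. The record layout (a dictionary with `response`, `is_same`, and subgroup keys such as `race` or `gender`) is a hypothetical format chosen for this example, not the format used in the experiment.

```python
from collections import defaultdict


def is_correct(trial):
    """A response is correct if 'Same' was chosen exactly when the pair matches."""
    return (trial["response"] == "Same") == trial["is_same"]


def accuracy(trials):
    """Fraction of trials answered correctly."""
    return sum(is_correct(t) for t in trials) / len(trials)


def accuracy_by(trials, key):
    """Accuracy within each subgroup, e.g. key='race' or key='gender'."""
    groups = defaultdict(list)
    for t in trials:
        groups[t[key]].append(t)
    return {group: accuracy(ts) for group, ts in groups.items()}
```

Because every participant completes the same number of trials, pooling all trials as above gives the same value as averaging the per-participant accuracies.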

Figure 4: Some typical photos that lead to false-positive input (participants chose "Same" when the pairs actually are not the same). Panels: African female, European female, African male, European male.
Figure 5: Some typical photos that lead to false-negative input (participants chose "Diff" when the pairs actually are the same). Panels: African female, European female, African male, European male.
Table 2: Face-verification accuracy results of human and a baseline algorithm [28].
Table 3: Face-verification accuracy results of human and a baseline algorithm [28] on African American photos.
We also compare the result on MORPH with the baseline machine solution reported in [28]. In that experiment, a standard eigenface algorithm was implemented and achieved an overall accuracy of 38.1%, which is much lower than the human performance (78.8%).
To further study the roles of different factors in verification performance, we follow [28] and divide the data into several groups according to different criteria, including age, age gap, race, and gender. Specifically, we divide the photos into four age groups: younger than 18 (<18), from 18 to 29 (18-29), from 30 to 39 (30-39), and 40 or older (≥40). The age gap is defined as the absolute age difference between two photos of the same person. Age gaps between the first and second photos in our experiment are also divided into four gap groups: less than 6 years (named gap 0-5), from 6 to 10 years (named gap 6-10), from 11 to 15 years (named gap 11-15), and 16 years or more (named gap 16+). The verification accuracies for each group in our experiment are reported in Tables 2-6 to show the effect of different ages and age gaps on accuracy. In particular, Table 2 shows the verification accuracies for each joint age and age gap group. Tables 3-6 show these accuracies for African American photos, European American photos, American male photos, and American female photos, respectively. In addition, the results from [28] are included for comparison wherever available.
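The age and age-gap grouping can be written down compactly. The sketch below assumes that ages are given in whole years and reads the group labels as boundaries (so that "≥40" starts at 40 and "gap 16+" starts at 16); the function names are introduced here for illustration only.

```python
def age_group(age):
    """Map an age in whole years to one of the four age groups."""
    if age < 18:
        return "<18"
    if age <= 29:
        return "18-29"
    if age <= 39:
        return "30-39"
    return ">=40"


def gap_group(age_a, age_b):
    """Map the absolute age difference of two photos to a gap group."""
    gap = abs(age_a - age_b)
    if gap <= 5:
        return "0-5"
    if gap <= 10:
        return "6-10"
    if gap <= 15:
        return "11-15"
    return "16+"
```

For example, a pair of photos taken at ages 20 and 31 has an age gap of 11 years and therefore falls into gap group 11-15.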
In general, our experiment shows that older persons are verified more accurately than younger persons. For example, the accuracy for age group 18-29 is higher than that for group <18 but lower than that for group 30-39. This observation is consistent with previous studies in human and machine face recognition [1].
By comparing Table 3 with Table 4 and Table 5 with Table 6, we observe that race and gender do not affect human face-verification ability. This is consistent with the conclusion drawn from Table 1.

Statistical Analysis.
We now analyze the results statistically. Specifically, we investigate the relationship between user-related information and the photo-related profile, such as the race, gender, and age gap of the photo pairs. The user-related information includes the correctness of the participant's response and the reaction time of the response. A multiway ANOVA test is adopted for this purpose. Because the correctness of the participant's response is the most important measure, it is used as the dependent variable. The independent factors include the mutual effect of the gender and race of the photos, the participant's response time (in seconds), the age gap, and the gap group of the photo pairs. The ANOVA result is shown in Table 7. It shows that P < 0.0001 for the model, which means that the model is statistically significant. The P value of each factor indicates the effect of that independent variable on the dependent variable. Here, the response time has a significant effect on correctness. The mutual effect of race and gender is also a major factor in human performance, and the age factor of the photo is important too. We are also interested in whether the response time is affected by the mutual effect of race and gender. For this, we use one-way ANOVA. In this model, the overall F-value is 12.51, P = 0.0004. The partial F-value of the independent variable (the mutual effect) is 12.51, P = 0.0004. Thus, both the model and the independent variable are statistically significant, which indicates that race and gender do affect the response time.
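For readers unfamiliar with the test, the F-value of a one-way ANOVA can be computed as in the following minimal sketch. It shows the textbook computation only and is not the statistical software used for the reported results; the function name and the example numbers are made up for illustration.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of samples.

    groups: list of lists, one list of measurements (e.g., response
    times in seconds) per level of the factor being tested.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


# Toy example with three groups of three observations each.
f_value, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f_value, df_b, df_w)
```

The P value is obtained by comparing the F-value with the F distribution with (df_between, df_within) degrees of freedom; a statistics package such as SciPy (`scipy.stats.f_oneway`) performs both steps.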

Age-Estimation Performance.
To measure age-estimation performance, we use the same criterion, the mean absolute error (MAE), used in previous studies [3,6,9]. Formally, MAE is defined as
$$\mathrm{MAE} = \frac{1}{n} \sum_{k=1}^{n} \left| \text{Estimated Age}_k - \text{True Age}_k \right|,$$
where Estimated Age_k and True Age_k are the estimated age and the true age, respectively, of the photo used in the kth trial and n is the number of trials. The experimental result is shown in Table 8, which shows that the average MAE is 8.58 years. Some example photos with large estimation errors are shown in Figure 6. For comparison, the table also lists the MAE on the FGNet database [6], in which 30 participants took part in a similar human experiment. In [6], the experiments are conducted on two types of images: cropped grayscale images and uncropped color images. The cropped grayscale images are similar to the images in the MORPH dataset. Table 8 shows that the MAE on the MORPH dataset (our experiment) is larger than those on the FGNet dataset. Given that the photos in FGNet are on average of much younger people than those in MORPH, we conclude that the ages of adults are harder to estimate than the ages of children.
To further investigate the gender and race factors in the age-estimation task, we summarize the MAE for each subgroup in Table 9. It shows that the ages of European Americans are, on average, harder to estimate than the ages of African Americans. Moreover, it shows that it is harder to estimate the ages of people in male photos than in female photos. The significance of these conclusions is statistically verified in the following subsection.
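The MAE criterion is straightforward to compute. The following minimal sketch assumes that the estimated and true ages are available as two equally long lists; the function name and the example values are illustrative and are not data from the experiment.

```python
def mean_absolute_error(estimated_ages, true_ages):
    """Mean absolute difference between estimated and true ages, in years."""
    if len(estimated_ages) != len(true_ages) or not estimated_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(e - t) for e, t in zip(estimated_ages, true_ages))
    return total / len(estimated_ages)


# Three made-up trials with absolute errors of 2, 5, and 0 years.
print(mean_absolute_error([30, 25, 40], [28, 30, 40]))
```

The same function applied to the trials of one subgroup (for example, only photos of female subjects) gives the subgroup MAE.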

Statistical Analysis.
We use a statistical model to analyze the age-estimation results. In the multiway ANOVA model, the dependent variable is the absolute difference between a participant's answer and the actual age of the trial photo. Gender, race, and response time in seconds are the independent variables. The result is reported in Table 10. As shown in the table, gender and race each have a significant effect on the dependent variable, whereas the participant's response time does not. This confirms our observation that gender and race do affect the performance of age estimation.

Summary of Observations.
We summarize the observations from the above experiments, in addition to the performance scores, below.
(i) For humans, face verification is more difficult for young people than for older people.
(ii) Human ability for age estimation is affected by the race and gender of the person in the photo. On average, for age estimation, photos of European Americans are harder than those of African Americans, and photos of male subjects are harder than photos of female subjects.
(iii) For humans, it is relatively easier to estimate the ages of children than the ages of adults. This could be due to the fact that facial profiles usually stay stable after age 18 [29].

Conclusion
In this paper, we describe our study of human ability in facial image analysis for photos across age progression. We conducted two experiments: a face-verification task and an age-estimation task. In addition to reporting the overall performance, we investigated how different factors affect human performance. These factors include the age of the person in a photo, the age gap between the two photos, gender, and race. We expect our study to provide a reference for future studies of related topics.