Cross-Modality 2D-3D Face Recognition via Multiview Smooth Discriminant Analysis Based on ELM

In recent years, 3D face recognition has attracted increasing attention from researchers worldwide. Rather than homogeneous face data, more and more applications now require flexible input face data. In this paper, we propose a new approach for cross-modality 2D-3D face recognition (FR), called Multiview Smooth Discriminant Analysis (MSDA) based on Extreme Learning Machines (ELM). By adding a Laplacian penalty constraint to multiview feature learning, MSDA is first used to extract the cross-modality 2D-3D face features. MSDA aims at finding a common discriminative feature space through multiview learning, so that it can fully utilize the underlying relationship of features from different views. To speed up the learning phase of the classifier, the recently popular Extreme Learning Machine (ELM) algorithm is adopted to train single hidden layer feedforward neural networks (SLFNs). To evaluate the effectiveness of the proposed FR framework, experimental results on a benchmark face recognition dataset are presented. Simulations show that the proposed method generally outperforms several recent approaches with a fast training speed.


Introduction
During the past several decades, face recognition (FR) has gained widespread attention due to its potential application value as well as its theoretical challenges compared with other biometrics [1]. 2D face recognition has achieved good results under certain conditions. However, 2D FR is usually affected by variations in pose, expression, illumination, and occlusion, and its recognition accuracy still needs to be improved in many real-world applications. With the development of 3D scanning and 3D information acquisition technologies, 3D face recognition (3D FR) has been shown to be quite robust to illumination and pose changes while achieving high recognition accuracy [2].
Though current 3D face recognition algorithms perform well under illumination and pose variations, their performance declines in many new applications. For instance, only visible 2D photographs or low-resolution video images of subjects may be available, while the gallery database may consist only of 3D face models. At present, most FR systems and algorithms are not specifically developed for cross-modality 2D-3D face matching, and little work has been presented on 2D-3D feature representation. Furthermore, 3D facial feature extraction methods are generally time-consuming. Their high computational complexity limits their application to high-dimensional face features and large face databases.
To address this problem, a novel approach named Multiview Smooth Discriminant Analysis based on the recent Extreme Learning Machines (ELM) (Figure 1) is presented in this paper for cross-modality 2D-3D FR. In this new approach, Multiview Smooth Discriminant Analysis is first formulated to solve for the multiple view-specific linear projections under a Laplacian smoothing constraint and to learn the common feature space for the cross-modality 2D-3D face features. Then, ELM is utilized as the feature classifier by mapping the extracted features to a high-dimensional vector and treating the classification task as a regression problem.
The rest of this paper is organized as follows. Section 2 reviews related work and Section 3 describes the Multiview Smooth Discriminant Analysis based on ELM method. Section 4 presents experimental results and discussion. The conclusion of this paper is drawn in Section 5.

Related Work
In this section, we briefly review some recent work related to our approach, such as 3D face recognition and Extreme Learning Machines (ELM).
The holistic-feature based methods usually use the whole face region as the input to extract facial features. Typical statistical methods, such as PCA [3] and LDA [4], which are popular 2D FR techniques, have also been extended to 3D FR. PCA is used to extract the intrinsic discriminant feature vectors from 2D intensity images and 3D depth images, respectively. Then, a fusion method is utilized to get the final results [5]. ICP-based 3D face recognition methods [6,7], which utilize the entire facial surface directly as the holistic feature, have been applied to 3D face registration [8], and recognition is performed on the rigid parts separated from the nonrigid parts. However, this kind of 3D FR method fails to consider the local geometry, which contains the intrinsic structure of the 3D data distribution. Thus, these methods are sensitive to expression, illumination, and pose variations.
The local-feature based methods utilize the 3D face appearance or regional geometric features to represent the 3D faces. Shapes, curvatures, and 3D facial landmarks, as well as other feature descriptors, are used as intrinsic 3D local structure to solve FR problems. For instance, Queirolo et al. [9] used a simulated annealing-based approach (SA) and the surface interpenetration measure (SIM) to quantify the differences between two face images. Berretti et al. [10] explored the complete geometrical information of the 3D face model and used iso-geodesic stripes to distinguish facial feature differences. Tang et al. [11,12] proposed an expression-insensitive 3D FR algorithm based on local binary patterns (LBP). Wang and Chua [13] proposed to use an invariant 3D spherical Gabor filter (3D SGF) and the least trimmed square Hausdorff distance (LTS-HD) to handle occlusion problems in 3D FR.
The hybrid methods jointly utilize the holistic and local features for 3D FR [16]. Compared with the other two categories, hybrid methods take advantage of both the 3D spatial information and the global statistical characteristics and thus are demonstrated to be more robust in 3D FR. Spreeuwers [14] proposed to tackle facial expression variations by dividing the face surface into partially overlapping small regions and then applying a decision-level fusion approach to these regions. Ter Haar and Veltkamp [15] performed 3D face matching and evaluation using the profiles and contours of the facial surface. Ming [17] proposed a 3D FR framework which utilizes curvature information and orthogonal spectral regression for efficient 3D discriminant feature extraction. However, most of these hybrid methods utilize two or more feature descriptors and thus have high computational complexity and cost.
Though new schemes have been proposed and have achieved remarkable recognition performance on 3D face recognition, some unsolved problems still affect FR performance. The performance of conventional 3D FR algorithms declines sharply when only 2D images are available as input test data. Besides, due to the expensive equipment, the computational complexity, and the time-consuming 3D face preprocessing, it is difficult to perform online 3D FR in some real applications, such as airports or other security access control. As far as we know, very little work has been done on cross-modality 2D-3D face recognition [18][19][20]. Yang et al. [18] proposed a regularized kernel CCA method to learn the feature differences between 2D photos and 3D depth images. Jelsovka et al. [20] proposed a method for matching 3D range images and 2D-3D face images by using facial curves and CCA. Some similar approaches, such as deep CCA [21] and dictionary learning [22,23], have also been proposed to solve the cross-modality matching problem. CCA and its kernelized variant are the typical approaches for learning a common subspace for two-modality matching. However, CCA only learns a linear mapping by maximizing the total correlation between two views; it ignores the intraview and interview correlations. In other words, it does not take the discriminative information into account. Recently, Kan et al. [24] proposed Multiview Discriminant Analysis (MvDA) for cross-modality matching. Motivated by their work, we propose a Multiview Smooth Discriminant Analysis based on ELM. This new approach first uses a Laplacian smoothing constraint to make the mapped data spatially smooth and then takes advantage of ELM as a highly effective and less time-consuming classifier.

Extreme Learning Machines (ELM).
In this subsection, we briefly review ELM and its applications to pattern classification [25,26]. ELM was recently proposed for efficiently training single-hidden-layer feedforward neural networks (SLFNs). ELM performs classification by mapping data to a high-dimensional vector and turning the classification task into a multioutput functional regression problem [27]. ELM provides better classification performance with a much shorter training time and the least human intervention [26]. In [27], a voting based ELM (V-ELM) was developed to enhance the performance of multiclass classification. By employing multiple independent ELMs and making the final prediction by majority voting, V-ELM achieves a higher classification rate than ELM. Huang et al. [28] extended ELM to least square SVM (LS-SVM) [29] and proximal SVM (PSVM) [30] and provided a unified solution for multiclass classification. Kasun et al. [31] proposed an ELM-based autoencoder (ELM-AE) for big data applications. Simulations on real-world classification databases demonstrate that V-ELM generally outperforms several recent comparable methods with a fast training speed.

The Proposed Cross-Modality 2D-3D FR
In this section, we first introduce the basic idea and formulation of our proposed approach. An illustration of the Multiview Smooth Discriminant Analysis is shown in Figure 2. Then, we explain how to solve the optimization problem. Finally, the proposed 2D-3D FR approach is presented with data processing and ELM classification.
Let $n_i^{(v)}$ be the number of samples of class $i$ from the $v$th view. The projected data in the common space are denoted as $Y = \{y_{ij}^{(v)} \mid i = 1, 2, \ldots, c;\ j = 1, 2, \ldots, n_i^{(v)};\ v = 1, 2, \ldots, V\}$. The Multiview Smooth Discriminant Analysis (MSDA) aims at smoothing the basis vectors of the face data from different views by applying the Laplacian smoothing functional. The objective function of MSDA (1) is defined in terms of the following quantities: $S_B^x$ and $S_W^x$ are the multiview between-class scatter matrix and within-class scatter matrix, respectively; $J(\cdot)$ is the discretized Laplacian regularization functional; and $\alpha$ is the smoothness-controlling parameter, with $0 \leqslant \alpha \leqslant 1$.
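(1) Laplacian Smoothing. The Laplacian operator is defined as follows [32]:
$$\mathcal{L} f(t) = \sum_{j=1}^{d} \frac{\partial^2 f}{\partial t_j^2}, \tag{2}$$
where $f$ is the function defined on the region-of-interest $\Omega \subset \mathbb{R}^d$.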
The Laplacian penalty functional $J$, which measures the smoothness of the function $f$, is defined as
$$J(f) = \int_{\Omega} [\mathcal{L} f]^2 \, dt. \tag{3}$$
Since our method mainly focuses on face images, we adopt the discretized Laplacian smoothing method [33]. Let the $n_1 \times n_2$ face images be represented as vectors in $\mathbb{R}^n$ and let $w_v \in \mathbb{R}^n$ be the basis vectors to be smoothed. For an image, whose region-of-interest $\Omega$ is a two-dimensional lattice, let $h = [h_1, h_2]$, where $h_1 = 1/n_1$ and $h_2 = 1/n_2$, and let the grid points be the two-dimensional vectors $t_i = (t_{i_1}, t_{i_2})$, where $t_{i_j} = (i_j - 0.5) \cdot h_j$, $1 \leqslant i_j \leqslant n_j$, $1 \leqslant j \leqslant 2$. The total number of grid points in the lattice is $n = n_1 \times n_2$. Suppose $u = (u(t_1), \ldots, u(t_{n_j}))$ is an $n_j$-dimensional vector which is a discretized version of the function $u(t)$; then $D_j$, an $n_j \times n_j$ matrix that yields a discrete approximation to the second derivative in (2), has the following property:
$$[D_j u]_i \approx \frac{\partial^2 u(t_i)}{\partial t^2}, \tag{4}$$
where $i = 1, \ldots, n_j$. In our case, we choose $D_j$ to be the modified Neumann discretization [33,34].
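Given $D_j$, a discrete approximation for the 2D Laplacian $\mathcal{L}$ is the $n \times n$ matrix defined as
$$\Lambda = D_1 \otimes I_2 + I_1 \otimes D_2, \tag{5}$$
where $I_j$ is the $n_j \times n_j$ identity matrix for $j = 1, 2$ and $\otimes$ denotes the Kronecker product. It is not difficult to prove that $\|\Lambda \cdot w\|^2$ is directly related to the sum of the squared differences between nearby grid points of $w$, which is an $n_1 \times n_2$ dimensional vector. It follows that (5) measures the smoothness of $w$ on the $n_1 \times n_2$ lattice.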
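The discretized smoothness measure can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in the experiments: it assumes that the modified Neumann discretization is the tridiagonal second-difference matrix whose first and last diagonal entries are set to -1, that the 2D operator is assembled by the usual Kronecker construction, and that images are flattened in row-major order. The function names are hypothetical.

```python
import numpy as np


def second_difference_neumann(n):
    """Return an n x n second-difference matrix with Neumann-style boundary rows.

    Assumed form (n >= 2): tridiagonal (1, -2, 1) rows, with the first and
    last diagonal entries set to -1, scaled by 1 / h**2 where h = 1 / n.
    """
    h = 1.0 / n
    D = -2.0 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    D[0, 0] = -1.0
    D[-1, -1] = -1.0
    return D / h**2


def laplacian_2d(n1, n2):
    """Discrete 2D Laplacian on an n1 x n2 lattice via Kronecker products."""
    D1 = second_difference_neumann(n1)
    D2 = second_difference_neumann(n2)
    return np.kron(D1, np.eye(n2)) + np.kron(np.eye(n1), D2)


def smoothness_penalty(image):
    """Squared norm of the discrete Laplacian applied to a flattened image."""
    n1, n2 = image.shape
    response = laplacian_2d(n1, n2) @ image.reshape(-1)  # row-major flatten
    return float(response @ response)
```

Under these assumptions a constant image has zero penalty, while an image that alternates between neighbouring pixels is penalized heavily, which is the behaviour a smoothness regularizer is expected to show.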
(2) The Algorithm and Solution. In this part, we illustrate the MSDA algorithm and its solution. According to the formulation of MSDA, the between-class and within-class scatter matrices are calculated from the samples of all the $V$ views. Hence, the between-class and within-class scatter matrices in (1) are formulated as block matrices whose blocks are indexed by pairs of views, $(11), (12), \ldots, (1V), (21), (22), \ldots, (VV)$.
Finally, the objective function (1) is optimized by solving a generalized eigenvalue problem and keeping the eigenvectors associated with its leading eigenvalues.
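As a rough illustration of this step, the sketch below solves a symmetric generalized eigenvalue problem with NumPy and keeps the leading eigenvectors. It rests on an assumption that is not confirmed by the text: that the smoothed criterion has the common ratio form in which the between-class scatter is maximized against $(1 - \alpha) S_W + \alpha \Lambda^{T} \Lambda$. The function names, the small ridge term, and the Cholesky-based reduction are choices made for this example only.

```python
import numpy as np


def leading_generalized_eigvecs(A, B, k, ridge=1e-8):
    """Solve A w = lambda B w for symmetric A and positive definite B.

    Returns the k largest eigenvalues and their eigenvectors (as columns).
    A tiny ridge is added to B so that the Cholesky factorization exists
    even when B is only positive semidefinite.
    """
    B = B + ridge * np.eye(B.shape[0])
    L = np.linalg.cholesky(B)            # B = L @ L.T
    L_inv = np.linalg.inv(L)
    C = L_inv @ A @ L_inv.T              # reduced standard symmetric problem
    C = (C + C.T) / 2.0                  # remove round-off asymmetry
    vals, vecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]
    return vals[order], L_inv.T @ vecs[:, order]


def smooth_discriminant_directions(S_b, S_w, lap, alpha, k):
    """Leading directions for the ASSUMED smoothed criterion described above."""
    penalized = (1.0 - alpha) * S_w + alpha * (lap.T @ lap)
    return leading_generalized_eigvecs(S_b, penalized, k)
```

With `alpha = 0` the penalty disappears and the routine reduces to an ordinary discriminant analysis solve; larger values of `alpha` weight spatial smoothness more heavily.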

ELM-Based Classification.
In order to speed up the training phase of the classifier and to obtain a reasonable recognition performance, the recent popular extreme learning machine [25,28] is employed in our FR framework.Based on a SLFN, the ELM classifier utilized in the proposed FR recognition system can be described as follows.
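Assume that the available training feature dataset is $A = \{(\mathbf{x}_i, t_i)\}_{i=1}^{N}$, where $\mathbf{x}_i$, $t_i$, and $N$ represent the feature vector of the $i$th face image, its corresponding category index, and the number of images, respectively. The SLFN with $L$ nodes in the hidden layer can be expressed as
$$\mathbf{o}_i = \sum_{j=1}^{L} \mathbf{w}_j \, g(\mathbf{a}_j, b_j, \mathbf{x}_i), \quad i = 1, 2, \ldots, N, \tag{10}$$
where $\mathbf{o}_i$ is the output obtained by the SLFN for the $i$th input face image, and $\mathbf{a}_j \in \mathbb{R}^d$ and $b_j \in \mathbb{R}$ ($j = 1, 2, \ldots, L$) are the parameters of the $j$th hidden node. The variable $\mathbf{w}_j \in \mathbb{R}^C$ is the link connecting the $j$th hidden node to the output layer and $g(\cdot)$ is the hidden node activation function. With all training samples, (10) can be expressed in the compact form
$$\mathbf{H}\mathbf{W} = \mathbf{O}, \tag{11}$$
where $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_L)$ is the output weight matrix, $\mathbf{O}$ collects the network outputs, and $\mathbf{H}$ is the hidden layer output matrix. Training seeks the parameters for which $\mathbf{O}$ best matches $\mathbf{T} = (\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_N)$, the target output matrix. ELM theory claims that random hidden node parameters can be utilized for SLFNs and that the hidden node parameters need not be tuned. In that case, the system (11) becomes a linear model and the network parameter matrix can be solved analytically by the least-squares method, that is,
$$\mathbf{W} = \mathbf{H}^{\dagger}\mathbf{T},$$
where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the hidden layer output matrix $\mathbf{H}$ [35]. The universal approximation property of the ELM algorithm is also presented in [25].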
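The training procedure described here can be sketched with NumPy as follows. This is a minimal example under stated assumptions, not the classifier used in the experiments: the hidden node is taken to be an additive sigmoid node, the random parameters are drawn uniformly from [-1, 1], targets use a one-against-all encoding with entries 1 and -1, and the names `train_elm` and `predict_elm` are hypothetical.

```python
import numpy as np


def _hidden_output(X, A, b):
    """Hidden layer output matrix H for additive sigmoid nodes (assumed form)."""
    return 1.0 / (1.0 + np.exp(-(X @ A + b)))


def train_elm(X, labels, n_classes, n_hidden, seed=0):
    """Train a single-hidden-layer network in the ELM style.

    Hidden parameters are random and never tuned; only the output weights
    are solved, in closed form, through the Moore-Penrose pseudoinverse.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_samples, n_features = X.shape
    A = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))  # input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # biases
    H = _hidden_output(X, A, b)
    # One-against-all targets: +1 for the true class, -1 elsewhere.
    T = -np.ones((n_samples, n_classes))
    T[np.arange(n_samples), labels] = 1.0
    W = np.linalg.pinv(H) @ T                                 # least-squares solution
    return A, b, W


def predict_elm(model, X):
    """Return the class whose output node responds most strongly."""
    A, b, W = model
    return np.argmax(_hidden_output(X, A, b) @ W, axis=1)
```

Because only a pseudoinverse is computed, training cost is dominated by one matrix decomposition of the hidden layer output, which is the usual explanation for the short training times reported for ELM.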

Experiments
In this section, we investigate the performance of our approach for cross-modality 2D-3D face recognition on FRGC v2.0. The new approach is compared with some state-of-the-art cross-modality learning methods, such as PLS [36], CDFE [37], PCA + CCA [38], and MvDA [39], and with some neural network based methods. The face database and the 3D face preprocessing are described in the following subsection. The experimental results are then reported, together with discussion and analysis.

Database Description and Experimental Setting.
We evaluate our method on the FRGC v2.0 [40] 2D versus 3D face database, which contains 4007 2D photos and 4007 3D faces of 466 persons. The images of FRGC were acquired with a Minolta Vivid 910 scanner, which uses triangulation with a laser stripe projector to build a 3D face model. The FRGC face database consists of frontal views from the shoulders up, with facial expressions and both male and female face models. Some subjects have facial hair, but none of them wears glasses. In FRGC v2.0, 57% of the subjects are male and 43% are female. Our previous work [41] is utilized for 3D data preprocessing, and each 2D photo corresponds to its respective 3D face.
(1) 3D Data Preprocessing [41]. The 3D data preprocessing consists of four main steps: face region detection, nose detection, face smoothing, and the generation of 2D and 3D face images. Firstly, a 3 × 3 Gaussian filter is used to remove spikes and noise, and then the range data are subsampled at a 1 : 4 ratio. The AdaBoost face detection method [42] is applied to the 2D texture image to help extract the 3D facial region. Secondly, we calculate the central stripe to detect the nose region; the nose tip is assumed to lie on the central stripe. ICP is utilized to align the stripe of Person A to the stripe of Person B. Thus, the nose tip lies at the highest point in a cropped sphere. Once the nose tip is confirmed, a region-of-interest, defined by a sphere of radius 90 mm centered at the nose tip, is cropped and used in the following experiments. Figure 3 shows how to find the nose tips from the central stripe.
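Three of these operations (smoothing, subsampling, and cropping around the nose tip) are simple enough to sketch with NumPy. The example is illustrative only and makes several assumptions: the 3 × 3 Gaussian filter is taken to be the binomial kernel, the 1 : 4 ratio is read as keeping every second row and column, and the nose tip is assumed to be already known. Face detection and the ICP-based nose localization are not covered, and all function names are hypothetical.

```python
import numpy as np

# Assumed 3 x 3 Gaussian (binomial) kernel; the exact weights are not given in the text.
GAUSS_3X3 = np.array([[1.0, 2.0, 1.0],
                      [2.0, 4.0, 2.0],
                      [1.0, 2.0, 1.0]]) / 16.0


def smooth_range_image(depth):
    """Apply the 3 x 3 kernel to a range image, replicating edge pixels."""
    rows, cols = depth.shape
    padded = np.pad(depth, 1, mode="edge")
    out = np.zeros((rows, cols), dtype=float)
    for dr in range(3):
        for dc in range(3):
            out += GAUSS_3X3[dr, dc] * padded[dr:dr + rows, dc:dc + cols]
    return out


def subsample_quarter(depth):
    """Keep every second row and column, so one of every four points remains."""
    return depth[::2, ::2]


def crop_face_sphere(points, nose_tip, radius_mm=90.0):
    """Keep the 3D points (N x 3, in millimetres) within a sphere around the nose tip."""
    distances = np.linalg.norm(points - nose_tip, axis=1)
    return points[distances <= radius_mm]
```

A typical call sequence under these assumptions would be `subsample_quarter(smooth_range_image(depth))` on the range image, followed by `crop_face_sphere` on the corresponding point cloud once the nose tip has been located.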
(2) Experimental Setting. In order to evaluate the robustness of our method, we divide the database into two sets, the training set and the test set. We pick out 285 subjects with more than 6 samples and select 5 samples of each person for training and the rest for testing. All 2D photos and 3D range images are scaled, transformed, and cropped in the same way to 100 × 100 pixels according to the eye positions. Cropped examples from the FRGC database are shown in Figure 4.
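The subject selection and split described above can be sketched in plain Python. The sketch assumes that samples are given as a flat list of subject identifiers and that the first five samples of each retained subject, in the listed order, go to the training set; the function name and this ordering rule are assumptions made for illustration.

```python
from collections import defaultdict


def split_by_subject(subject_ids, min_samples=7, n_train=5):
    """Split sample indices into training and test sets, subject by subject.

    Subjects with fewer than `min_samples` samples (that is, not more than 6
    by default) are discarded. For the others, the first `n_train` samples
    are used for training and the remaining ones for testing.
    """
    by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        by_subject[subject].append(index)

    train, test = [], []
    for indices in by_subject.values():
        if len(indices) < min_samples:
            continue
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test
```

The same index lists can be applied to both modalities, since every 2D photo is paired with its 3D range image.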

Experimental Results.
In the following experiments, we use the five front images per person as the training set, and the remaining images are utilized as the testing set. In the testing phase, the 2D photos are utilized as the gallery set and their corresponding 3D range images are used as the probe set.
Firstly, we compare our method with several cross-modality learning methods, with the number of ELM hidden nodes set to 1000. The rank-1 recognition performance with different dimensions is reported in Figures 5 and 6. Figure 5 shows the experimental results of the MSDA based on ELM method compared with some cross-modality learning methods, such as PLS [36], CDFE [37], PCA + CCA [38], and MvDA [39]. Figure 6 compares the proposed method with some feature extraction based methods. Our method achieves the highest recognition rate (96.8%) among the related subspace learning methods. Furthermore, still using 1000 hidden nodes, we compare the new method under different ELM activation functions, namely sigmoid, sine, and hardlim. Figure 7 shows the results. From Figure 7, we can conclude that ELM with the sigmoid activation function achieves the highest recognition rate of the three, and ELM with the sine activation function the lowest.
Secondly, the recognition performance and the training time of the proposed approach are compared with those of BP neural networks [43]. The comparison is carried out with a fixed dimension of 40, and the experimental results are obtained with different numbers of hidden nodes, $L = 100, 200, 500, 1000$. However, because of the huge computational cost of BP neural networks, training is acceptable only when the number of BP hidden nodes is less than 1000. The recognition results and training times are reported in Table 1.
Overall, the rank-1 recognition rate obtained by the new method is 96.8%, which is higher than that obtained using the other compared algorithms. It can be clearly seen that the ELM-based MSDA method achieves better performance while training much faster.
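For reference, the rank-1 recognition rate used throughout this section can be computed from a score matrix as sketched below. The example assumes that each probe receives one score per gallery identity (for instance, the classifier outputs), with larger scores meaning a better match; the function name and the generalization to rank k are additions made for illustration.

```python
import numpy as np


def rank_k_recognition_rate(scores, true_labels, k=1):
    """Fraction of probes whose true identity is among the k best-scoring ones.

    scores: (n_probes, n_identities) array, larger means more similar.
    true_labels: length n_probes array of identity indices.
    """
    scores = np.asarray(scores, dtype=float)
    true_labels = np.asarray(true_labels)
    top_k = np.argsort(-scores, axis=1)[:, :k]          # best identities first
    hits = (top_k == true_labels[:, None]).any(axis=1)  # true identity found?
    return float(hits.mean())
```

With `k=1` this is the rank-1 rate; multiplying the returned fraction by 100 gives the percentage form used in the reported results.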

Conclusion
In this paper, a Multiview Smooth Discriminant Analysis based on ELM method is proposed for cross-modality 2D-3D face recognition. 2D-3D face recognition is a feasible alternative to traditional 3D FR systems, since it is much more convenient to acquire a 2D image than to build a 3D face model. In this new approach, Multiview Smooth Discriminant Analysis (MSDA) is first performed to obtain the cross-modality face features by using the Laplacian smoothing functional. The Laplacian penalty functional takes the spatial relationships within the image into account at the feature level and therefore yields a much smoother linear projection subspace than subspace learning without smoothing. Furthermore, ELM, which reduces the computational cost, is utilized for face feature classification. Experimental results show that the proposed method consistently outperforms the other cross-modality matching algorithms and achieves good recognition performance in both accuracy and speed.

Figure 1: Intuitive explanation of our Multiview Smooth Discriminant Analysis based Extreme Learning Machines (ELM) approach. Multiview Smooth Discriminant Analysis is first used to obtain a more discriminant face representation; then ELM is utilized for pattern classification.

Figure 3: Illustration of the ICP based nose-tip finding approach. (a) shows the two 3D face models, and (b) is the ICP based matching result.

Figure 5: Comparison of the recognition rates with PLS, CDFE, PCA + CCA, and MvDA under different dimensions on the FRGC database.

Figure 6: Comparison of the recognition rates with feature extraction methods under different dimensions on the FRGC database.

Figure 7: Comparison of the recognition rates with different ELM activation functions.
Figure 2: The principle of Multiview Smooth Discriminant Analysis (MSDA). The goal of MSDA is to build a smooth common discriminative space into which the face samples from different views are projected. The different classes of samples are distributed in different views, and MSDA projects the samples into a common subspace to bring the samples closer. Here, images with different labels are denoted by different shapes, such as triangles, circles, and squares.
In the compact form (11), $\mathbf{W} = (\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_L)$ and $\mathbf{O}$ are the output weight matrix and the network outputs, respectively. The variable $\mathbf{H}$ denotes the hidden layer output matrix with the entry $H_{ij} = g(\mathbf{a}_j, b_j, \mathbf{x}_i)$. To perform multiclass classification, the ELM classifier generally utilizes the one-against-all (OAA) method to transform the classification application into a multioutput model regression problem. That is, for a $C$-category classification application, the output label $t_i$ of the face image feature $\mathbf{x}_i$ is encoded as a $C$-dimensional vector $\mathbf{t}_i = (t_{i1}, t_{i2}, \ldots, t_{iC})^{T}$ with $t_{ic} \in \{1, -1\}$ ($c = 1, 2, \ldots, C$). If the category index of the face image $\mathbf{x}_i$ is $c$, then $t_{ic}$ is set to 1 while the remaining entries of $\mathbf{t}_i$ are set to $-1$. Hence, the objective of the training phase for the SLFN in (10) becomes finding the best network parameter set $\Delta = \{(\mathbf{a}_j, b_j, \mathbf{w}_j)\}_{j=1,\ldots,L}$ such that the error cost function $\|\mathbf{H}\mathbf{W} - \mathbf{T}\|$ is minimized.

Table 1: Performance comparison of ELM and BP in terms of rank-1 recognition rate (%) and training time (s).