Block-Wise Two-Dimensional Maximum Margin Criterion for Face Recognition

Maximum margin criterion (MMC) is a well-known method for feature extraction and dimensionality reduction. However, MMC is based on vector data and fails to exploit local characteristics of image data. In this paper, we propose a two-dimensional generalized framework based on a block-wise approach for MMC, to deal with matrix representation data, that is, images. The proposed method, namely, block-wise two-dimensional maximum margin criterion (B2D-MMC), aims to find local subspace projections using unilateral matrix multiplication in each block set, such that in the subspace a block is close to those belonging to the same class but far from those belonging to different classes. B2D-MMC avoids iterations and alternations as in current bilateral projection based two-dimensional feature extraction techniques by seeking a closed form solution of one-side projection matrix for each block set. Theoretical analysis and experiments on benchmark face databases illustrate that the proposed method is effective and efficient.


Introduction
Most well-known appearance-based face recognition methods rely on subspace techniques for feature extraction, such as principal component analysis (PCA) [1], linear discriminant analysis (LDA) [2], and maximum margin criterion (MMC) [3]. These conventional appearance-based techniques are built on the so-called vector-space model. Under this model, the original two-dimensional (2D) image data are reshaped into a one-dimensional (1D) long vector by stacking either the rows or the columns of the image. The vector-space model allows pattern recognition and analysis techniques to be applied conveniently to the image domain, and numerous successes have been achieved. However, it also introduces the following problems in practical applications. First, the intrinsic 2D structure of the image matrix is removed. As a result, the spatial information stored in the 2D image is discarded and not effectively utilized for representation and recognition. Second, each image sample is modeled as a point in a high-dimensional space; for example, for an image of size 112 × 92, a commonly used image size in face recognition, the dimension of the vector space is 10304, and the size of the scatter matrices is 10304 × 10304. A large number of training samples is therefore needed to obtain a reliable and robust estimate of the data statistics. This problem, known as the curse of dimensionality, is often encountered in real applications. Third, usually only a very limited number of samples is available in real applications, so the small sample size (SSS) problem [4] arises frequently in practice.
To overcome the above drawbacks, efforts have been made to extract features directly without vectorizing the image samples; that is, the representation of an image sample is retained in matrix form [5]. With this consideration, some bilateral projection based 2D feature extraction techniques have been proposed that seek transforms on both sides of the image matrix, such as GLRAM (generalized low-rank approximation of matrices) [6], which can be seen as a kind of two-dimensional PCA, and 2DLDA (two-dimensional LDA) [7], which can implicitly resolve the SSS problem suffered by LDA. These 2D methods are more computationally efficient than their respective 1D counterparts. Moreover, GLRAM and 2DLDA have been shown empirically to be more effective than PCA and LDA, respectively [6,7], because they preserve the intrinsic spatial information of the data matrix.
Furthermore, two-dimensional MMC (2DMMC) has been proposed [8], which aims to find two orthogonal projection matrices to project the original image matrices to a low-dimensional matrix subspace. In the projected subspace, a sample is close to those in the same class but far from those in different classes. Both theoretical analysis and experiments on benchmark face recognition datasets illustrate that 2DMMC is more effective and more efficient than GLRAM and 2DLDA. However, like GLRAM and 2DLDA, the 2DMMC algorithm involves iterations and alternations to compute the two-side projection matrices, which are time-consuming, and an arbitrary initial value before the iterations cannot guarantee the global optimum.
In this paper, we propose a novel framework for 2D generalization of conventional MMC to extract discriminating features directly from 2D face images. The proposed algorithm, namely, block-wise two-dimensional maximum margin criterion (B2D-MMC), aims to find local subspace projections by obtaining a one-side projection matrix in each block set, such that in the subspace a block is close to those belonging to the same class but far from those belonging to different classes. B2D-MMC adopts the block-wise dividing method for face images used in [9], which has been proven to be reliable. Based on one-side projection and block-wise learning, B2D-MMC avoids the iterative, alternating computation of two projection matrices required by GLRAM, 2DLDA, and 2DMMC, and is better able to learn local characteristics of images.
The rest of this paper is organized as follows. Section 2 provides background information on 2DMMC. In Section 3, our Block-wise Two Dimensional Maximum Margin Criterion is proposed. The experiments on standard face recognition datasets are demonstrated in Section 4. Finally, we draw our conclusions in Section 5.

Review on MMC and 2DMMC
2.1. LDA and MMC. The most popular unsupervised feature extraction method is principal component analysis (PCA). It aims to find a subspace in which the variance of the projected data is a maximum. But PCA does not take into account the class information, so the features extracted are not very suitable for classification [2]. Linear discriminant analysis (LDA) is a well-known supervised method which has been shown to be more effective than PCA in face recognition tasks [2].
As supervised feature extraction methods, MMC and LDA share the notations of between-class scatter matrix and within-class scatter matrix as follows.
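Given a set of $N$ sample images $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N\}$ taking values in $n$-dimensional vector form, each belonging to one of $C$ classes, assume that the $i$th class contains $N_i$ sample vectors $\mathbf{x}^{(i)}_j$, $j = 1, 2, \ldots, N_i$. The mean vector of the $i$th class and that of the sample set are, respectively, given by

$$\mathbf{m}_i = \frac{1}{N_i} \sum_{j=1}^{N_i} \mathbf{x}^{(i)}_j, \qquad \mathbf{m} = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{N_i} \mathbf{x}^{(i)}_j. \tag{1}$$

The between-class scatter matrix $S_b$ and within-class scatter matrix $S_w$ are, respectively, defined as

$$S_b = \sum_{i=1}^{C} N_i (\mathbf{m}_i - \mathbf{m})(\mathbf{m}_i - \mathbf{m})^T, \qquad S_w = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (\mathbf{x}^{(i)}_j - \mathbf{m}_i)(\mathbf{x}^{(i)}_j - \mathbf{m}_i)^T. \tag{2}$$

LDA is based on the Fisher criterion, which aims to maximize the between-class distance and minimize the within-class distance as follows:

$$W_{\mathrm{LDA}} = \arg\max_{W} \frac{|W^T S_b W|}{|W^T S_w W|} = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d], \tag{3}$$

where $|\cdot|$ denotes the determinant of a matrix and $\mathbf{w}_k$ is the generalized eigenvector of $S_b$ and $S_w$ corresponding to the $k$th largest generalized eigenvalue $\lambda_k$, that is,

$$S_b \mathbf{w}_k = \lambda_k S_w \mathbf{w}_k, \qquad k = 1, 2, \ldots, d. \tag{4}$$

If $S_w$ is nonsingular, the solution can be obtained by applying an eigendecomposition to the matrix $S_w^{-1} S_b$. However, in face recognition applications, where the number of training images $N$ is generally much smaller than the number of pixels $n$ in each image, one is confronted with the difficulty that the within-class scatter matrix $S_w$ is always singular [2], since the rank of $S_w$ is at most $N - C$. This is the so-called small sample size (SSS) problem from which the LDA method suffers.

As an efficient and robust alternative to LDA, the maximum margin criterion (MMC) [3] is defined as

$$\max_{W} \; \operatorname{tr}\!\left(W^T (S_b - \alpha S_w) W\right) \quad \text{subject to} \quad W^T W = I, \tag{5}$$

where $\operatorname{tr}(\cdot)$ denotes the matrix trace and $\alpha$ is a weighted parameter, which is set to 1 in [3]. MMC finds the optimal projection matrix $W = [\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_d]$, which is composed of the eigenvectors corresponding to the $d$ largest eigenvalues of $S_b - \alpha S_w$. The constraint $W^T W = I$ allows MMC to avoid calculating the inverse of $S_w$ and thus to elude the potential SSS problem.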
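The MMC projection described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `mmc_projection`, the one-sample-per-row layout, the unnormalized scatter matrices, and the random toy data are all assumptions made for the example.

```python
import numpy as np

def mmc_projection(X, y, d, alpha=1.0):
    """Return an (n, d) matrix W with orthonormal columns for MMC.

    X: (N, n) array with one vectorized sample per row (assumed layout).
    y: (N,) array of class labels.
    """
    n = X.shape[1]
    m = X.mean(axis=0)                      # mean of the whole sample set
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)                # class mean
        diff = (mc - m)[:, None]
        S_b += Xc.shape[0] * (diff @ diff.T)
        centered = Xc - mc
        S_w += centered.T @ centered
    # S_b - alpha * S_w is symmetric, so eigh applies; it returns
    # eigenvalues in ascending order, hence the column reversal.
    _, vecs = np.linalg.eigh(S_b - alpha * S_w)
    return vecs[:, ::-1][:, :d]

# Toy data with fewer samples than features (N = 12 < n = 50).
rng = np.random.default_rng(0)
X = rng.normal(size=(12, 50))
y = np.repeat(np.arange(3), 4)
W = mmc_projection(X, y, d=2)
```

No matrix inverse appears anywhere in the sketch, which is why it still runs when the within-class scatter matrix is singular.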

2DLDA and 2DMMC
2DLDA [7] and 2DMMC [8] consider data with matrix representation and share the notations of between-class scatter and within-class scatter as follows.
Let $X^{(i)}_j \in \mathbb{R}^{m \times n}$, $j = 1, 2, \ldots, N_i$, be the images in the sample set belonging to the $i$th class, $i = 1, 2, \ldots, C$ ($N = \sum_{i=1}^{C} N_i$). Both 2DLDA and 2DMMC aim to find two orthogonal projection matrices, $U \in \mathbb{R}^{m \times l_1}$ and $V \in \mathbb{R}^{n \times l_2}$, that map each image matrix $X \in \mathbb{R}^{m \times n}$ to $Y \in \mathbb{R}^{l_1 \times l_2}$, such that $Y = U^T X V$. The mean matrix of the $i$th class and that of the sample set are, respectively, given by

$$M_i = \frac{1}{N_i} \sum_{j=1}^{N_i} X^{(i)}_j, \qquad M = \frac{1}{N} \sum_{i=1}^{C} \sum_{j=1}^{N_i} X^{(i)}_j.$$

In the low-dimensional matrix space resulting from the linear transformations $U$ and $V$, the between-class scatter $\tilde{S}_b$ and within-class scatter $\tilde{S}_w$ are, respectively, defined as

$$\tilde{S}_b = \operatorname{tr}\!\left(\sum_{i=1}^{C} N_i \, U^T (M_i - M) V V^T (M_i - M)^T U\right), \qquad \tilde{S}_w = \operatorname{tr}\!\left(\sum_{i=1}^{C} \sum_{j=1}^{N_i} U^T (X^{(i)}_j - M_i) V V^T (X^{(i)}_j - M_i)^T U\right).$$

For both 2DLDA and 2DMMC, the optimal transformations $U$ and $V$ would maximize $\tilde{S}_b$ and minimize $\tilde{S}_w$.
2DLDA proposed in [7] can be formulated as

$$\max_{U, V} \; \frac{\tilde{S}_b}{\tilde{S}_w}. \tag{8}$$

The optimization (8) is with respect to $U$ and $V$, and a closed form solution cannot be obtained. 2DMMC is defined in [8] as

$$\max_{U^T U = I,\; V^T V = I} \; \tilde{S}_b - \alpha \tilde{S}_w, \tag{9}$$

where $\alpha$ is a weighted parameter. Again, a closed form solution cannot be obtained because of the bilateral unknown projections. Owing to the difficulty of computing the optimal $U$ and $V$ simultaneously, 2DLDA and 2DMMC both use iterative alternating schemes: in each iteration, they first optimize the objective with respect to $U$ with $V$ fixed ($V$ is initialized as any orthogonal matrix before the iterations) and then optimize the objective with respect to $V$ with $U$ fixed. The alternating computation framework in each iteration is reviewed below.
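Computation of U. For a fixed $V$, $\tilde{S}_b$ and $\tilde{S}_w$ can be rewritten as

$$\tilde{S}_b = \operatorname{tr}(U^T S_b^V U), \qquad \tilde{S}_w = \operatorname{tr}(U^T S_w^V U),$$

where

$$S_b^V = \sum_{i=1}^{C} N_i (M_i - M) V V^T (M_i - M)^T, \qquad S_w^V = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (X^{(i)}_j - M_i) V V^T (X^{(i)}_j - M_i)^T.$$

For 2DLDA, similar to the optimization problem in (3), the optimal $U$ can be obtained by computing an eigendecomposition of $(S_w^V)^{-1} S_b^V$; that is, $U$ is composed of the $l_1$ eigenvectors corresponding to the largest $l_1$ eigenvalues of $(S_w^V)^{-1} S_b^V$.
For 2DMMC, similar to the optimization problem in (5), the optimal $U$ can be obtained by computing an eigendecomposition of $S_b^V - \alpha S_w^V$; that is, $U$ is composed of the $l_1$ eigenvectors corresponding to the largest $l_1$ eigenvalues of $S_b^V - \alpha S_w^V$.
Computation of V. From the property $\operatorname{tr}(AA^T) = \operatorname{tr}(A^T A)$ for any matrix $A$, a key observation is that, when $U$ is fixed, $\tilde{S}_b$ and $\tilde{S}_w$ can be rewritten as

$$\tilde{S}_b = \operatorname{tr}(V^T S_b^U V), \qquad \tilde{S}_w = \operatorname{tr}(V^T S_w^U V),$$

where

$$S_b^U = \sum_{i=1}^{C} N_i (M_i - M)^T U U^T (M_i - M), \qquad S_w^U = \sum_{i=1}^{C} \sum_{j=1}^{N_i} (X^{(i)}_j - M_i)^T U U^T (X^{(i)}_j - M_i).$$

For 2DLDA, similar to the optimization problem in (3), the optimal $V$ can be obtained by computing an eigendecomposition of $(S_w^U)^{-1} S_b^U$; that is, $V$ is composed of the $l_2$ eigenvectors corresponding to the largest $l_2$ eigenvalues of $(S_w^U)^{-1} S_b^U$.
For 2DMMC, similar to the optimization problem in (5), the optimal $V$ can be obtained by computing an eigendecomposition of $S_b^U - \alpha S_w^U$; that is, $V$ is composed of the $l_2$ eigenvectors corresponding to the largest $l_2$ eigenvalues of $S_b^U - \alpha S_w^U$.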
In contrast to 2DLDA, 2DMMC has the following advantages, which make it stable and efficient: (1) the objective of (9) increases monotonically through the iterations, so the convergence of 2DMMC is rigorously guaranteed [8]; (2) 2DMMC avoids computing inverse matrices in each iteration.
However, as bilateral projection based 2D feature extraction techniques, 2DMMC, 2DLDA, and GLRAM share the following shortcomings: the iterations and alternations are time-consuming, and an arbitrary initial value of V cannot guarantee the global optimum.
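To make the alternating scheme reviewed above concrete, the following NumPy sketch iterates the two 2DMMC updates for a fixed number of rounds and records the objective after each round. It is a hedged illustration rather than the algorithm of [8]: the function names, the identity-based initialization of V, the fixed iteration count `n_iter`, and the toy data are assumptions.

```python
import numpy as np

def top_eigvecs(S, k):
    """Eigenvectors of a symmetric matrix S for its k largest eigenvalues."""
    _, vecs = np.linalg.eigh(S)          # eigenvalues come back ascending
    return vecs[:, ::-1][:, :k]

def class_diffs(images, labels):
    """Weighted between-class and within-class difference matrices."""
    M = images.mean(axis=0)
    between, within = [], []
    for c in np.unique(labels):
        Xc = images[labels == c]
        Mc = Xc.mean(axis=0)
        between.append(np.sqrt(len(Xc)) * (Mc - M))
        within.extend(X - Mc for X in Xc)
    return between, within

def objective(between, within, U, V, alpha):
    """Projected between-class scatter minus alpha times within-class scatter."""
    sb = sum(np.linalg.norm(U.T @ B @ V) ** 2 for B in between)
    sw = sum(np.linalg.norm(U.T @ D @ V) ** 2 for D in within)
    return sb - alpha * sw

def two_sided_mmc(images, labels, l1, l2, alpha=1.0, n_iter=5):
    """images: (N, m, n) array. Returns U (m, l1), V (n, l2), objective history."""
    between, within = class_diffs(images, labels)
    n = images.shape[2]
    V = np.eye(n)[:, :l2]                # one arbitrary orthogonal start (assumed)
    history = []
    for _ in range(n_iter):
        # Fix V and update U from the m x m matrix S_b^V - alpha * S_w^V.
        S = (sum((B @ V) @ (B @ V).T for B in between)
             - alpha * sum((D @ V) @ (D @ V).T for D in within))
        U = top_eigvecs(S, l1)
        # Fix U and update V from the n x n matrix S_b^U - alpha * S_w^U.
        S = (sum((U.T @ B).T @ (U.T @ B) for B in between)
             - alpha * sum((U.T @ D).T @ (U.T @ D) for D in within))
        V = top_eigvecs(S, l2)
        history.append(objective(between, within, U, V, alpha))
    return U, V, history

rng = np.random.default_rng(0)
images = rng.normal(size=(12, 8, 6))
labels = np.repeat(np.arange(3), 4)
U, V, history = two_sided_mmc(images, labels, l1=3, l2=2)
```

Each round costs two eigendecompositions, and the result depends on the starting V, which is the cost and the sensitivity that the text attributes to bilateral methods.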

Proposed Framework
Bilateral projection based 2D feature extraction techniques, such as 2DMMC, 2DLDA, and GLRAM, seek transforms on both sides of the image matrices; that is, both left and right projections are taken. However, the computation of the two-side projection matrices involves time-consuming iterations and alternations, and the initialization before the iterations may lead to a local optimum. To overcome the foregoing shortcomings, we propose a framework that takes only the right multiplication of each block to extract the inter-row spatial information. Our block-wise approach to face recognition, namely, block-wise two-dimensional maximum margin criterion (B2D-MMC), is described as follows.

Block-Wise Model for Face Recognition.
Since we deal with images cropped either manually or by a face detection procedure, our block-wise model divides the face image into nonoverlapping groups of rows, which are called image blocks. Let $X \in \mathbb{R}^{m \times n}$ denote a face image, where $m$ and $n$ are the numbers of rows and columns of $X$, respectively. $X$ is divided into $p$ nonoverlapping image blocks $X(i) \in \mathbb{R}^{r \times n}$, $i = 1, 2, \ldots, p$, each including $r$ rows of image $X$. Figure 1 shows an example of image blocks. In the example, images of the first subject from the ORL database, which have the size of 112 × 92, are partitioned into four blocks of size 28 × 92, that is, $m = 112$, $n = 92$, $p = 4$, and $r = 28$.
For all sample images, the set of $i$th image blocks is referred to as the $i$th block set $BS_i$, which spans a subspace referred to as the $i$th block manifold, $i = 1, 2, \ldots, p$. The advocated B2D-MMC algorithm attempts to find a local subspace projection, that is, a unilateral projection matrix, in each block set.
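The row-block partition described above amounts to a single reshape in NumPy. The sketch below reproduces the 112 × 92 example with four blocks of 28 rows; the function name `split_into_blocks` and the synthetic image are illustrative assumptions, and the sketch assumes the number of blocks divides the number of rows.

```python
import numpy as np

def split_into_blocks(image, p):
    """Split an (m, n) image into p non-overlapping row blocks of shape (m // p, n)."""
    m, n = image.shape
    if m % p != 0:
        raise ValueError("this sketch assumes p divides the number of rows")
    # Consecutive groups of m // p rows become the blocks X(1), ..., X(p).
    return image.reshape(p, m // p, n)

# Synthetic stand-in for a 112 x 92 face image.
image = np.arange(112 * 92, dtype=float).reshape(112, 92)
blocks = split_into_blocks(image, p=4)
```

Collecting `blocks[i]` over all sample images gives the block set with index i + 1 in the notation of the text.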

B2D-MMC.
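Considering a $C$-class problem, the $k$th class contains $N_k$ training image matrices $X^{(k)}_j \in \mathbb{R}^{m \times n}$, $j = 1, 2, \ldots, N_k$, where $X^{(k)}_j$ is the $j$th training image in class $k$, $k = 1, 2, \ldots, C$, and $m$, $n$ are the numbers of rows and columns of the face images, respectively. Let $N$ be the total number of training images, that is, $N = \sum_{k=1}^{C} N_k$.
As determined in Section 3.1, each image $X^{(k)}_j$ consists of $p$ blocks, each block including $r$ rows of the face image. Denoting the $i$th image block of $X^{(k)}_j$ as $X^{(k)}_j(i) \in \mathbb{R}^{r \times n}$, $i = 1, 2, \ldots, p$, we have

$$X^{(k)}_j = \begin{bmatrix} X^{(k)}_j(1) \\ X^{(k)}_j(2) \\ \vdots \\ X^{(k)}_j(p) \end{bmatrix}.$$

Thus the $i$th block set $BS_i$ can be formulated as

$$BS_i = \left\{ X^{(k)}_j(i) : j = 1, 2, \ldots, N_k,\; k = 1, 2, \ldots, C \right\}.$$

Also let $X^{(k)}_j(i, t) \in \mathbb{R}^{1 \times n}$ be the $t$th row of $X^{(k)}_j(i)$, $t = 1, 2, \ldots, r$. Then we can write

$$X^{(k)}_j(i) = \begin{bmatrix} X^{(k)}_j(i, 1) \\ X^{(k)}_j(i, 2) \\ \vdots \\ X^{(k)}_j(i, r) \end{bmatrix}.$$

For all training image matrices, the proposed B2D-MMC aims to find $p$ orthogonal right-side projection matrices, one for each image block set; that is, given a desired dimensionality $d$, it finds $V(i) \in \mathbb{R}^{n \times d}$ for the $i$th block set $BS_i$, mapping the $i$th image block $X^{(k)}_j(i)$ to

$$Y^{(k)}_j(i) = X^{(k)}_j(i) \, V(i) \in \mathbb{R}^{r \times d}.$$

We use the following $Y^{(k)}_j \in \mathbb{R}^{m \times d}$ as the feature of image $X^{(k)}_j$ for training:

$$Y^{(k)}_j = \begin{bmatrix} Y^{(k)}_j(1) \\ Y^{(k)}_j(2) \\ \vdots \\ Y^{(k)}_j(p) \end{bmatrix}. \tag{18}$$

For classification, the features of test images are formed by stacking subfeatures in the same way.
The following shows how to find the projection matrices $V(i)$, $i = 1, 2, \ldots, p$. Let $M^{(k)}(i) \in \mathbb{R}^{r \times n}$ and $M(i) \in \mathbb{R}^{r \times n}$ denote the mean of the $i$th image blocks in the $k$th class and the mean of the $i$th block set $BS_i$, respectively, as follows:

$$M^{(k)}(i) = \frac{1}{N_k} \sum_{j=1}^{N_k} X^{(k)}_j(i), \qquad M(i) = \frac{1}{N} \sum_{k=1}^{C} \sum_{j=1}^{N_k} X^{(k)}_j(i).$$

Also let $\mathbf{m}^{(k)}(i, t) \in \mathbb{R}^{1 \times n}$ and $\mathbf{m}(i, t) \in \mathbb{R}^{1 \times n}$ be the $t$th row of $M^{(k)}(i)$ and $M(i)$, respectively, $t = 1, 2, \ldots, r$. Then we have

$$M^{(k)}(i) = \begin{bmatrix} \mathbf{m}^{(k)}(i, 1) \\ \vdots \\ \mathbf{m}^{(k)}(i, r) \end{bmatrix}, \qquad M(i) = \begin{bmatrix} \mathbf{m}(i, 1) \\ \vdots \\ \mathbf{m}(i, r) \end{bmatrix}.$$

Let us define the between-class block scatter matrix $S_b(i)$ and within-class block scatter matrix $S_w(i)$ of the $i$th block set $BS_i$, respectively, as follows:

$$S_b(i) = \sum_{k=1}^{C} N_k \sum_{t=1}^{r} \left(\mathbf{m}^{(k)}(i, t) - \mathbf{m}(i, t)\right)^T \left(\mathbf{m}^{(k)}(i, t) - \mathbf{m}(i, t)\right) = \sum_{k=1}^{C} N_k \left(M^{(k)}(i) - M(i)\right)^T \left(M^{(k)}(i) - M(i)\right),$$

$$S_w(i) = \sum_{k=1}^{C} \sum_{j=1}^{N_k} \sum_{t=1}^{r} \left(X^{(k)}_j(i, t) - \mathbf{m}^{(k)}(i, t)\right)^T \left(X^{(k)}_j(i, t) - \mathbf{m}^{(k)}(i, t)\right) = \sum_{k=1}^{C} \sum_{j=1}^{N_k} \left(X^{(k)}_j(i) - M^{(k)}(i)\right)^T \left(X^{(k)}_j(i) - M^{(k)}(i)\right),$$

$i = 1, 2, \ldots, p$. It is easy to verify from their definitions that $S_b(i)$ and $S_w(i)$ are two $n \times n$ nonnegative definite matrices.
In the low-dimensional space resulting from the $i$th linear transformation $V(i)$, as in 2DMMC [8], we adopt the Frobenius norm $\|\cdot\|_F$ [10] as the metric of matrices, that is, $\|A\|_F^2 = \operatorname{tr}(AA^T) = \operatorname{tr}(A^T A)$ for any matrix $A$. Under this metric, the projected between-class block scatter $\tilde{S}_b(i)$ and projected within-class block scatter $\tilde{S}_w(i)$ can be, respectively, defined as follows:

$$\tilde{S}_b(i) = \sum_{k=1}^{C} N_k \left\| \left(M^{(k)}(i) - M(i)\right) V(i) \right\|_F^2 = \operatorname{tr}\!\left(V(i)^T S_b(i) V(i)\right),$$

$$\tilde{S}_w(i) = \sum_{k=1}^{C} \sum_{j=1}^{N_k} \left\| \left(X^{(k)}_j(i) - M^{(k)}(i)\right) V(i) \right\|_F^2 = \operatorname{tr}\!\left(V(i)^T S_w(i) V(i)\right).$$

The proposed B2D-MMC finds the orthogonal projection matrix $V(i)$ for the $i$th block set $BS_i$ by the following optimization:

$$\max_{V(i)^T V(i) = I} \; \tilde{S}_b(i) - \alpha \tilde{S}_w(i) = \operatorname{tr}\!\left(V(i)^T \left(S_b(i) - \alpha S_w(i)\right) V(i)\right),$$

where $\alpha$ is a weighted parameter, $i = 1, 2, \ldots, p$.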
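As a short worked step (a sketch added for illustration, using assumed symbols: $S_b(i)$ and $S_w(i)$ for the $n \times n$ block scatter matrices, $\alpha$ for the weighted parameter, and $d$ for the target dimensionality), the closed form solution follows from a standard eigenvalue argument. Because $S_b(i) - \alpha S_w(i)$ is symmetric, it has real eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ with orthonormal eigenvectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$, and Ky Fan's maximum principle gives

```latex
\max_{V^{T} V = I_d} \operatorname{tr}\!\left( V^{T} \bigl( S_b(i) - \alpha S_w(i) \bigr) V \right)
  = \sum_{t=1}^{d} \lambda_t ,
\qquad \text{attained at } V(i) = [\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_d].
```

Each block set therefore needs only one eigendecomposition and no alternation between two unknown matrices.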
(a) Computational Complexity. B2D-MMC seeks a closed form solution of a unilateral projection matrix for each block set instead of iteratively solving for two projection matrices for the entire image matrix. It thereby avoids the iterations and alternations of 2DMMC, which reduces the computational effort.
(b) Locality. Based on the block-wise model, B2D-MMC learns local characteristics of the input image by dividing the face image into non-overlapping image blocks. The distribution of the data is expected to be much less complex inside these block manifolds.

Algorithm Design.
Based on the analysis above, our B2D-MMC algorithm is designed as in Algorithm 1.
Step 4. Compute the $d$ eigenvectors corresponding to the largest $d$ eigenvalues of $S_b(i) - \alpha S_w(i)$ to form $V(i)$ as in (25).
Step 5. If $i < p$, then set $i = i + 1$ and go to Step 2; else stop.
Algorithm 1
Figure 2: Images of one person from the ORL face database.
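The per-block-set loop of Algorithm 1 can be sketched in NumPy as follows. Only Steps 4 and 5 survive in the text above, so the earlier steps here (computing block means and block scatter matrices for each block set) are an assumption inferred from the definitions in Section 3.2, and the function names `b2dmmc_fit` and `b2dmmc_transform` are hypothetical.

```python
import numpy as np

def b2dmmc_fit(images, labels, p, d, alpha=1.0):
    """Learn one right-side projection V(i) of shape (n, d) per block set.

    images: (N, m, n) array with m divisible by p; labels: (N,) array.
    """
    N, m, n = images.shape
    blocks = images.reshape(N, p, m // p, n)   # blocks[:, i] is one block set
    projections = []
    for i in range(p):
        block_set = blocks[:, i]               # shape (N, r, n)
        M = block_set.mean(axis=0)             # mean of the block set
        S_b = np.zeros((n, n))
        S_w = np.zeros((n, n))
        for c in np.unique(labels):
            Xc = block_set[labels == c]
            Mc = Xc.mean(axis=0)               # class mean of the blocks
            S_b += len(Xc) * ((Mc - M).T @ (Mc - M))
            for X in Xc:
                S_w += (X - Mc).T @ (X - Mc)
        # Closed form: top-d eigenvectors of the symmetric matrix.
        _, vecs = np.linalg.eigh(S_b - alpha * S_w)
        projections.append(vecs[:, ::-1][:, :d])
    return projections

def b2dmmc_transform(image, projections):
    """Project each row block with its own V(i) and stack the sub-features."""
    p = len(projections)
    m, n = image.shape
    blocks = image.reshape(p, m // p, n)
    return np.vstack([B @ V for B, V in zip(blocks, projections)])

# Toy data: 20 random 32 x 32 "images" in 5 classes, 4 block sets, d = 6.
rng = np.random.default_rng(0)
images = rng.normal(size=(20, 32, 32))
labels = np.repeat(np.arange(5), 4)
Vs = b2dmmc_fit(images, labels, p=4, d=6)
feature = b2dmmc_transform(images[0], Vs)
```

The loop body never revisits an earlier block set, so the block sets could also be processed independently of one another.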

Computational Complexity Analysis.
Most of the algorithms involve computations that scale as $O(h^3)$ for the eigendecomposition of an $h \times h$ matrix [10]. The eigendecomposition of the scatter matrices in B2D-MMC amounts to a complexity of $O(pn^3)$. However, as reviewed in Section 2, in 2DLDA and 2DMMC the scatter matrices in each iteration are of size $n \times n$, so the overall computational complexity of 2DMMC is $O(Tn^3)$, where $T$ is the number of iterations. We can therefore expect $O(pn^3)$ to be smaller than $O(Tn^3)$ when $T$ is considerable.
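As a rough worked example of this comparison (an illustration under assumptions, not a result from the paper), take square images with $m = n = 32$ and $p = 4$ block sets, count one $n \times n$ eigendecomposition per block set for B2D-MMC, and count one $m \times m$ plus one $n \times n$ eigendecomposition per 2DMMC iteration, ignoring constant factors and the cost of building the scatter matrices. The iteration count $T$ is left as a free variable.

```latex
\text{B2D-MMC:}\quad p\,n^{3} = 4 \cdot 32^{3} = 131072,
\qquad
\text{2DMMC:}\quad T\,(m^{3} + n^{3}) = 65536\,T .
```

Under this simplified count, the one-side approach is cheaper as soon as $T > 2$, and the gap widens linearly with $T$.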

Experiments
In this section, to investigate the performance of the proposed B2D-MMC for face recognition, we compare our method with PCA [1], LDA [2], MMC [3], GLRAM [6], 2DLDA [7], and 2DMMC [8], in both accuracy and efficiency. Furthermore, the effect of image block size on recognition results is investigated.

Face Datasets.
In our experiments, we use two standard face recognition databases that are widely used as benchmark datasets in the feature extraction literature.
The ORL Face Database. There are ten images for each of the 40 human subjects, which were taken at different times, varying the lighting, facial expressions and facial details. Images from one subject are shown in Figure 2. The original images (with 256 gray levels) have size 92 × 112, which are resized to 32 × 32 for efficiency.
The Yale Face Database. It contains 11 gray scale images for each of the 15 individuals. The images demonstrate variations in lighting condition, facial expression, and with/without glasses. Images from one subject are shown in Figure 3. In our experiment, the images were also resized to 32 × 32.
In each experiment, TN images per subject were randomly selected as training samples, and the rest were used for testing. The training set was used to learn $p = 4$ subspaces, one for each block set. Thus the size of each image block is 8 × 32. Features of images for classification were formed by stacking subfeatures in the form of (18), and recognition was performed by a nearest neighbor classifier, with the Frobenius norm as the similarity metric. Since the training set was randomly chosen, we repeated each experiment 20 times and calculated the average recognition accuracy. In general, the recognition rate varies with $d$, that is, the number of columns of the feature (projected image). We set $d$ to the dimensionality at which 2DMMC obtained its best performance [8].
Tables 1 and 2 show the experimental results of the proposed B2D-MMC on the two databases, respectively, with the best results of PCA, LDA, MMC, GLRAM, 2DLDA, and 2DMMC taken from [8] for comparison. For all the methods, the value in each entry represents the average recognition accuracy of 20 independent trials, and the number in brackets is the corresponding projection dimensionality.
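The nearest neighbor rule with the Frobenius norm mentioned in the experimental setup can be sketched as below. The function name `nearest_neighbor_predict`, the array layouts, and the tiny hand-made features are illustrative assumptions rather than the authors' code.

```python
import numpy as np

def nearest_neighbor_predict(train_feats, train_labels, test_feats):
    """Assign each test feature the label of its closest training feature.

    train_feats: (N, m, d) array of matrix features.
    test_feats:  (Q, m, d) array of matrix features.
    Distance is the Frobenius norm of the difference of two feature matrices.
    """
    diffs = test_feats[:, None, :, :] - train_feats[None, :, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=(2, 3)))   # shape (Q, N)
    return train_labels[dists.argmin(axis=1)]

# Three training features with constant entries 0, 1, and 5.
train_feats = np.stack([np.zeros((4, 2)), np.ones((4, 2)), 5 * np.ones((4, 2))])
train_labels = np.array([0, 1, 2])
# Two test features with constant entries 0.9 and 4.0.
test_feats = np.stack([0.9 * np.ones((4, 2)), 4.0 * np.ones((4, 2))])
predicted = nearest_neighbor_predict(train_feats, train_labels, test_feats)
```

In this toy case the first test feature is closest to the all-ones training feature and the second to the all-fives one, so the predicted labels are 1 and 2.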

Comparison on Classification Accuracy.
Note that the dimensionality $d$ that gives the best performance for 2DMMC is not necessarily the best choice for B2D-MMC. Even so, B2D-MMC outperforms 2DMMC and the other feature extraction methods on both data sets.

Comparison on Efficiency.
In this subsection, B2D-MMC is compared with 2DMMC in computational efficiency. We take the ORL and the Yale datasets where TN = 2 for example; that is, two training samples are randomly selected for each subject.
For 2DMMC, we record the training time as follows. Taking the entries in Tables 1 and 2 as the best classification accuracies, that is, 78.75% on the ORL and 54.37% on the Yale dataset, the training iteration stops once the difference between the obtained classification accuracy and the best classification accuracy is smaller than 0.1%. The projection dimensionality for training is set to the value that gave the best classification accuracy, that is, 12 × 12 for the ORL and 6 × 6 for the Yale dataset. The average training time of B2D-MMC and 2DMMC, over 20 independent runs on a typical laptop using MATLAB, is shown in Figure 4. It can be seen that B2D-MMC is more efficient than 2DMMC on both the ORL and the Yale dataset. This agrees with the complexity analysis in Section 3.4. The reason is that, unlike 2DMMC, which finds iterative solutions of two projection matrices for the entire image matrix, B2D-MMC seeks a closed form solution of a one-side projection matrix for each block set, which avoids iterations and alternations and decreases the computational load.

Effect of Number of Blocks on Recognition Results.
The proposed B2D-MMC has been applied to the ORL and the Yale datasets with the same settings as in Section 4.1 but for three different values of $p$, namely, 3, 4, and 5. The results, shown in Figures 5 and 6, reveal that B2D-MMC performs best when $p$ takes an appropriate value, for example, $p^* = 4$, and that either raising or reducing $p$ from $p^*$ degrades the performance of B2D-MMC. This can be interpreted as follows. An increase in the number of blocks per image helps to learn more local characteristics, whereas a decrease in the number helps to utilize more global characteristics. The optimal recognition performance results from the tradeoff between local and global information.

Conclusions
This paper proposed a novel framework to extract discriminating features directly from 2D face images. The proposed B2D-MMC introduces a block-wise model for face recognition, performing one-side subspace projection inside each block manifold, in which a block is close to those belonging to the same class but far from those belonging to different classes. The unilateral projection and the block-wise learning avoid the iterations and alternations of current bilateral projection based two-dimensional feature extraction approaches and have advantages in complexity and locality.
Computational complexity analysis shows that B2D-MMC consumes less time than 2DLDA and 2DMMC when