Extrinsic Least Squares Regression with Closed-Form Solution on Product Grassmann Manifold for Video-Based Recognition

Least squares regression is a fundamental tool in statistical analysis and can be more effective than complicated models when the number of training samples is small. Representing multidimensional data on the product Grassmann manifold has recently led to notable results in various visual recognition tasks. This paper proposes an extrinsic least squares regression with the Projection Metric on the product Grassmann manifold by embedding the Grassmann manifold into the space of symmetric matrices via an isometric mapping. The proposed regression has a closed-form solution, which is more accurate than the numerical solution of the previous least squares regression based on geodesic distance. Experiments on several recognition tasks show that the proposed method achieves competitive accuracy in comparison with some state-of-the-art methods.


Introduction
As an important application of computer vision, video-based recognition such as action recognition [1] attracts more and more attention. For inferring the correct label of a query from a given database of examples, there are mainly two kinds of methods. One kind is based on representations with handcrafted features, and the other is based on deep learning architectures such as Convolutional Neural Networks (CNN) [2]. Generally speaking, deep learning algorithms have been shown to be successful when a large amount of data is available [3,4]. However, the databases for many recognition tasks in daily life are small. In this case, deep learning algorithms lose efficacy, and it becomes important to analyze the structure of the data and represent it with discriminant features.
The Grassmann manifold has proven to be a powerful representation for video-based applications such as activity classification [5], action recognition [6], age estimation [7], and face recognition [8,9]. In these applications, the Grassmann manifold is used to characterize the intrinsic geometry of the data. Taking one representative work as an example, Lui [10] factorized a data tensor using Higher Order Singular Value Decomposition (HOSVD) and imposed each factorized element on a Grassmann manifold. This representation yields a very discriminating structure for action recognition.
Inference on manifold spaces can be achieved extrinsically by embedding the manifold into a Euclidean space, which can be considered as flattening the manifold. In the literature, the most popular choice of embedding is through tangent spaces [11,12]. For example, Lui [10] presented a least squares regression on the product Grassmann manifold, in which the weighted average of the training samples was computed in the tangent space and projected back to the Grassmann manifold by the standard logarithmic and exponential maps. In a tangent space, only the distance from a point to the tangent pole equals the geodesic distance, which is restrictive and may lead to inaccurate modeling. An alternative method embeds the Grassmann manifold into the space of symmetric matrices by a diffeomorphism [13] and uses the Projection Metric [14], which is equal to the true Grassmann geodesic distance up to a scale of $\sqrt{2}$.

Mathematical Problems in Engineering
In this paper, by representing multidimensional data on the product Grassmann manifold in the same form as Lui [10], we propose an extrinsic least squares regression on the product Grassmann manifold using the Projection Metric and give a closed-form solution, which is more accurate. As a simple statistical model, least squares regression has several advantages: it is simple to compute and can be more effective than complicated models when the number of training samples is small [15]. We evaluate the proposed method on three small-scale datasets (hand gesture, Ballet, and traffic); the recognition rates show that our method is competitive with some state-of-the-art methods.
The rest of this paper is organized as follows: Section 2 introduces the mathematical background; Section 3 gives the product Grassmann manifold representation for video; Section 4 presents the distance on the product Grassmann manifold; Section 5 proposes the extrinsic least squares regression on the product Grassmann manifold; Section 6 gives the classification based on extrinsic least squares regression; Section 7 reports experiments on different datasets, which show that the proposed method achieves competitive accuracy; Section 8 analyzes the time complexity of the proposed method; and Section 9 gives a conclusion.

Mathematical Background
In this section, we introduce the mathematical background used in this paper.

Grassmann Manifold.
The Stiefel manifold $S(p, d)$ is the set of all $d \times p$ matrices with orthonormal columns; that is,
$$S(p, d) = \{X \in \mathbb{R}^{d \times p} : X^{T} X = I_p\}, \tag{1}$$
where $I_p$ is the $p \times p$ identity matrix. The Grassmann manifold $G(p, d)$ can be defined as a quotient manifold of $S(p, d)$ under an equivalence relation $\sim$. In fact, for any $X, Y \in S(p, d)$,
$$X \sim Y \iff \operatorname{span}(X) = \operatorname{span}(Y), \tag{2}$$
where $\operatorname{span}(X)$ is the subspace spanned by the columns of $X \in S(p, d)$. In other words, the Grassmann manifold $G(p, d)$ is the space of $p$-dimensional linear subspaces of $\mathbb{R}^{d}$ for $0 < p < d$ [16], each of which may be specified by a $d \times p$ matrix with orthonormal columns. Notice that the choice of the matrix $X$ for a point $\operatorname{span}(X)$ on the Grassmann manifold is not unique; that is, the same point on the Grassmann manifold can be spanned by different matrices $X$ and $Y$.
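As an illustration of this non-uniqueness, the following NumPy sketch (not part of the original method; the dimensions and the random seed are arbitrary choices made here) builds two different orthonormal bases of the same subspace and checks that they give the same projector $XX^{T}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 6, 2  # ambient and subspace dimensions (illustrative values only)

# Orthonormal basis X of a random p-dimensional subspace of R^d,
# i.e. a point on the Stiefel manifold.
X, _ = np.linalg.qr(rng.standard_normal((d, p)))

# A p x p orthogonal matrix Q changes the basis but not the subspace.
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
Y = X @ Q

# Y still has orthonormal columns, differs from X as a matrix,
# yet defines the same projector and hence the same Grassmann point.
same_columns_orthonormal = np.allclose(Y.T @ Y, np.eye(p))
different_matrices = not np.allclose(X, Y)
same_projector = np.allclose(X @ X.T, Y @ Y.T)
```

The projector is the quantity that stays fixed across bases, which is why an embedding through $XX^{T}$ is well defined on the Grassmann manifold.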

Higher Order Singular Value Decomposition (HOSVD).
HOSVD is a multilinear SVD operating on tensors. Let $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ be a tensor of order $N$. The process of reordering the elements of an $N$-mode tensor into a matrix is called matricization. The mode-$k$ ($k = 1, \ldots, N$) matricization of a tensor $\mathcal{A}$ is denoted by $A_{(k)}$ (see details in [17]). Each $A_{(k)}$ is then factored using the SVD as follows:
$$A_{(k)} = U^{(k)} \Sigma^{(k)} V^{(k)T}, \tag{3}$$
where $\Sigma^{(k)}$ is a diagonal matrix, $U^{(k)}$ is an orthogonal matrix that spans the column space of $A_{(k)}$, and $V^{(k)}$ is an orthogonal matrix that spans the row space of $A_{(k)}$. Using HOSVD, an order-$N$ tensor $\mathcal{A}$ can be decomposed as follows:
$$\mathcal{A} = \mathcal{S} \times_1 U^{(1)} \times_2 U^{(2)} \cdots \times_N U^{(N)}, \tag{4}$$
where $\mathcal{S} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is the core tensor, $U^{(1)}, U^{(2)}, \ldots, U^{(N)}$ are the orthogonal matrices given in (3), and $\times_k$ denotes mode-$k$ multiplication.
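The decomposition can be checked numerically. The sketch below is a minimal NumPy illustration under our own assumptions: the helper names `unfold` and `mode_multiply` are hypothetical, and the column ordering of the unfolding is one of several conventions (it does not affect the left singular vectors). It computes the orthogonal factors from each unfolding, forms the core tensor, and verifies that the tensor is recovered.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows are indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def mode_multiply(T, M, mode):
    """Mode-`mode` product of tensor T with matrix M of shape (J, T.shape[mode])."""
    out = np.tensordot(M, T, axes=(1, mode))  # contracted axis comes out first
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5, 3))  # small order-3 tensor (illustrative size)

# Left singular vectors of every unfolding (square orthogonal matrices).
U = [np.linalg.svd(unfold(A, k), full_matrices=True)[0] for k in range(A.ndim)]

# Core tensor: multiply A by the transposed factors along every mode.
S = A
for k, Uk in enumerate(U):
    S = mode_multiply(S, Uk.T, k)

# Reconstruction: multiply the core tensor by the factors along every mode.
A_rec = S
for k, Uk in enumerate(U):
    A_rec = mode_multiply(A_rec, Uk, k)

reconstruction_error = np.linalg.norm(A - A_rec)
```

Because each factor is a full orthogonal matrix, the reconstruction is exact up to floating-point error.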

Product Grassmann Manifold Representation for Video
Video is a kind of multidimensional data and can be represented as a tensor $\mathcal{A} \in \mathbb{R}^{I_1 \times I_2 \times I_3}$, where $I_1$, $I_2$, and $I_3$ represent the height, width, and length of the video, respectively. The variation of each mode can be captured by HOSVD. Lui et al. [18] found that the traditional HOSVD is not appropriate for forming a product manifold, so they redefined HOSVD to factorize the tensor using the orthogonal matrices $V^{(1)}$, $V^{(2)}$, and $V^{(3)}$ described in (3). That is, the triple $(V^{(1)}, V^{(2)}, V^{(3)})$, whose factors span the row spaces of the three unfoldings, is a representation for videos on the product Grassmann manifold.
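A minimal NumPy sketch of this representation follows. It is an illustration under stated assumptions rather than the implementation used in the experiments: the function name `video_to_product_grassmann` is hypothetical, the truncation to the leading `p` right singular vectors is our choice, and a random array stands in for a real video of size 24 × 32 × 23.

```python
import numpy as np

def unfold(T, mode):
    """Mode-`mode` matricization: rows are indexed by the chosen mode."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def video_to_product_grassmann(video, p):
    """Return (V1, V2, V3): orthonormal bases of the leading p-dimensional
    row space of each unfolding (truncation to p is an assumption)."""
    factors = []
    for k in range(3):
        # Rows of Vh are the right singular vectors of the unfolding.
        _, _, Vh = np.linalg.svd(unfold(video, k), full_matrices=False)
        factors.append(Vh[:p].T)
    return tuple(factors)

rng = np.random.default_rng(0)
video = rng.standard_normal((24, 32, 23))  # stand-in for height x width x length
V1, V2, V3 = video_to_product_grassmann(video, p=5)
shapes = [V.shape for V in (V1, V2, V3)]
```

Each factor has orthonormal columns, so each is a valid representative of a point on one factor Grassmann manifold.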

Distance on Product Grassmann Manifold
The metric on the Grassmann manifold is the geodesic distance, which is the length of the shortest curve between two $p$-dimensional subspaces $X$ and $Y$, that is,
$$d_G(X, Y) = \sqrt{\sum_{i=1}^{p} \sin^{2} \theta_i}, \tag{6}$$
with $\theta_i$ representing the principal angles [16]. Recently, Chikuse [13] introduced a projection embedding $\Pi : G(p, d) \to \operatorname{sym}(d)$, $\Pi(X) = XX^{T}$, where $\operatorname{sym}(d)$ denotes the space of $d \times d$ symmetric matrices. Hamm and Lee [19] defined a distance called the Projection Metric on the Grassmann manifold as follows.
Definition 1. Given two points $X$ and $Y$ on the Grassmann manifold $G(p, d)$, the distance between $X$ and $Y$ is defined as
$$d_P(X, Y) = 2^{-1/2} \left\| XX^{T} - YY^{T} \right\|_F. \tag{7}$$
Remark 2. For matrices $X \neq Y$ in $S(p, d)$, if there exists a $p \times p$ orthogonal matrix $Q$ such that $X = YQ$, then the element $\operatorname{span}(X) \in G(p, d)$ is equal to the element $\operatorname{span}(Y) \in G(p, d)$. In this case $XX^{T} = YQQ^{T}Y^{T} = YY^{T}$, so $d_P$ does not depend on the chosen basis. Hence it is feasible to use the matrix $X$ to represent $\operatorname{span}(X)$. Moreover, $d_P(X, Y)$ is equal to the geodesic distance between two points on the Grassmann manifold [14].
Based on Definition 1, we define the distance on the product Grassmann manifold as the sum of the distances on the factor Grassmann manifolds:
$$d\big((X_1, Y_1, Z_1), (X_2, Y_2, Z_2)\big) = d_P(X_1, X_2) + d_P(Y_1, Y_2) + d_P(Z_1, Z_2). \tag{8}$$
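The following NumPy sketch illustrates these distances. It assumes the Projection Metric is normalized as $2^{-1/2}\|XX^{T} - YY^{T}\|_F$ (the normalization is an assumption made here), in which case it coincides with $\sqrt{\sum_i \sin^2 \theta_i}$ computed from the principal angles. The function names are hypothetical.

```python
import numpy as np

def projection_metric(X, Y):
    """Projection Metric through the embedding X -> X X^T (assumed normalization)."""
    return np.linalg.norm(X @ X.T - Y @ Y.T, "fro") / np.sqrt(2.0)

def projection_metric_from_angles(X, Y):
    """Same quantity from principal angles: cosines are singular values of X^T Y."""
    cosines = np.clip(np.linalg.svd(X.T @ Y, compute_uv=False), 0.0, 1.0)
    return np.sqrt(np.sum(1.0 - cosines**2))

def product_distance(P, Q):
    """Sum of factor-wise Projection Metric distances between two tuples of bases."""
    return sum(projection_metric(A, B) for A, B in zip(P, Q))

def random_point(rng, d, p):
    """Orthonormal basis of a random p-dimensional subspace of R^d."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, p)))
    return Q

rng = np.random.default_rng(0)
X = random_point(rng, 10, 3)
Y = random_point(rng, 10, 3)
d_frob = projection_metric(X, Y)
d_angles = projection_metric_from_angles(X, Y)
```

The two routes agree because $\|XX^{T} - YY^{T}\|_F^{2} = 2p - 2\|X^{T}Y\|_F^{2} = 2\sum_i \sin^{2}\theta_i$.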

Extrinsic Least Squares Regression on Product Grassmann Manifold
Least squares regression is a simple and efficient technique in statistical analysis. In Euclidean space, the parameter $\beta \in \mathbb{R}^{n \times 1}$ is estimated by minimizing the residual sum of squares
$$R(\beta) = \| y - A\beta \|^{2}, \tag{9}$$
where $A \in \mathbb{R}^{m \times n}$ is the training set and $y \in \mathbb{R}^{m \times 1}$ is the regression value. The estimated parameter has the closed-form solution
$$\hat{\beta} = (A^{T}A)^{-1} A^{T} y. \tag{10}$$
Hence the corresponding error is
$$\left\| y - A (A^{T}A)^{-1} A^{T} y \right\|^{2}. \tag{11}$$
On the Grassmann manifold, Lui [10] extended linear least squares regression to a nonlinear form. In detail, the estimated parameter is equal to
$$\hat{\beta} = (A \star A)^{-1} (A \star y), \tag{12}$$
where $\star$ is a nonlinear similarity operator, $A$ is a set of training samples on the manifold, and $y$ is an element on the manifold. So the corresponding error is
$$\left\| y - A \circ (A \star A)^{-1} (A \star y) \right\|^{2}, \tag{13}$$
where $\circ$ is an operator mapping points from the vector space back to the manifold. Because the Grassmann manifold is not closed under ordinary matrix subtraction and addition, the mapping $\circ$ is realized with the exponential map and its inverse, which has no closed-form solution. To realize the composition map $\circ$, an improved Karcher mean computation algorithm is employed. To avoid the accuracy loss of this iterative algorithm, we introduce an extrinsic least squares regression on the Grassmann manifold by embedding its elements into the space of symmetric matrices. Because the distance on the product Grassmann manifold in (8) is additive over the factors, the extrinsic least squares regression on the product Grassmann manifold reduces to three independent subregression problems, one on each factor. Taking one factor as an example, we show the details in the following.
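Let $\{D_j \in G(p, d)\}_{j=1,2,\ldots,n}$ be the training set, where $n$ is the number of samples, let $y = (y_j) \in \mathbb{R}^{n \times 1}$ be the fitting parameter, and let $X \in G(p, d)$ be the regression value. Similar to the idea of least squares regression in Euclidean space, we give a regression on the Grassmann manifold, which is defined in the embedded space of symmetric matrices. The residual is measured as follows:
$$\min_{y} \Big\| XX^{T} - \sum_{j=1}^{n} y_j D_j D_j^{T} \Big\|_F^{2}, \tag{14}$$
where $y_j$ is the $j$th element of the vector $y$. Next we show how to solve the optimization. We have
$$\Big\| XX^{T} - \sum_{j=1}^{n} y_j D_j D_j^{T} \Big\|_F^{2} = \operatorname{tr}(XX^{T}XX^{T}) - 2 \sum_{j=1}^{n} y_j \operatorname{tr}(XX^{T} D_j D_j^{T}) + \sum_{i=1}^{n} \sum_{j=1}^{n} y_i y_j \operatorname{tr}(D_i D_i^{T} D_j D_j^{T}), \tag{15}$$
and we write
$$[K(X, D)]_j = \operatorname{tr}(XX^{T} D_j D_j^{T}) = \| X^{T} D_j \|_F^{2}, \qquad [K(D)]_{ij} = \operatorname{tr}(D_i D_i^{T} D_j D_j^{T}) = \| D_i^{T} D_j \|_F^{2}, \tag{16}$$
with $\operatorname{tr}(XX^{T}XX^{T}) = p$ because $X^{T}X = I_p$. Hence model (14) becomes
$$\min_{y} \; p - 2\, y^{T} K(X, D) + y^{T} K(D)\, y. \tag{17}$$
Setting the derivative of (17) with respect to $y$ equal to 0, we have
$$\big( K(D) + K(D)^{T} \big)\, y - 2 K(X, D) = 0. \tag{18}$$
Since $K(D)$ is symmetric, and assuming it is invertible, the solution of optimization (14) is
$$y^{*} = K(D)^{-1} K(X, D). \tag{19}$$
Hence the corresponding error becomes
$$\Big\| XX^{T} - \sum_{j=1}^{n} y_j^{*} D_j D_j^{T} \Big\|_F^{2} = p - K(X, D)^{T} K(D)^{-1} K(X, D). \tag{20}$$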
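The closed-form solution can be sketched in NumPy as follows. This is an illustration under our own assumptions, not the authors' implementation: the residual is taken as the squared Frobenius norm in the embedded space, $K(D)$ denotes the $n \times n$ matrix with entries $\|D_i^{T} D_j\|_F^{2}$, $K(X, D)$ denotes the vector with entries $\|X^{T} D_j\|_F^{2}$, $K(D)$ is assumed nonsingular, and all function names are hypothetical.

```python
import numpy as np

def gram(D):
    """K(D)[i, j] = ||D_i^T D_j||_F^2 = tr(D_i D_i^T D_j D_j^T)."""
    n = len(D)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = np.linalg.norm(D[i].T @ D[j], "fro") ** 2
    return K

def cross(X, D):
    """K(X, D)[j] = ||X^T D_j||_F^2 = tr(X X^T D_j D_j^T)."""
    return np.array([np.linalg.norm(X.T @ Dj, "fro") ** 2 for Dj in D])

def elsr_fit(X, D):
    """Solve K(D) y = K(X, D) and return (y, residual at the optimum)."""
    K, k = gram(D), cross(X, D)
    y = np.linalg.solve(K, k)       # assumes K(D) is nonsingular
    residual = X.shape[1] - k @ y   # p - K(X,D)^T K(D)^{-1} K(X,D)
    return y, residual

def embedded_residual(X, D, y):
    """Direct evaluation of ||X X^T - sum_j y_j D_j D_j^T||_F^2."""
    R = X @ X.T - sum(yj * Dj @ Dj.T for yj, Dj in zip(y, D))
    return np.linalg.norm(R, "fro") ** 2

rng = np.random.default_rng(0)
d, p, n = 8, 2, 4  # illustrative sizes
D = [np.linalg.qr(rng.standard_normal((d, p)))[0] for _ in range(n)]
X = np.linalg.qr(rng.standard_normal((d, p)))[0]
y_star, err = elsr_fit(X, D)
```

Only an $n \times n$ linear system is solved, so no iterative mean computation on the manifold is needed; when the query coincides with a training sample, the solution is the corresponding unit vector and the residual vanishes.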

Recognition Based on Extrinsic Least Squares Regression
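In this section, we consider the 3-order product Grassmann manifold for videos; the situation for higher orders is similar. Suppose $C$ classes are defined for the data. We denote the training set of the $c$th class as $\{(U_j^{c}, V_j^{c}, W_j^{c})\}_{j=1,2,\ldots,n_c}$, where $n_c$ is the number of samples. Our objective is to infer the class to which the test sample $(X, Y, Z) \in G(p_1, d_1) \times G(p_2, d_2) \times G(p_3, d_3)$ belongs.
The residual error of the query sample $(X, Y, Z)$ for class $c$ is defined as
$$\varepsilon_c = \Big\| XX^{T} - \sum_{j=1}^{n_c} u_{c,j}^{*} U_j^{c} U_j^{cT} \Big\|_F^{2} + \Big\| YY^{T} - \sum_{j=1}^{n_c} v_{c,j}^{*} V_j^{c} V_j^{cT} \Big\|_F^{2} + \Big\| ZZ^{T} - \sum_{j=1}^{n_c} w_{c,j}^{*} W_j^{c} W_j^{cT} \Big\|_F^{2}, \tag{21}$$
where $u_c^{*}$, $v_c^{*}$, $w_c^{*}$ ($c = 1, \ldots, C$) are the solutions of the subregressions on each factor Grassmann manifold, respectively. The category of the query sample is determined by
$$c^{*} = \arg\min_{c} \varepsilon_c. \tag{22}$$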

Experiments on Different Datasets
In this section, we compare the performance of the proposed method with some state-of-the-art methods on three datasets.
The Cambridge hand gesture dataset [20] contains 900 video sequences of nine kinds of hand gestures, divided into 5 sets according to different illuminations. Figure 1 shows some hand gesture samples. Set 5 (normal illumination) is used for training, while the remaining sequences (with different illumination characteristics) are used for testing. The original sequences are converted to grayscale and resized to 24 × 32 × 23. We denote our method as ELSR and report the correct recognition rate (CRR) for the four illumination sets in Table 1. Compared with product manifold (PM) [10], Grassmann Sparse Coding (gSC) [14], Grassmann Locality-Constrained Coding (gLC) [14], kernel Grassmann Sparse Coding (kgSC) [14], and kernel Grassmann Locality-Constrained Coding (kgLC) [14], our method is competitive with these state-of-the-art methods.

Ballet Dataset.
The Ballet dataset [21] consists of 44 videos collected from a Ballet instruction DVD. The dataset includes 8 complex motion patterns performed by 3 persons. In detail, the actions are "right-to-left hand opening," "left-to-right hand opening," "standing hand opening," "jumping," "leg swinging," "hopping," "turning," and "standing still". The main challenge of this dataset is the large variation in speed, clothing, and motion paths. Figure 2 shows some examples of the dataset. Table 2 shows that ELSR has superior performance compared with gSC-dic, gLC-dic, kgSC-dic, and kgLC-dic [14].

Scene Analysis.
For scene analysis, we use the UCSD traffic dataset [22], which contains 254 videos of highway traffic under different weather conditions. The resolution is 320 × 240, and the number of frames ranges from 42 to 52. The dataset is divided into three classes ("heavy," "medium," and "light") according to the traffic congestion level. In total, 44 sequences are labeled as heavy traffic, 45 as medium traffic, and 165 as light traffic. Figure 3 shows some typical examples. In the experiment, we use the first 40 frames of each video, converted to grayscale and resized to 48 × 48. We adopt the four pairs of training and testing sets provided in [23]. The classification results are shown in Table 3; the average correct recognition rate of ELSR is higher than that of gSC and gLC but lower than that of kgSC and kgLC.
Figure 2: Examples from the Ballet dataset.
Table 2: Correct recognition rate (CRR) on the Ballet dataset.
Method          CRR
gSC-dic [14]    79.64 ± 1.1%
gLC-dic [14]    81.42 ± 0.8%
kgSC-dic [14]   83.53 ± 0.8%
kgLC-dic [14]   86.94 ± 1.1%
ELSR            96.31 ± 1.7%

Discussion.
From the above experiments, we conclude that the proposed method is more effective for action recognition than for scene analysis. The product Grassmann manifold captures appearance, horizontal motion, and vertical motion through its three factor manifolds. To visualize the product manifold representation, the overlaid appearance, horizontal motion, and vertical motion of examples from the three datasets are given in Figure 4. Note that the hand gesture examples show obvious variation along horizontal motion, and the Ballet examples show it along both horizontal and vertical motion. The curves in the last two columns characterize the motion and are the key factors for recognition. This can be seen as an explanation of the higher CRR of ELSR on the Ballet dataset. Meanwhile, for samples from UCSD, horizontal and vertical motion features are not clear because all cars run along the same path, and the critical factor is appearance, which characterizes the number of cars. Hence, for the UCSD dataset, the CRR of ELSR is only slightly higher than that of gSC and gLC and lower than that of kgSC and kgLC, which map to higher-dimensional manifolds using a kernel function to diminish nonlinearity.

Conclusion
In this paper, we propose an extrinsic least squares regression on the product Grassmann manifold. A video can be viewed as a third-order tensor and then transformed, through HOSVD factorization, into a point on the product Grassmann manifold. One advantage of this method is that the regression has a closed-form solution, which leads to a higher correct recognition rate. Moreover, the proposed method remains effective when the number of training samples is small. Experiments on different recognition tasks (hand gesture recognition, action recognition, and scene analysis) show that our method performs well on three small-scale public datasets.
In future work, we would like to devise a kernel version of the extrinsic least squares regression on the product manifold.

Conflicts of Interest
The authors declare that they have no conflicts of interest.