Similarity Measure Learning in Closed-Form Solution for Image Classification

Adopting a suitable measure is essential in many multimedia applications. Recently, distance learning has become an active research problem. The distance is the natural measure of dissimilarity. Generally, a pairwise relationship between two objects in learning tasks includes two aspects: similarity and dissimilarity. The similarity measure provides different information about pairwise relationships. However, similarity learning has received less attention in learning problems. In this work, we first propose a general framework for similarity measure learning (SML). Additionally, we define a generalized type of correlation as a similarity measure. Through a set of parameters, generalized correlation provides flexibility for learning tasks. Based on this similarity measure, we present a specific algorithm under the SML framework, called correlation similarity measure learning (CSML), to learn a parameterized similarity measure over the input space. A nonlinear extension of CSML, kernel CSML (KCSML), is also proposed. In particular, we give a closed-form solution that avoids the iterative search for a locally optimal solution in the high-dimensional space used in previous work. Finally, classification experiments have been performed on face databases and a handwritten digits database to demonstrate the efficiency and reliability of CSML and KCSML.


Introduction
Pairwise matching, which is based on a measure (similarity or dissimilarity), is ubiquitous in multimedia applications. The performance of multimedia learning techniques depends sensitively on the selected measure [1][2][3]. Recently, measure learning has become an active research problem for multimedia learning tasks, for example, image classification [4,5]. Previous measure learning studies mainly focused on distance (dissimilarity) learning. One of the earliest distance learning algorithms was presented by Xing et al. [6], where a parameterized Mahalanobis distance was learned. Many distance learning studies followed [7][8][9][10]; they are reviewed later. A distance metric has two kinds of disadvantages. On the one hand, a distance learning task results in an optimization problem that usually does not admit a closed-form solution. Xing et al. [6], Lee et al. [11], Kumar and Kummamuru [12], Jin et al. [13], and Yin et al. [14] all described distance metric learning through an iterative process. Iterative methods are difficult to extend to kernel versions. Moreover, the iterative procedure is inefficient and unstable. On the other hand, several recent studies suggest that the strict metric axioms (self-similarity, symmetry, and triangle inequality) are epistemologically invalid for the perceptual distance of human beings [15,16] and not well suited to robust pattern recognition [17].
The other aspect of the relationship between two objects in learning tasks is similarity. Since measure models vary in engineering practice, dissimilarity and similarity are not simply complementary: similarity cannot be viewed as merely the negative or the reciprocal of dissimilarity, and it is necessary to distinguish the two notions. Similarity measures fall into two categories, inner product based and kernel function based, both of which are considered in this work.
The Scientific World Journal
Many publications support the view that the intrinsic structure of the feature space for image classification lies on low-dimensional manifolds [18][19][20]. Compared with Euclidean distance, correlation is better able to capture the intrinsic structure embedded in high-dimensional data. Correlation is a type of normalized inner product and a scale-invariant index. It corresponds to the notion of "angle" in geometry. In recent years, some studies have used correlation as a similarity measure for dimension reduction [21][22][23]. However, since correlation is in fraction form, the existing correlation-based dimension reduction algorithms, such as correlation embedding analysis (CEA) [21], canonical correlation analysis (CCA) [22], and correlation discriminant analysis (CDA) [23], construct low-dimensional embeddings through iterative procedures.
In this work, we present a similarity measure learning framework for supervised classification. In particular, under this framework, a correlation similarity measure learning algorithm is constructed with a closed-form solution. It needs no iterative update process and involves only eigenvalue decomposition operations. Furthermore, it is extended to a kernelized version.
Dissimilarity metric (distance) learning and dimensionality reduction offer much inspiration for learning an appropriate similarity measure. Here, we provide a concise review of them.

Dissimilarity Metric Learning.
Many dissimilarity metric learning algorithms have been presented in a variety of application areas. These methods can be categorized in different ways. Generally, there are two ways to categorize them: (1) unsupervised versus supervised learning and (2) global versus local methods. In this work, the latter categorization scheme is adopted.
Among global methods, the best known is the early distance metric learning algorithm of Xing et al. [6], which is described in detail later. This algorithm is further extended to the nonlinear case in [24] by the introduction of kernels, where a given kernel is idealized such that it becomes more similar to the ideal kernel; this also leads to a quadratic programming problem. Relevant component analysis (RCA) [25] learns a global linear transformation from equivalence constraints. Instead of the iterative solution in [6], it uses only closed-form expressions of the data and is based on subsets of points called chunklets. However, RCA has two important disadvantages. One is that it does not exploit negative constraints, which can also be informative, and the other is its inability to capture complex nonlinear relationships between data instances with contextual information [8]. Discriminative component analysis (DCA) and kernel DCA [8] improve RCA by exploiting negative constraints with contextual information. Kernel RCA [26] and kernel DCA use the kernel trick to discover the nonlinear structures of the given data. Recently, Wang [9] proposed a method to learn a Mahalanobis distance metric in a semisupervised mode by maximizing the so-called constraint-margin maximization (CMM) criterion. CMM is based on the graph embedding framework [27], which is often used in dimensionality reduction problems.
All the global methods are based on global constraints or side information. However, real-world data may not satisfy the global linear assumption. So more approaches fall into the local category, which approximates global nonlinear data structures through local linear alignment. Discriminant adaptive nearest neighbor [10] estimates a local distance metric using local linear discriminant analysis. Neighbourhood components analysis (NCA) [28] learns a Mahalanobis distance metric by directly maximizing a stochastic variant of the leave-one-out KNN score on the training set. The large margin nearest neighbor (LMNN) classifier [29] extends NCA through a maximum-margin framework. It reformulates the optimization problem as an instance of semidefinite programming, which is also solved by an iterative process. Many other recent studies [29][30][31] also focus on neighbor information.

Dimensionality Reduction.
Most of the algorithms above are based on the so-called Mahalanobis distance framework, which may be viewed as constructing a global linear transformation of the data and then applying the Euclidean distance to the transformed data. The Mahalanobis distance is
$$d_M(x_i, x_j) = \sqrt{(x_i - x_j)^T M (x_i - x_j)}. \tag{1}$$
It requires $M \succeq 0$ to ensure that this can be used as a metric. So $M$ can be represented as $M = WW^T$. Learning the Mahalanobis distance is then equivalent to finding a transformation matrix $W$ ($y = W^T x$). Learning the transformation matrix $W$ yields the Mahalanobis metric $M = WW^T$ according to
$$d_M(x_i, x_j) = \|W^T x_i - W^T x_j\| = \sqrt{(x_i - x_j)^T W W^T (x_i - x_j)}. \tag{2}$$
Dimensionality reduction methods without an explicit transformation may also be viewed as searching for an appropriate embedding in a lower-dimensional space. So distance metric learning has an affinity with dimensionality reduction.
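The equivalence between a Mahalanobis distance with $M = WW^T$ and a Euclidean distance after the linear map $y = W^T x$ can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`transform`, `gram`, `mahalanobis`) and the toy values of `W`, `x`, and `y` are illustrative assumptions, not taken from the paper.

```python
import math

def transform(W, x):
    """Return W^T x, where W is a list of D rows with d columns each."""
    d = len(W[0])
    return [sum(W[r][c] * x[r] for r in range(len(x))) for c in range(d)]

def gram(W):
    """Return M = W W^T, which is positive semidefinite by construction."""
    D, d = len(W), len(W[0])
    return [[sum(W[i][k] * W[j][k] for k in range(d)) for j in range(D)]
            for i in range(D)]

def mahalanobis(M, x, y):
    """Return sqrt((x - y)^T M (x - y)) for a square matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    m_diff = [sum(M[r][c] * diff[c] for c in range(len(diff)))
              for r in range(len(diff))]
    return math.sqrt(sum(a * b for a, b in zip(diff, m_diff)))

# Toy example (hypothetical values): D = 3 input features, d = 2 outputs.
W = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
x, y = [1.0, 2.0, 3.0], [0.0, 1.0, 5.0]

d_mahal = mahalanobis(gram(W), x, y)
d_eucl = math.dist(transform(W, x), transform(W, y))
print(d_mahal, d_eucl)
```

Both printed values agree, which is why learning $W$ and learning $M$ can be treated as the same task.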
Methods for dimensionality reduction can be divided into two categories: (1) with explicit transformation and (2) with implicit transformation. The former includes almost all subspace learning algorithms. PCA, LDA, NMF, LPP [32], Laplacian Eigenmap [33], and their extended versions [34][35][36] all yield a transformation matrix by optimizing some objective criterion. Most classical manifold learning algorithms, such as LLE, ISOMAP, and LTSA, belong to the latter category. Lacking an explicit transformation, manifold-based methods are more suitable for data visualization than for classification. Inspired by NMF and LPP, the graph embedding framework [27] has become popular [37,38]. It provides more flexibility through the design of diverse graphs and weight matrices.
This paper aims at solving the following problem: given a set of sample data with class labels or pairwise constraints, the task is to learn an appropriate similarity measure for classification. Several related distance learning and dimensionality reduction algorithms are introduced first. Previous uses of correlation in classification are also discussed in Section 2. The methods of Xing et al. and Xiang et al. are both used to learn a distance metric. However, Xing et al. mainly concentrated on clustering applications. CEA [21], CCA [22], and CDA [23] all apply correlation to classification. Moreover, CEA [21] is also based on the graph embedding framework [27], as our method is. These algorithms are all described in detail in Section 2.
Sections 3, 4, and 5 form the core of this paper. Section 3 gives the precise definition of the similarity measure learning problem and introduces a general formulation for it. This formulation can be specialized to diverse measure learning algorithms depending on the choice of neighbor graph, affinity weights, and similarity measure. In Section 4, a strategy is given to form a specific similarity measure learning algorithm. First, a generalized correlation is defined. After that, two kinds of constraints are introduced, which are based on two kinds of neighbor graphs and the corresponding affinity weights. Most importantly, an approximate optimization problem and its closed-form solution are presented. Following that, it is extended to the nonlinear version in Section 5. Experiments conducted to evaluate the effectiveness of these new measure learning approaches for classification are reported in Section 7. Additionally, discussions and conclusions are given in Sections 6 and 8, respectively.
The overall sequence of the core sections in this paper can be illustrated as follows.

SML: a framework
  The definition of the SML problem
  General framework for SML
CSML: an algorithm
  Generalized correlation
  Optimization problem for CSML
  The closed-form solution of CSML
KCSML: a nonlinear extension of CSML

Related Work
This section provides a brief overview of closely related studies and places our work in the context of other algorithms.

Xing and Xiang's Methods.
Consider a distance metric of the form
$$d_M(x_i, x_j) = \|x_i - x_j\|_M = \sqrt{(x_i - x_j)^T M (x_i - x_j)}, \tag{3}$$
where $M \succeq 0$. Xing et al. introduced one of the earliest distance metric learning methods using both positive and negative constraints [6]. They posed distance metric learning as the following convex optimization problem:
$$\min_{M} \sum_{(x_i, x_j) \in S} \|x_i - x_j\|_M^2 \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in D} \|x_i - x_j\|_M \ge 1, \; M \succeq 0, \tag{4}$$
where $S$ was the set of positive constraints and $D$ was the set of negative constraints. The optimal metric was found by minimizing the distances between data points in affinity-link constraints and simultaneously maximizing the distances between data points in apart-link constraints. Xing et al. [6] used gradient descent and the idea of iterative projection to solve the problem (4). Although the optimization problem was convex, it was hard to solve, and the solution introduced in [6] was slow and somewhat unstable [8].
Xiang et al. [39] introduced the trace-ratio objective function (with the constraint $W^T W = I$) as a more appropriate objective function:
$$W^{*} = \arg\max_{W^T W = I} \frac{\operatorname{tr}(W^T S_b W)}{\operatorname{tr}(W^T S_w W)}. \tag{5}$$
Here $S_w$ and $S_b$ denote the scatter matrices computed from the positive and the negative constraints, respectively. However, this problem cannot be directly solved by eigenvalue decomposition approaches. To solve the problem (5), Xiang et al. [39] constructed an iterative framework, in which a lower bound and an upper bound enclosing the optimum were estimated for initialization. Their method provides a heuristic search to solve the problem (5).
In this work, we propose a generalized form of similarity measure learning rather than dissimilarity measure learning and provide a closed-form solution of objective function with correlation similarity.

CEA.
Fu et al. [21] introduced correlation embedding analysis (CEA) for dimensionality reduction. First, two undirected weighted graphs, the intrinsic graph and the penalty graph, were constructed over the same set of data vertices, each with its own $n \times n$ weight matrix. The intrinsic graph characterizes data links that the algorithm favors and the penalty graph describes relationships that the algorithm tries to avoid. Then, a graph-preserving criterion (6) is imposed for these two objectives, weighted by the elements of the two weight matrices. It can be viewed as finding a transformation matrix $W$ in the linear transformation space of normalized samples, which leads to the rewritten formulation (7). This objective function is nonlinear and nonconvex. Fu et al. [21] used the gradient descent rule for optimization by differentiating the objective with respect to the matrix $W$. As pointed out in [21], the gradient descent may not descend deep enough to converge to a good solution when the dimension of the data space is too large. So the iterative process is sensitive to the initial point, although a method to find a good initialization was proposed. In this paper, we transform the problem (7) into another optimization problem which can be solved in closed form.

Correlation in Classification.
Next, we focus on the use of correlation in classification. Hardoon et al. [22] introduced canonical correlation analysis (CCA). It can be viewed as the problem of finding basis vectors for two sets of variables such that the correlations between the projections of the variables are mutually maximized. If one set of variables is taken as class labels, CCA can be used to realize a supervised linear feature extraction and subsequent classification. It has been extended to a nonlinear version, kernel CCA, by the kernel trick. However, as pointed out in [23], there are some problems when it is used in classification, which limits its use in practice.
Ma et al. [23] introduced correlation discriminant analysis (CDA), which seeks a global linear transformation that maximizes the difference between the within-class and between-class correlations of the samples in the transformed space. In [23], its optimization problem (8) was also solved by a gradient-based optimization method. However, as pointed out in [23], the extension of CDA to kernel CDA is not easy to implement.

General Framework
Similarity measure learning (SML) is a general framework for the similarity measure learning problem. In the context of general supervised classification, the SML problem may be formulated as follows. We are given a labeled sample set $\{(x_i, y_i)\}$ with $n$ instances $\{x_i\}_{i=1}^{n} \in \mathbb{R}^{D}$, where $D$ is the feature dimension. The corresponding class labels are $\{y_i\}_{i=1}^{n} \in \{1, \ldots, c\}$, where $c$ is the number of classes. Suppose that the similarity measure between two arbitrary objects $x_i$ and $x_j$ is $s(x_i, x_j, \Theta)$, where $\Theta$ is a set of parameters to be learned. The goal of SML is to learn the parameter set $\Theta$ from the sample set $\{(x_i, y_i)\}$.
We now introduce the SML problem from the viewpoint of graph embedding. Let $G = \{(X, \Delta)\}$ be an undirected weighted graph with vertex set $X$ and relation matrix $\Delta \in \mathbb{R}^{n \times n}$. We define an intrinsic graph $G^{I} = \{(X, \Delta^{I})\}$, where $\Delta^{I} = [\delta^{I}_{ij}]_{n \times n}$, and a penalty graph $G^{P} = \{(X, \Delta^{P})\}$, where $\Delta^{P} = [\delta^{P}_{ij}]_{n \times n}$. The vertices of $G^{I}$ are the same as those of $G^{P}$, but the matrix $\Delta^{I}$ corresponds to the relations that are to be strengthened and the matrix $\Delta^{P}$ to the relations that are to be suppressed in the learning process.
Based on the above, we give the formal definition of similarity measure learning.
Definition 1. The similarity measure learning (SML) problem is to learn an optimal similarity measure $[s_{ij}]_{n \times n}$ from a collection of data points $X$ on a vector space $\mathbb{R}^{D}$ together with a set of intrinsic pairwise constraints and a set of penalty pairwise constraints. It can be formally formulated as an optimization problem (10) over $\Theta$, where $\Theta$ is a set of parameters to be learned and the objective $f$ is some function defined over the given data.
Inspired by graph embedding learning in dimensionality reduction, SML can be formulated as the following two objectives based on the graph-preserving criterion:
$$\max_{\Theta} \sum_{i,j} s(x_i, x_j, \Theta)\, \delta^{I}_{ij}, \qquad \min_{\Theta} \sum_{i,j} s(x_i, x_j, \Theta)\, \delta^{P}_{ij}. \tag{11}$$
There are several different ways to combine these two objectives into a single optimization problem [27]. In this work, we consider the difference-form formulation; namely,
$$\Theta^{*} = \arg\max_{\Theta} \sum_{i,j} s(x_i, x_j, \Theta)\,(\delta^{I}_{ij} - \delta^{P}_{ij}). \tag{12}$$
It can be seen from Definition 1 that the method proposed in the next section is also suitable for classification problems with pairwise constraints instead of labels.
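The difference-form combination of the two graph-preserving objectives can be evaluated directly for a fixed similarity function. The sketch below is only an illustration under stated assumptions: the function name `difference_objective`, the use of plain cosine similarity as the default measure, and the toy relation matrices are all hypothetical choices, not the paper's implementation.

```python
import math

def cosine(x, y):
    """Standard correlation (cosine similarity) of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def difference_objective(samples, intrinsic, penalty, similarity=cosine):
    """Sum of similarity(x_i, x_j) * (intrinsic[i][j] - penalty[i][j]) over all pairs."""
    n = len(samples)
    return sum(
        similarity(samples[i], samples[j]) * (intrinsic[i][j] - penalty[i][j])
        for i in range(n)
        for j in range(n)
    )

# Toy data (hypothetical): samples 0 and 1 should be similar, sample 2 should differ.
samples = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0]]
intrinsic = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
penalty = [[0, 0, 1], [0, 0, 1], [1, 1, 0]]

score = difference_objective(samples, intrinsic, penalty)
print(score)
```

A learning algorithm under this framework would search for the similarity parameters that make this score as large as possible; here the similarity is fixed, so the code only evaluates the criterion.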

Correlation Similarity Measure Learning
In this section, we introduce a generalized correlation measure. Based on the generalized correlation, an algorithm of SML, called correlation similarity measure learning (CSML), is proposed. It aims at learning a correlation similarity measure for classification. The details are summarized in Algorithm 1.

Objective Function.
"Correlation" is one of the most widely used measures of the similarity between two random variables. It is also termed normalized correlation, correlation coefficient, Pearson's correlation, or cosine similarity; hereafter we simply call it correlation. Two samples (e.g., images) are represented as two vectors $x$ and $y$ in a feature space, and then the standard form of correlation is
$$\operatorname{Corr}(x, y) = \frac{x^T y}{\sqrt{x^T x}\,\sqrt{y^T y}}. \tag{13}$$
In learning tasks, to make the similarity measure adaptable to the sample data, we define a generalized correlation.
Definition 2. The generalized correlation of random vectors $x$ and $y$ is defined as
$$\operatorname{Corr}_M(x, y) = \frac{x^T M y}{\sqrt{x^T M x}\,\sqrt{y^T M y}}, \tag{14}$$
where $M \in \mathbb{R}^{D \times D}$ is a symmetric positive semidefinite parameter matrix; that is, $M \succeq 0$.
So, in this paper, let $M = WW^T$, where $W = [w_1, w_2, \ldots, w_d] \in \mathbb{R}^{D \times d}$ and $d$ is a tunable parameter. In general, the matrix $M$ parameterizes a family of correlations on the vector space $\mathbb{R}^{D}$. Specifically, when $M$ is the identity matrix $I_{D \times D}$, the generalized correlation in (14) becomes the standard correlation.
This type of correlation measure assigns different importance to different features rather than treating them equally as the standard correlation coefficient does. It enhances the flexibility of the similarity measure, since the parameter matrix can adapt to the sample data.
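To make the definition concrete, the generalized correlation can be computed for a given symmetric positive semidefinite parameter matrix. This is a minimal sketch; the names `bilinear` and `generalized_correlation` and the toy matrices are assumptions chosen for illustration.

```python
import math

def bilinear(x, M, y):
    """Return x^T M y for a square matrix M given as nested lists."""
    return sum(x[i] * M[i][j] * y[j] for i in range(len(x)) for j in range(len(y)))

def generalized_correlation(x, y, M):
    """x^T M y / (sqrt(x^T M x) * sqrt(y^T M y)); M is assumed symmetric PSD."""
    denom = math.sqrt(bilinear(x, M, x)) * math.sqrt(bilinear(y, M, y))
    if denom == 0.0:
        raise ValueError("a sample has zero norm under M")
    return bilinear(x, M, y) / denom

identity = [[1.0, 0.0], [0.0, 1.0]]   # recovers the standard correlation
weighted = [[4.0, 0.0], [0.0, 1.0]]   # hypothetical M that emphasizes feature 0
x, y = [1.0, 0.0], [1.0, 1.0]

standard = generalized_correlation(x, y, identity)
reweighted = generalized_correlation(x, y, weighted)
scaled = generalized_correlation([3.0 * v for v in x], y, weighted)
print(standard, reweighted, scaled)
```

With the identity matrix the value is the usual cosine of the angle between the two vectors. Reweighting the features changes the value, while rescaling an input leaves it unchanged, which reflects the scale invariance of correlation.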
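Equation (14) can be rewritten in the equivalent form
$$\operatorname{Corr}_W(x, y) = \frac{\operatorname{tr}(W^T x y^T W)}{\sqrt{\operatorname{tr}(W^T x x^T W)\,\operatorname{tr}(W^T y y^T W)}}, \tag{15}$$
where tr(⋅) denotes the trace of a matrix. Substituting (15) into the optimization problem (12), we obtain the objective function
$$W^{*} = \arg\max_{W} \sum_{i,j} \frac{\operatorname{tr}(W^T x_i x_j^T W)}{\sqrt{\operatorname{tr}(W^T x_i x_i^T W)\,\operatorname{tr}(W^T x_j x_j^T W)}}\,(\delta^{I}_{ij} - \delta^{P}_{ij}), \tag{16}$$
where $\delta^{I}_{ij}$ and $\delta^{P}_{ij}$ are the weights of the intrinsic graph and the penalty graph, respectively.
[...] unambiguously labeled; there are no real-world objects in the application which belong to more than one class. Moreover, this structural representation can utilize the prior knowledge or supervised information in an alternative way, which will be discussed later.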

Global Constraints.
For the intrinsic graph, node $x_i$ and node $x_j$ are connected by an edge if $x_i$ and $x_j$ belong to the same class; otherwise they are not connected. For the penalty graph, the edge between $x_i$ and $x_j$ is constructed if $x_i$ and $x_j$ belong to different classes; otherwise it is not constructed. In our experiments, the global scheme is adopted.
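The global scheme can be sketched as a direct construction of the two relation matrices from class labels. The sketch below assumes unit edge weights and no self-loops, neither of which is specified in the text; the function name `global_graphs` is hypothetical.

```python
def global_graphs(labels):
    """Return (intrinsic, penalty) 0/1 relation matrices built from class labels."""
    n = len(labels)
    intrinsic = [[0] * n for _ in range(n)]
    penalty = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops (an assumption of this sketch)
            if labels[i] == labels[j]:
                intrinsic[i][j] = 1  # same class: relation to be strengthened
            else:
                penalty[i][j] = 1    # different classes: relation to be suppressed
    return intrinsic, penalty

labels = ["a", "a", "b", "b", "b"]
intrinsic, penalty = global_graphs(labels)
for row in intrinsic:
    print(row)
```

Every ordered pair of distinct samples lands in exactly one of the two graphs, so the scheme needs no neighborhood parameters.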

Local Constraints.
For the intrinsic graph, only pairs $x_i$ and $x_j$ from the same class are considered. Node $x_i$ and node $x_j$ are connected if $x_j$ is among the nearest neighbors of $x_i$ or lies within a circular neighborhood of $x_i$, where neighbors are determined by Euclidean distance. For the penalty graph, only pairs $x_i$ and $x_j$ from different classes are considered, and the edge between $x_i$ and $x_j$ is constructed under the same kind of neighborhood test. Here, the two neighbor counts and the two neighborhood radii are all tunable parameters. It is obvious that the local scheme is appropriate for unsupervised learning and semisupervised learning. It provides flexibility in practical applications. The latter scheme is adopted in our method.

Closed-Form Solution for CSML.
Motivated by [40], in this section, we give a closed-form solution for SML to avoid the iterative optimization over the high-dimensional space. For the optimization problem (12), we additionally introduce an orthogonality constraint; that is, $w_i^T w_j = 0$ for all $i \neq j$. The problem (12) may then be transformed into the following maximization problem:
where ( − ) is short for − . For simplicity, introduce the matrix notation In fact, is a weighted sum of covariance matrices of sample data. Next, will be computed, respectively. To obtain the best discriminant vector 1 , we introduce the following Lagrange function with multipliers : Considering 1 = 2 = ⋅ ⋅ ⋅ = = , compute the partial derivative of 1 with respect to 1 and set it to zero; then Here, 1 is the eigenvector of −1 associated with the largest eigenvalue.
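The step from the Lagrange function to an eigenvalue problem can be illustrated with a generic worked derivation. The symbols here are placeholders introduced only for this sketch: $A$ stands for a symmetric matrix in the objective, $B$ for a symmetric positive definite matrix in the constraint, and $w$ for the first discriminant vector. They are assumptions, not the paper's own notation.

```latex
\begin{aligned}
L(w, \lambda) &= w^{T} A\, w - \lambda \,\bigl(w^{T} B\, w - 1\bigr), \\
\frac{\partial L}{\partial w} &= 2 A w - 2 \lambda B w = 0
  \;\Longrightarrow\; B^{-1} A\, w = \lambda\, w, \\
w^{T} A\, w &= \lambda\, w^{T} B\, w = \lambda .
\end{aligned}
```

Under these assumptions the stationary points are eigenvectors of $B^{-1} A$, and the objective value at each of them equals the corresponding eigenvalue, so the maximizer is the eigenvector associated with the largest eigenvalue.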
To obtain other , we introduce the following Lagrange function with multipliers and : can be obtained by maximizing the above Lagrange function. As the above process, compute the partial derivative of with respect to and set it to zero: Multiply the two sides of (23) by ; then Thus represents the expression to be maximized. Considering (23), multiply its two sides successively by 1 −1 , . . . , −1 , and then obtain − 1 equations: The If we use matrix notations, The previous set of ( − 1) equations can be represented in a single matrix relationship: or in another form Let us multiply the two sides of (23) by −1 : This can be expressed using matrix notation as Including (28), we have Considering as the criterion to be maximized, is the eigenvector of and is associated with the largest eigenvalue of .

Singularity of .
To model the similarity measure, it is only necessary to obtain the eigenvectors associated with the largest eigenvalues to constitute $W = [w_1, \ldots, w_d]$. However, since the solution involves a matrix inverse, it cannot be applied when the matrix to be inverted is singular due to the small sample size problem. This problem occurs frequently in practice: in many applications, such as face recognition, text document classification, image retrieval, and cancer classification with gene expression profiling, the dimensionality of the sample features is extraordinarily high while the number of samples is comparatively small. When the number of samples is smaller than the number of features, the small sample size problem occurs. To handle this problem, the direct method is to replace the inverse with the pseudoinverse. However, this does not guarantee that the graph-preserving criterion is still optimized by the leading eigenvectors obtained with the pseudoinverse. Here, the problem is similar to that in LDA. For the singularity in LDA, there are several frequently used methods, which can be modified for SML. A common way is to add a small perturbation to the singular matrix to make it nonsingular [41]. The null subspace method and direct LDA [42] are both well known.
Another one is kernel Fisher's discriminant (KFD), which is a nonlinear extension of LDA. The maximum margin criterion (MMC) [43] changes the criterion from the fraction form into a difference form, which avoids the small sample size problem. In this work, we first employ PCA to reduce the dimensionality of the feature space to $n - 1$, where $n$ is the number of samples, and then apply SML on the dimensionality-reduced subspace.

Kernel CSML
CSML finds a global linear transformation matrix, although the graph with local constraints may capture local nonlinear properties. In many cases, the kernel trick is an efficient technique for extending a linear method to a nonlinear version.
In the feature space F, the generalized correlation similarity measure has the form Of course, it has the equivalent form Since, in the feature space F, lies in the linear combination of ( 1 ), ( 2 ), . . . , ( ), it can be defined as (37) The following Lagrange function with multipliers and is introduced: Comparing with the analysis of CSML, some notations are introduced: Note that the above notations are a little different from (26). Similarly, the final solution is obtained: 1 is the largest eigenvector of −1 ⋅ and is the largest eigenvector of the matrix Here, we note that the problem of the eigenvalue decomposition of (40) is ill-posed because the rank of the square matrix is less than or equal to − 1 and then is singular. To handle the singularity of , we simply add a small positive perturbation to , that is, replay by , where We set = 10 −3 in this work.
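The effect of a small positive perturbation on a singular matrix can be illustrated as follows. This sketch assumes the common ridge form, in which a small multiple of the identity matrix is added; the names `add_ridge`, `det2`, `K`, and `eps` are hypothetical, and the toy matrix is not from the paper.

```python
def add_ridge(K, eps=1e-3):
    """Return K + eps * I as a new matrix, leaving K unchanged."""
    n = len(K)
    return [[K[i][j] + (eps if i == j else 0.0) for j in range(n)] for i in range(n)]

def det2(A):
    """Determinant of a 2 x 2 matrix."""
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

K = [[1.0, 1.0], [1.0, 1.0]]   # rank 1, hence singular and not invertible
K_reg = add_ridge(K)           # eps = 1e-3, the value quoted in the text

print(det2(K), det2(K_reg))
```

The perturbed matrix has a nonzero determinant, so its inverse exists, at the price of a small bias that grows with the perturbation size.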

The Trace-Ratio, Ratio-Trace, and Trace-Difference.
It is known that the trace-ratio optimization problem is nonconvex and has no closed-form solution. CSML is a typical problem of this type. There have been several attempts to solve such problems. The most popular is to transform the trace-ratio problem into a ratio-trace problem. For (16), the corresponding ratio-trace form can be approximately solved with the generalized eigenvalue decomposition (GEVD) method, where $\lambda_k$ is the $k$th largest generalized eigenvalue and the associated eigenvector $w_k$ constitutes the $k$th column of the matrix $W$. Finally, $M = WW^T$ and the measure is learned. This is a suboptimal solution of the optimization problem (17) proposed in this paper. As pointed out in [44], despite the existence of a closed-form solution for the ratio-trace optimization problem, this approximation may sacrifice the potential classification capability of the derived low-dimensional feature spaces and is unstable for supervised classification. Guo et al. [45] converted the trace-ratio problem into a trace-difference one, which, however, is solved by an iterative algorithm. For a detailed analysis of these attempts, we refer the reader to the prior work [44,45].
In this work, an alternative approximate optimization problem and its solution are presented. The denominator of the original trace-ratio objective function is fixed, and then the numerator alone is maximized. In fact, the problem (16) can be approximated by a trace-difference problem whose objective function is the same as that in [45]. However, the subsequent operations of CSML are very different from those in [45]. CSML involves only eigenvalue decomposition, which is simpler and easier to understand.

Computational Complexity.
The computational cost of CSML mainly comes from two parts. The first part is graph construction, that is, connecting each sample with its nearest neighbors, and its computational cost is ( 2 ). The next part is the matrix eigenvalue decomposition, and its computational cost is ( 3 ). So the overall cost is ( 3 + 2 ). For comparison, Table 1 illustrates the computational costs of several distance metric learning and dimensionality reductions related to CSML, where is the number of iterations. We can see that Xing's method is most expensive on computational cost. Our approaches CSML and KCSML are both more efficient than other several related algorithms.

Experiments
To evaluate the proposed algorithms CSML and KCSML, in this section, we perform several image classification experiments on diverse databases and compare them with other popular related methods. The compared methods include principal component analysis (PCA), random subspace two-dimensional PCA (RS-2DPCA), linear discriminant analysis (LDA), locality preserving projection (LPP), marginal Fisher analysis (MFA) [27], correlation embedding analysis (CEA), correlation discriminant analysis (CDA), improved similarity measure-based graph embedding (ISM-GE) [46], and maximal similarity embedding (MSE) [47]. PCA is taken as a baseline method. RS-2DPCA represents the state of the art in unsupervised dimensionality reduction. LDA is a [...]

[Table 1: computational costs of the compared algorithms, including correlation embedding analysis (CEA) and correlation similarity measure learning (CSML).]

[...] pixels. The feature of each image is represented by a 1,024-dimensional column vector. A random subset of 2, 3, 4, 5, 6, or 7 labeled images per individual is taken to form the training set. The rest of the database is used as the testing set. For each training size, 50 random splits are generated. The results reported in Table 2 are the averages over the 50 splits.
From the comparisons in Table 2, we can observe that all the supervised methods outperform the unsupervised method PCA. This is expected, since class label information is exploited. We also see that CSML and KCSML both outperform the other competitive methods under all configurations, whether the training samples are sufficient or insufficient. This supports the claim that the proposed generalized correlation similarity measure can effectively capture the intrinsic affinity structure of the data. Further experimental results in [21] show that PCA, LDA, and LPP perform better with the correlation-based NN classifier. In our experiments, the correlation-based methods (CDA, CEA, CSML, and KCSML) outperform the other methods based on Euclidean distance. In most cases, the kernel extension of CSML is better than its original version. These results suggest that the correlation similarity measure is more effective in recognition tasks than Euclidean distance.

Classification on the CMU PIE Database.
The CMU PIE database contains 41,368 images of 68 people, each person under different poses, illumination conditions, and expressions. We select a subset that contains images under five near-frontal poses with different illuminations and expressions. There are about 170 images for each individual and 11,554 images in all. The images are cropped and resized to 32 × 32 pixels. As in the former experiment, each image is unfolded as a column vector. A random subset of 5, 10, 20, or 30 images per individual is taken to form the training samples. The results in Table 3 are also averages over 50 splits for each training size.
From Table 3, CSML and KCSML greatly outperform the other competitive methods. PCA still performs worst. The selected subset used in the experiment contains more than 10 thousand images. From the experimental results in Table 3, we can conclude that the proposed similarity measure is effective and reliable on large scale databases. It further demonstrates the ability of the generalized correlation measure to capture the intrinsic structure of high-dimensional data.

Classification on the MNIST Database.
The MNIST database consists of 60,000 handwritten digit images drawn from the larger NIST database. We randomly select 500 images for each digit, 5,000 images in total, from MNIST to constitute a smaller subset as our experimental database. The images have been normalized to 28 × 28 pixels. The feature of each digit is represented by a 784-dimensional vector. As in the former experiments, a subset of 50, 100, 150, 200, 250, or 300 images per digit is taken to form the training set, and all the remaining images are taken as testing samples. For each training size, 50 random splits are generated. The results in Table 4 are also averages over the 50 splits for each training size.
From the experimental results in Table 4, a similar conclusion can be drawn.
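The evaluation protocol used in these experiments (a fixed number of random training images per class, the remainder for testing, a 1-nearest neighbor classifier, and averaging over 50 random splits) can be sketched as follows. This is an illustration only: it uses the standard correlation rather than a learned measure, the toy data are invented, and all function names are hypothetical.

```python
import math
import random

def split_per_class(labels, n_train, rng):
    """Pick n_train random indices per class for training; the rest are test."""
    by_class = {}
    for idx, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(idx)
    train = []
    for lab in sorted(by_class):
        train.extend(rng.sample(by_class[lab], n_train))
    train_set = set(train)
    test = [i for i in range(len(labels)) if i not in train_set]
    return sorted(train), test

def cosine(x, y):
    """Standard correlation (cosine similarity) of two nonzero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def one_nn_error(features, labels, train, test):
    """Error rate of a 1-NN rule that picks the most similar training sample."""
    errors = 0
    for t in test:
        best = max(train, key=lambda i: cosine(features[t], features[i]))
        errors += labels[best] != labels[t]
    return errors / len(test)

def mean_error(features, labels, n_train, n_splits=50, seed=0):
    """Average the 1-NN error rate over n_splits random per-class splits."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_splits):
        train, test = split_per_class(labels, n_train, rng)
        rates.append(one_nn_error(features, labels, train, test))
    return sum(rates) / n_splits

# Toy data (hypothetical): two classes pointing in clearly different directions.
features = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0], [0.2, 0.9], [0.0, 1.0]]
labels = [0, 0, 0, 1, 1, 1]
avg = mean_error(features, labels, n_train=2)
print(avg)
```

Seeding the random generator keeps the splits reproducible, so that different methods can be compared on the same partitions.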

Effects of Parameter Selection.
In our proposed algorithm, the $k$-nearest neighbor search is applied twice. The first one is used for affinity graph construction in terms of the local constraints. The neighbor counts for the intrinsic graph and for the penalty graph can be different and are chosen empirically. Here, we assume that both are equal to $k_1$ to simplify the analysis. In the above experiments, we adopt the global scheme to construct the pairwise affinity graphs (intrinsic graph and penalty graph), avoiding tedious tuning work. The other one, denoted by $k_2$-NN, is used as the final classifier. For fairness, in the above experiments, all the compared methods uniformly use the same simple 1-nearest neighbor rule as the final classifier. In this subsection, to show more details of our proposed algorithm, we analyze the effects of these two types of parameters on the recognition performance. Figure 4 shows the error rate variations of CSML and KCSML with different $k_1$ on the three databases. The corresponding dimension numbers $d$ are set to the best ones selected in the corresponding experiments above. We find that the recognition performances of our proposed methods have a similar trend, where the error rate becomes approximately stable when the [...]

Figure 5 shows the best dimension number $d$ under different $k_2$ on the Yale database with "7 Train," the CMU PIE database with "30 Train," and the MNIST database with "300 Train." We find that the best $d$ varies very little as $k_2$ changes; that is, we can choose a similar $d$ for different $k_2$. This result suggests that, for a given dataset, its intrinsic dimension number is determined no matter which classifier is selected. However, this does not help parameter selection in practice. Generally, the parameter $d$ is set through time-consuming cross-validation tests guided by empirical experience. In the above experiments, all the parameters $d$ are chosen by threefold cross-validation tests within empirical value ranges.

Execution Time.
In our experiments, we also compare the computational efficiency of these algorithms. The CPU times of these methods are measured over fifty runs on Yale with 7, CMU PIE with 30, and MNIST with 300 training images per class.
The result in log scale is summarized in Figure 6. It shows that the execution times of CSML and KCSML are close to each other and both are far less than those of CEA and CDA, which have recognition rates comparable to those of our methods. This agrees with the theoretical analysis in Table 1. We conclude that our methods are more efficient than CEA and CDA.

Conclusion
In this paper, we have presented a general framework for similarity measure learning (SML). The proposed generalized correlation improves the flexibility of standard correlation. Based on the generalized correlation, a specific algorithm of SML, called CSML, and its kernel extension KCSML are proposed. Their objective functions are in trace-ratio form, which have no closed-form optimal solution. We transform the two objective functions to their approximate