A Fast Logdet Divergence Based Metric Learning Algorithm for Large Data Sets Classification

Large data sets classification is widely used in many industrial applications. It is a challenging task to classify large data sets efficiently, accurately, and robustly, as large data sets always contain numerous instances with high dimensional feature space. To deal with this problem, in this paper we present an online Logdet divergence based metric learning (LDML) model that exploits the power of metric learning. We first generate a Mahalanobis matrix by learning from the training data with the LDML model. Meanwhile, we propose a compressed representation for the high dimensional Mahalanobis matrix to reduce the computation complexity in each iteration. The final Mahalanobis matrix obtained this way measures the distances between instances accurately and serves as the basis of classifiers, for example, the $k$-nearest neighbors classifier. Experiments on benchmark data sets demonstrate that the proposed algorithm compares favorably with the state-of-the-art methods.


Introduction
Recently, large data sets classification has become one of the hottest research topics since it is the building block in many industrial and computer vision applications, such as fault diagnosis in complicated systems [1,2], automated optical inspection for complex workpieces [3], and face recognition in large-capacity databases [4]. In these large data sets, there are usually numerous instances represented in high dimensional feature spaces, which makes the problem of large data sets classification very difficult.
Various classification algorithms have been intensively explored, including Fisher's linear discriminant, support vector machines, and $k$-nearest neighbor. However, these methods all rely on measuring the distance over the multidimensional feature space of instances accurately and robustly. Traditional distance metrics, including the Euclidean and $\ell_1$ distances, usually assign equal weights to all features and ignore the differences among these features, which is not practical in real applications. In fact, these features may have different relevance to the category of instances. Some of them correlate strongly with the label of instances while others correlate weakly or not at all. Therefore, an appropriate distance or similarity metric which can build the relationship between the feature space and the category of instances should be learned to measure the divergence among instances. Metric learning is a popular approach to accomplish such a learning process. In this paper we select the Mahalanobis distance as the distance metric between instances.
The Mahalanobis distance is a standard distance metric parameterized by a positive semidefinite (PSD) matrix $A$. Given a data set $\{x_i\}$, with $x_i \in \mathbb{R}^d$, $i = 1, 2, \ldots, n$, the square Mahalanobis distance between instances $x_i$ and $x_j$ is defined as

$$d_A(x_i, x_j) = (x_i - x_j)^T A (x_i - x_j). \quad (1)$$

The Mahalanobis distance satisfies all the conditions of metric definitions, including (1) nonnegativity, $d_A(x_i, x_j) \ge 0$; (2) symmetry, $d_A(x_i, x_j) = d_A(x_j, x_i)$; (3) triangle inequality, $d_A(x_i, x_j) + d_A(x_j, x_k) \ge d_A(x_i, x_k)$; and (4) identity of indiscernibles, $d_A(x_i, x_j) = 0$ if and only if $x_i = x_j$. In the case that $A = I$, where $I$ is an identity matrix, the Mahalanobis distance degenerates to the Euclidean distance.

The Mahalanobis distance has some considerable advantages over other metrics. Firstly, the Mahalanobis distance is scale invariant, which means that the scale of the features has no influence on the performance of classification or clustering. Secondly, this metric takes into account the correlations of different features. In general, the off-diagonal elements of the Mahalanobis matrix are not zero, which helps build a more accurate relationship among instances. When we apply singular value decomposition to the Mahalanobis matrix, it can be decomposed as $A = V \Sigma V^T$. Here, $V$ is a unitary matrix which satisfies $V V^T = I$; the right unitary factor is the transpose of the left one due to the symmetry of the Mahalanobis matrix $A$. $\Sigma$ is a diagonal matrix which contains all the singular values. Thus, the square Mahalanobis distance can be rewritten as

$$d_A(x_i, x_j) = (x_i - x_j)^T V \Sigma V^T (x_i - x_j) = (V^T x_i - V^T x_j)^T \Sigma (V^T x_i - V^T x_j). \quad (2)$$

From (2) we can see that the Mahalanobis distance has two main functions. The first one is to find the best orthogonal matrix $V$ to remove the couplings among features and build new features. The second one is to assign weights $\Sigma$ to the new features. These two functions enable the Mahalanobis distance to measure the distance between instances effectively.

Learning such a Mahalanobis distance is a complex procedure. Several classical metric learning algorithms such as probabilistic global distance metric learning (PGDM) [5], large margin nearest neighbor (LMNN) [6], and information-theoretic metric learning (ITML) [7] have been proposed to learn the Mahalanobis distance. However, these algorithms are computationally inefficient for large data sets, and it is difficult to accelerate the learning process with large data sets. In practice, there are two main challenges in scalability for large data sets. The first one is that the data sets may contain thousands of instances. To avoid local minima in the metric learning process, as many useful instances as possible should be used in training. This leads to low computation efficiency in metric learning. The second one is that the dimensionality of the data may be very large. The number of parameters involved in the metric learning problem is $O(\min(n^2, d^2))$, where $n$ is the number of training instances and $d$ is the dimensionality of the data. Thus, the running time for training the Mahalanobis distance depends quadratically on the number of dimensions. At the same time, estimating a quadratic number of parameters also poses a new challenge [8].
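The definition of the square Mahalanobis distance and its rotated, reweighted form can be checked numerically. The following is a minimal sketch, not the authors' code: the function names and the random test matrix are assumptions made for illustration, and for a symmetric PSD matrix the eigendecomposition is used in place of the SVD mentioned in the text, since the two coincide in that case.

```python
import numpy as np

def mahalanobis_sq(x_i, x_j, A):
    """Squared Mahalanobis distance (x_i - x_j)^T A (x_i - x_j)."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ A @ diff)

def mahalanobis_sq_rotated(x_i, x_j, A):
    """Same distance, computed as a weighted Euclidean distance in rotated features."""
    # For a symmetric PSD matrix, A = V diag(s) V^T from eigh matches the SVD form.
    s, V = np.linalg.eigh(A)
    diff = V.T @ (np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    return float(np.sum(s * diff ** 2))

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 4))
A = G @ G.T                      # random symmetric PSD matrix (illustrative only)
x, y = rng.standard_normal(4), rng.standard_normal(4)

d_direct = mahalanobis_sq(x, y, A)
d_rotated = mahalanobis_sq_rotated(x, y, A)
d_identity = mahalanobis_sq(x, y, np.eye(4))   # A = I gives the squared Euclidean distance
```

The two routes agree up to floating-point error, which is the content of the decomposition: rotate the features first, then weight each new feature by its singular value.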
In dealing with the challenge from numerous instances, we find online metric learning to be a good solution. Online metric learning methods have two major advantages over traditional offline methods. First, in many practical applications, the system can only receive several instances or constraints at a time, and the desired Mahalanobis distance should be updated gradually over time. For example, in a process control system [9][10][11], various sensors are utilized to collect a group of feature data at one time, which may influence the Mahalanobis distance used in fault detection. In this situation, online metric learning can be used to address the need for updating the Mahalanobis distance. Second, some offline applications with numerous instances can be converted to online metric learning problems. Compared with offline learning, online metric learning reduces the running time dramatically, as the Mahalanobis distance is optimized step by step rather than calculated all at once. There are some online metric learning methods in the literature, including the pseudometric online learning algorithm (POLA) [12], the online ITML algorithm [7], and the Logdet exact gradient online metric learning algorithm (LEGO) [13]. However, existing methods usually suffer from a number of drawbacks. POLA involves an eigenvector extraction process in each step, which means a large computation load, especially with a high dimensional feature space. Although online ITML is faster than POLA, its improvement in computation efficiency is accompanied by a loss in performance, as the loss bounds of online ITML depend on the training data. LEGO improves on online ITML and achieves both high precision and fast speed. However, in the case of a high dimensional feature space, LEGO fails to reduce the computational complexity at each step effectively. Thus, the LEGO algorithm cannot adequately solve the problem described in the second challenge.
To address the challenges and opportunities raised by larger data sets, this paper proposes a new metric learning strategy. First of all, we describe a novel online Logdet divergence based metric learning model which uses triplets as the training constraints. This model is shown to perform better than traditional metric learning algorithms in both precision and robustness. Then, to reduce the computational complexity in each iteration, a compressed representation for the high dimensional Mahalanobis matrix is proposed. A low-rank Mahalanobis matrix is utilized to represent the original high dimensional Mahalanobis matrix in the metric learning process. As a result, the proposed algorithm solves the problems raised by numerous instances as well as by a high dimensional feature space.
The remainder of this paper is organized as follows. In Section 2, the proposed online Logdet divergence based metric learning model is presented. Then, the method of compressed representation for the high dimensional Mahalanobis matrix is described in Section 3. Section 4 reports the experimental results on the UCI machine learning repository to demonstrate the effectiveness of the proposed algorithm. Finally, we draw conclusions and point out future directions in Section 5.

Online Logdet Divergence Based Metric Learning Model
In the metric learning process, most successful results rely on having access to all useful instances or constraints in the whole data set. However, in some real applications, we cannot obtain all the instances at one time. For example, if there are too many instances in the training set, reading in all the data may exceed the memory of the computer. Another example is that some online applications only provide several instances or constraints at one time. Therefore, we desire a metric learning model which can update the Mahalanobis distance gradually as instances or constraints are received. Thus, our metric learning framework is to solve the following iterative minimization problem:

$$A_{t+1} = \arg\min_{A \succeq 0} D_{ld}(A, A_t) + \eta_t \ell(A), \quad (3)$$

where $\eta_t > 0$ is a regularization parameter which balances the regularization function $D_{ld}(A, A_t)$ and the loss function $\ell(A)$.
In this framework, the first item $D_{ld}(A, A_t)$ is a regularization function which is used to guarantee the stability of the metric learning process. The function $D_{ld}(\cdot, \cdot)$ represents the Logdet divergence [14]:

$$D_{ld}(A, A_t) = \operatorname{tr}(A A_t^{-1}) - \log\det(A A_t^{-1}) - d, \quad (4)$$

where $d$ is the dimension of $A$. There are several advantages when using the Logdet divergence to regularize the metric learning process. First, the Logdet divergence between covariance matrices is equivalent to the Kullback-Leibler divergence between the corresponding multivariate Gaussian distributions [15]. Second, the Logdet divergence is invariant under general linear group transformations, that is, $D_{ld}(A, A_t) = D_{ld}(S^T A S, S^T A_t S)$, where $S$ is an invertible matrix [7]. These desirable properties make the Logdet divergence very useful in metric learning, and the algorithm proposed in this paper is called Logdet divergence based metric learning (LDML).
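The Logdet divergence and its invariance under an invertible linear transformation can be illustrated numerically. This is a minimal sketch under stated assumptions: the function name `logdet_divergence`, the random positive definite test matrices, and the transformation `S` are all made up for the illustration and are not taken from the paper.

```python
import numpy as np

def logdet_divergence(A, A_t):
    """D_ld(A, A_t) = tr(A A_t^{-1}) - log det(A A_t^{-1}) - d for positive definite inputs."""
    d = A.shape[0]
    M = A @ np.linalg.inv(A_t)
    _, logabsdet = np.linalg.slogdet(M)   # det(M) > 0 when A and A_t are positive definite
    return float(np.trace(M) - logabsdet - d)

rng = np.random.default_rng(1)
d = 5
G1, G2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = G1 @ G1.T + np.eye(d)                         # positive definite test matrix
A_t = G2 @ G2.T + np.eye(d)
S = rng.standard_normal((d, d)) + d * np.eye(d)   # assumed invertible for this draw

div = logdet_divergence(A, A_t)
div_transformed = logdet_divergence(S.T @ A @ S, S.T @ A_t @ S)
div_self = logdet_divergence(A, A)                # divergence of a matrix from itself
```

The divergence is zero only when the two matrices coincide, which is why it acts as a regularizer that keeps each update close to the previous Mahalanobis matrix.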
The second item in the framework, $\ell(A)$, is the loss function measuring the loss between the prediction distance $\hat{y}_t$ and the target distance $y_t$ at time step $t$. Obviously, when the total loss function $\mathcal{L}(A) = \sum_t \ell(\hat{y}_t, y_t)$ reaches its minimum, the obtained $A$ is the closest to the desired distance function. There are several methods to choose the prediction distance $\hat{y}_t$ and the target distance $y_t$. In the proposed framework, we select the triplet $\{x_i, x_j, x_k\}$, which represents that the instance $x_i$ is more similar to the instance $x_j$ than to the instance $x_k$, as the labels of the training samples. The prediction distance is $\hat{y}_t = d_{A_t}(x_i^t, x_k^t) - d_{A_t}(x_i^t, x_j^t)$ and the target distance is chosen as $y_t = \rho$. Thus the corresponding loss function is expressed as

$$\ell(A) = \max\bigl(0,\; \rho + d_A(x_i^t, x_j^t) - d_A(x_i^t, x_k^t)\bigr). \quad (5)$$

In this formulation, the triplets $\{x_i, x_j, x_k\}$, which represent proximity relationships, are used as constraints. The online ITML and LEGO algorithms both use pairwise constraints as training samples. If $(x_i, x_j)$ belong to the same category, the obtained Mahalanobis distance should satisfy $d_A(x_i, x_j) < u$, where $u$ is a desired upper limit of the distance among instances in the same category; if $(x_i, x_j)$ are dissimilar, the constraint for the Mahalanobis distance is $d_A(x_i, x_j) > v$, where $v$ is a desired lower limit of the distance among instances in different categories. Although the pairwise constraints are weaker than the class labels [16], they are still stronger than triplet constraints. The reason is that the distributions and the numbers of instances differ from category to category, but the desired upper limit $u$ and lower limit $v$ are the same for every category. Thus the Mahalanobis distance learned using the online ITML and LEGO algorithms would give conservative results in this situation. The work [17] has pointed out that triplet constraints can be derived from pairwise constraints, but not vice versa. Therefore, the triplet constraint is weaker as well as more natural than the pairwise constraint, and the corresponding online LDML algorithm can achieve more accurate results than the online ITML and LEGO algorithms.
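The triplet loss described above is a hinge on the gap between the two distances. The sketch below is illustrative only: the names `sq_dist`, `triplet_loss`, and `rho` (the target margin) are assumptions, and the identity matrix is used as a stand-in for a learned Mahalanobis matrix.

```python
import numpy as np

def sq_dist(A, x, y):
    """Squared Mahalanobis distance between x and y under the matrix A."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(diff @ A @ diff)

def triplet_loss(A, x_i, x_j, x_k, rho=1.0):
    """Hinge loss max(0, rho - y_hat), with y_hat = d_A(x_i, x_k) - d_A(x_i, x_j)."""
    y_hat = sq_dist(A, x_i, x_k) - sq_dist(A, x_i, x_j)
    return max(0.0, rho - y_hat)

A = np.eye(2)                                       # stand-in for a learned matrix
x_i, near, far = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]
loss_satisfied = triplet_loss(A, x_i, near, far)    # y_hat = 9 - 1 = 8 >= rho
loss_violated = triplet_loss(A, x_i, far, near)     # y_hat = 1 - 9 = -8 < rho
```

A triplet whose distance gap already exceeds the margin contributes no loss, so only violated triplets trigger an update of the Mahalanobis matrix.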

Compressed Representation for High Dimensional Mahalanobis Matrix
Although the online metric learning model can avoid semidefinite programming and reduce the amount of computation sharply, the computation complexity is restricted by the dimensionality of the feature space. As mentioned above, the number of parameters in the Mahalanobis matrix is quadratic in the dimensionality $d$. A Mahalanobis matrix with a large number of parameters will lead to inefficient computation in the metric learning process. To address this problem, we use the compressed representations method [8] to learn, store, and evaluate the Mahalanobis matrix efficiently.
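The Mahalanobis distance function $A$ with a full $d \times d$ matrix is constrained to be the sum of a high dimensional identity $I^d$ plus a low-rank symmetric matrix $A^L$, expressed as

$$A = I^d + A^L = I^d + U L U^T, \quad (6)$$

where $U \in \mathbb{R}^{d \times k}$ is an orthogonal basis and $L \in \mathbb{R}^{k \times k}$ is a symmetric matrix with $k \ll \min(n, d)$. Correspondingly, the Mahalanobis distance function at time step $t$ can be decomposed as $A_t = I^d + U L_t U^T$.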
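Theorem 1. $D_{ld}(A, A_t) = D_{ld}(B, B_t)$, where $B = I^k + L$ and

$$B_t = I^k + L_t.$$

Proof. First of all, we consider the first item in (4):

$$\operatorname{tr}(A A_t^{-1}) = \operatorname{tr}\bigl((I^d + U L U^T)(I^d + U L_t U^T)^{-1}\bigr) = \operatorname{tr}\bigl((I^d + U L U^T)(I^d - U L_t (I^k + L_t)^{-1} U^T)\bigr) = d + \operatorname{tr}\bigl(L - (I^k + L) L_t (I^k + L_t)^{-1}\bigr) = \operatorname{tr}(B B_t^{-1}) + d - k,$$

where the second equality follows from the Woodbury matrix identity (together with $U^T U = I^k$) and the third equality follows from the fact that $\operatorname{tr}(MN) = \operatorname{tr}(NM)$.

Then, the second item in (4) can be converted as

$$\log\det(A A_t^{-1}) = \log\det\bigl((I^d + U L U^T)(I^d + U L_t U^T)^{-1}\bigr) = \log\frac{\det(I^d + U L U^T)}{\det(I^d + U L_t U^T)} = \log\frac{\det(I^k + L)}{\det(I^k + L_t)} = \log\det(B B_t^{-1}),$$

where the second equality follows from the fact that $\det(M N^{-1}) = \det(M)/\det(N)$, and the third equality follows from the fact that $\det(I^d + MN) = \det(I^k + NM)$ for all $M \in \mathbb{R}^{d \times k}$ and $N \in \mathbb{R}^{k \times d}$. Thus, we can get the following equation:

$$D_{ld}(A, A_t) = \operatorname{tr}(B B_t^{-1}) + d - k - \log\det(B B_t^{-1}) - d = \operatorname{tr}(B B_t^{-1}) - \log\det(B B_t^{-1}) - k = D_{ld}(B, B_t),$$

hence proved.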
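Theorem 1 can be sanity-checked numerically on random inputs. The sketch below is an illustration rather than part of the proof: it assumes that the basis has orthonormal columns (built here with a QR factorization, which is a choice made only for this example) and that the low-rank parts are PSD so that all matrices are positive definite; all names are hypothetical.

```python
import numpy as np

def logdet_divergence(A, A_t):
    """tr(A A_t^{-1}) - log det(A A_t^{-1}) - dim for positive definite inputs."""
    n = A.shape[0]
    M = A @ np.linalg.inv(A_t)
    return float(np.trace(M) - np.linalg.slogdet(M)[1] - n)

def random_psd(k, rng):
    """Random symmetric PSD matrix used as a stand-in for the low-rank part."""
    G = rng.standard_normal((k, k))
    return G @ G.T

rng = np.random.default_rng(2)
d, k = 30, 4
U, _ = np.linalg.qr(rng.standard_normal((d, k)))   # d x k with orthonormal columns
L, L_t = random_psd(k, rng), random_psd(k, rng)

A = np.eye(d) + U @ L @ U.T          # full d x d matrices
A_t = np.eye(d) + U @ L_t @ U.T
B, B_t = np.eye(k) + L, np.eye(k) + L_t   # compressed k x k matrices

full_div = logdet_divergence(A, A_t)
compressed_div = logdet_divergence(B, B_t)
```

Both divergences agree up to rounding, while the compressed one only touches $k \times k$ matrices.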
From Theorem 1 we can see that, if we build a relationship between $A$ and $B$ using the orthogonal basis $U$, learning a low dimensional symmetric matrix $B$ is equivalent to learning the original Mahalanobis distance function $A$. The advantage of this method is that the computational complexity of each iteration decreases significantly. Thus, it is worthwhile to derive the updating formulation of $B_t$ in order to evaluate the true Mahalanobis distance function $A_t$.
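Assume that $z = U^T x$ represents the reduced-dimensional data under the orthogonal basis $U$; then we can get the corresponding variables $p_t = U^T (x_i^t - x_j^t) = z_i^t - z_j^t$ and $q_t = U^T (x_i^t - x_k^t) = z_i^t - z_k^t$. Thus, the loss function can be rewritten as

$$\ell(B) = \max\bigl(0,\; p_t^T B p_t - q_t^T B q_t + \xi_t\bigr),$$

where $\xi_t = (x_i^t - x_j^t)^T (x_i^t - x_j^t) - (x_i^t - x_k^t)^T (x_i^t - x_k^t) - p_t^T p_t + q_t^T q_t + \rho$. The function $D_{ld}(B, B_t) + \eta_t \ell(B)$ reaches its minimum when its gradient is zero. Thus, we get the following equation by setting the gradient of (3) to zero with respect to $B$:

$$B_{t+1} = \bigl(B_t^{-1} + \eta_t (p_t p_t^T - q_t q_t^T)\bigr)^{-1}. \quad (12)$$

Since matrix inversion is computationally very expensive, we apply the Sherman-Morrison inverse formula to solve (12) without an explicit inverse. The standard Sherman-Morrison formula is

$$(M + u v^T)^{-1} = M^{-1} - \frac{M^{-1} u v^T M^{-1}}{1 + v^T M^{-1} u}.$$

However, in our updating equation, there are two items which are outer products of vectors. To solve this problem, we assume that $\Gamma_t = (B_t^{-1} + \eta_t p_t p_t^T)^{-1}$, and (12) is split into two standard Sherman-Morrison inverse problems:

$$\Gamma_t = (B_t^{-1} + \eta_t p_t p_t^T)^{-1}, \qquad B_{t+1} = (\Gamma_t^{-1} - \eta_t q_t q_t^T)^{-1}.$$

Applying the Sherman-Morrison formula, we arrive at an analytical expression for $B_{t+1}$:

$$\Gamma_t = B_t - \frac{\eta_t B_t p_t p_t^T B_t}{1 + \eta_t p_t^T B_t p_t}, \qquad B_{t+1} = \Gamma_t + \frac{\eta_t \Gamma_t q_t q_t^T \Gamma_t}{1 - \eta_t q_t^T \Gamma_t q_t}.$$

The corresponding Mahalanobis distance function is evaluated as $A_t = I^d + U (B_t - I^k) U^T$. Using the compressed representations technique, the computational complexity reduces from $O(\min(n^2, d^2))$ to $O(k^2)$ per iteration. At the same time, the storage space needed in the metric learning process is also reduced sharply.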
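The two-stage Sherman-Morrison update can be written in a few lines and compared against a direct matrix inverse. This is a minimal sketch, assuming a symmetric positive definite compressed matrix; the function name `ldml_step`, the variable names, and the conservative step size used in the demo are hypothetical and are not the paper's choice of regularization parameter.

```python
import numpy as np

def ldml_step(B_t, p, q, eta):
    """One closed-form update B_{t+1} = (B_t^{-1} + eta (p p^T - q q^T))^{-1}."""
    # Stage 1: rank-one update for the + eta p p^T term (B_t is symmetric).
    Bp = B_t @ p
    gamma = B_t - eta * np.outer(Bp, Bp) / (1.0 + eta * (p @ Bp))
    # Stage 2: rank-one update for the - eta q q^T term.
    gq = gamma @ q
    denom = 1.0 - eta * (q @ gq)
    if denom <= 0.0:
        raise ValueError("eta too large: the update would not stay positive definite")
    return gamma + eta * np.outer(gq, gq) / denom

rng = np.random.default_rng(3)
k = 6
G = rng.standard_normal((k, k))
B_t = G @ G.T + np.eye(k)            # symmetric positive definite starting point
p, q = rng.standard_normal(k), rng.standard_normal(k)
eta = 0.5 / (q @ B_t @ q)            # conservative step size for this demo only

B_next = ldml_step(B_t, p, q, eta)
B_direct = np.linalg.inv(np.linalg.inv(B_t) + eta * (np.outer(p, p) - np.outer(q, q)))
```

Each stage costs only matrix-vector and outer products on $k \times k$ matrices, so no explicit inverse is formed during learning; the direct inverse appears here only as a reference for the check.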
There are several practical methods to build the basis [18]; one of the most efficient methods is to choose the first $k$ left singular vectors after applying the singular value decomposition (SVD) to the original training data $X$:

$$X = \tilde{U} \Sigma \tilde{V}^T,$$

where $\tilde{U}$ and $\tilde{V}$ are the left and right unitary matrices. The orthogonal basis $U$ is selected as

$$U = [\tilde{u}_1, \tilde{u}_2, \ldots, \tilde{u}_k].$$

This method is simple but time consuming. The computational complexity of singular value decomposition is ( 2  +  2 +  3 ) when  ≪ . In a large data set, both $n$ and $d$ are typically large. Thus, using the traditional SVD method will lead to enormous computation. Moreover, in this online metric learning model, the instances cannot all be obtained at the orthogonal basis building step, which is regarded as a preprocessing step of the proposed method. Therefore, an online SVD algorithm should be introduced in our framework. We use a truncated incremental SVD [19,20] to obtain the basis in our algorithm, which can reduce the computational complexity to ( 3 ); when  ≪ , the truncated incremental SVD algorithm will sharply reduce the computation time compared with the traditional SVD algorithm.

There are two main constraints for the regularization parameter $\eta_t$. First, it is used to make sure that $B_{t+1}$ is a PSD matrix in each iteration. When 0 <   < 1/tr(( −   ) −1        ), $B_{t+1}$ will be a PSD matrix if $B_t$ is a PSD matrix. This satisfies the first constraint. Second, it also controls the balance between the regularization function and the loss function. In this paper, we select $\eta_t = \alpha$/tr(( −   ) −1        ), where $\alpha$ is the learning rate parameter, which is chosen between 0 and 1. On one hand, if $\alpha$ is too large, $B_{t+1}$ will be updated mainly to minimize the loss function and satisfy the target relationship in the current triplet, which will lead to an unstable learning process. On the other hand, if $\alpha$ is too small, each iteration will have little influence on the updating of the Mahalanobis matrix. Thus, the metric learning process will be very slow and inefficient. Therefore, the selection of $\alpha$ should consider the tradeoff between efficiency and stability.
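Building the basis from the leading left singular vectors and evaluating distances through the compressed matrix can be sketched as follows. This example uses a plain batch SVD from NumPy rather than the truncated incremental SVD used in the paper, and the names `svd_basis` and `lift`, the data layout (instances as columns), and the random data are assumptions made for illustration.

```python
import numpy as np

def svd_basis(X, k):
    """First k left singular vectors of the d x n data matrix X (instances as columns)."""
    U_full, _, _ = np.linalg.svd(X, full_matrices=False)
    return U_full[:, :k]

def lift(B, U):
    """Recover the full matrix A = I_d + U (B - I_k) U^T from the compressed B."""
    d, k = U.shape
    return np.eye(d) + U @ (B - np.eye(k)) @ U.T

rng = np.random.default_rng(4)
d, n, k = 50, 200, 5
X = rng.standard_normal((d, n))
U = svd_basis(X, k)

G = rng.standard_normal((k, k))
B = np.eye(k) + G @ G.T              # stand-in for a learned compressed matrix
A = lift(B, U)

delta = X[:, 0] - X[:, 1]
p = U.T @ delta                      # difference vector in the reduced space
dist_full = float(delta @ A @ delta)                        # uses the d x d matrix
dist_compressed = float(delta @ delta - p @ p + p @ B @ p)  # uses only U and B
```

In practice the full matrix never has to be formed: distances can be evaluated from the basis and the compressed matrix alone, which is where the storage saving comes from.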

Experimental Results
In this section, we conduct experiments on several public domain data sets selected from the UCI machine learning repository (http://archive.ics.uci.edu/ml/) to show the superiority of the proposed online LDML algorithm and the relationship between the performance and the parameters. The parameters of these benchmarks are listed in Table 1. Some of them are of normal size while others have numerous instances or high dimension.
All the following experiments are run in MATLAB 2011b on a computer with an Intel(R) Core(TM) i3-3120M 2.50 GHz CPU, 4 GB RAM, and the Windows 7 operating system. The performance index is chosen as the classification accuracy of $k$-nearest neighbor. The performance of all these algorithms is evaluated using 5-fold cross validation, and the final results are the average of the results obtained over 5 runs. In our proposed algorithm, when a new instance is received, it is used to randomly build 2 triplets with instances which have been received before. Thus, the total number of triplets is $2n$. Meanwhile, the learning rate parameter is set as $\alpha = 1/2$, indicating that each triplet plays the same role in updating the Mahalanobis matrix.
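One possible way to build two triplets per newly received instance is sketched below. The paper does not spell out the sampling rule, so this is an assumption: the new instance is paired with one randomly chosen earlier instance of the same class and one of a different class. The function name `build_triplets` and the toy labels are hypothetical.

```python
import random

def build_triplets(new_index, labels, per_instance=2, rng=None):
    """Return (i, j, k) index triplets: j shares the label of i, k does not."""
    if rng is None:
        rng = random.Random(0)
    same = [i for i in range(new_index) if labels[i] == labels[new_index]]
    other = [i for i in range(new_index) if labels[i] != labels[new_index]]
    if not same or not other:
        return []            # not enough history yet to form a triplet
    return [(new_index, rng.choice(same), rng.choice(other))
            for _ in range(per_instance)]

labels = [0, 1, 0, 1, 1, 0]          # toy label stream, one label per arriving instance
rng = random.Random(42)
stream = [t for idx in range(len(labels))
          for t in build_triplets(idx, labels, rng=rng)]
```

The first instances of each class yield no triplets because no valid partner exists yet, so the count approaches two per instance only once both kinds of partners have been seen.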
The first experiment aims at illustrating the performance of the proposed compressed representation method in our online LDML algorithm. In this experiment, we use compressed representations with different dimensionality in the metric learning process. The experiments are, respectively, conducted on 3 selected data sets, including "WDBC," "Sonar," and "Ionosphere." The dimensions of these three data sets are 30, 60, and 34. In this test, the number of compressed dimensions varies from 1 to the maximum dimension of the data sets. The cross validation classification precision and the running time, which change with the number of dimensions, are recorded, and Figure 1 gives the relationship among these three items. We can see that the running time increases exponentially while the precision stays relatively constant once the number of dimensions reaches a certain value. The reason for this phenomenon can be explained as follows. Although the Mahalanobis matrix $A$ is used to build the relationship between features and the categories of instances, only a small part of the elements in the Mahalanobis matrix $A$ are significant. Thus, using a low-rank Mahalanobis matrix $A^L$ to represent the original $A$ is enough. It is worth noting that if the rank of $A^L$ is too low, the accuracy will decrease because $A^L$ does not have enough elements to retain all the important information in the metric learning process.
In the second experiment, we compare the proposed method with several basic classification methods and state-of-the-art online metric learning algorithms, including Euclidean distance, offline LDML [21], LEGO [13], online ITML [7], and POLA [12]. The experiments are, respectively, conducted on 6 data sets in the UCI machine learning repository, including "Iris," "Wine," "WDBC," "Seeds," "Sonar," and "Ionosphere." These data sets have normal dimension and number of instances. The testing results on cross validation classification accuracy for all data sets are summarized in Table 2. The results list the average and standard deviation of the cross validation classification accuracy over 5 runs. Meanwhile, the number of compressed dimensions in the proposed online LDML method is also presented in brackets. From the comparisons we can see that the proposed method outperforms the other online metric learning methods as well as the Euclidean distance. The precision and robustness of the proposed method are better than those of all other online metric learning methods. At the same time, compared with the offline LDML, the proposed method loses only a little precision but gains much efficiency. Table 3 illustrates the comparison of the running time of these methods. We can see that LEGO, online ITML, and the proposed method have comparable running times on the normal size data sets. However, our approach has a significant improvement in running time compared with the offline LDML method. Besides, the computational efficiency of the proposed method is also much higher than that of POLA.

Then, in the third experiment, tests are conducted on several large data sets to demonstrate the accuracy as well as the efficiency of the online LDML algorithm. We select 5 large data sets from the UCI machine learning repository, including "Letter-Recognition," "Spambase," "Isolet," "Semeion," and "Mfeat." From Table 1 we can see that some of them contain numerous instances while others have a high dimensional feature space. We mainly compare the proposed method with the LEGO algorithm. The average accuracy and the average running time are illustrated in Table 4. We can see that the precision of the proposed method is better than that of LEGO on all data sets, especially on the data set "Spambase." When it comes to the running time, the two methods behave quite differently on different data sets. The LEGO algorithm runs fast on "Letter-Recognition" but it has low efficiency on other data sets, including "Isolet" and "Mfeat." In contrast, the proposed method reduces the computational complexity on "Isolet" and "Mfeat" but does not run fast on "Letter-Recognition." The reason for these phenomena is not obvious, and further experiments have been conducted to explain these findings.
In the following experiments, we test the relationship among the number of instances, the number of dimensions, and the average running times. We first compare the changes of the running time when the number of instances increases. The experiment is conducted on the data set "Letter-Recognition" and the result is shown in Figure 2(a). Although the running time of both methods is linear in the number of instances, the running time of the proposed method is slightly longer than that of LEGO. The reason is that the online LDML requires computing the orthogonal basis $U$. Although we have applied the truncated incremental SVD to compute the orthogonal basis, the computational complexity of the truncated incremental SVD is ( 3 ), and it can reduce the running time sharply only when  ≪ . However, in the case of "Letter-Recognition," the  is 26 and  is 10. The truncated incremental SVD cannot work efficiently in this situation. Another experiment illustrates the changes of the running time when the number of dimensions increases. The experiment is conducted on the data set "Mfeat" and the result is shown in Figure 2(b). In this experiment, we gradually increase the feature dimension of the original data while online LDML uses a 20-dimensional compressed representation throughout. We can see that the running time of the proposed method stays at a very low value while that of LEGO increases with the square of the number of feature dimensions. Therefore, LEGO cannot deal with data sets with a large feature dimension. The proposed method can save much computation time in each iteration while keeping high classification performance. This is the main advantage of the online LDML algorithm.

Conclusion
In this paper we propose a fast and robust metric learning algorithm for large data sets classification. Since large data sets usually contain numerous instances represented in high dimensional feature spaces, we propose to use an online Logdet divergence based metric learning model to improve the computation efficiency of learning with thousands of instances. Furthermore, we use a compressed representation of high dimensional Mahalanobis matrices to reduce the computational complexity in each iteration significantly. The proposed algorithm is shown to be efficient, robust, and precise by experiments on benchmark data sets and comparison with state-of-the-art algorithms. In future work we plan to further optimize the proposed algorithm with respect to computation efficiency and precision.

Figure 1: The relationship among the number of dimensions, average classification precisions, and average running times. (a) The experiment results on data set "WDBC"; (b) the experiment results on data set "Ionosphere"; (c) the experiment results on data set "Sonar."

Figure 2: The relationship among number of instances, number of dimensions, and average running times. (a) The relationship between number of instances and average running times on data set "Letter-Recognition"; (b) the relationship between number of dimensions and average running times on data set "Mfeat."

When receiving a new triplet $\{x_i^t, x_j^t, x_k^t\}$ at time step $t$, if $d_{A_t}(x_i^t, x_k^t) - d_{A_t}(x_i^t, x_j^t) \ge \rho$, there is no loss when using the current $A_t$ to represent the relationship among these three instances; if $d_{A_t}(x_i^t, x_k^t) - d_{A_t}(x_i^t, x_j^t) < \rho$, the current $A_t$ should be updated to a better Mahalanobis distance to reduce the loss.

Table 1: Data sets used in the experiments.

Table 2: Cross validation classification accuracy comparison with the state-of-the-art metric learning methods on the normal size data sets.

Table 3: Running time (s) comparison with the state-of-the-art metric learning methods on the normal size data sets.

Table 4: Performance comparison with the state-of-the-art online metric learning methods on the large data sets.