Regularized Embedded Multiple Kernel Dimensionality Reduction for Mine Signal Processing

Traditional multiple kernel dimensionality reduction models are generally based on graph embedding and the manifold assumption. However, this assumption may be invalid for some high-dimensional or sparse data due to the curse of dimensionality, which degrades the performance of multiple kernel learning. In addition, some models may be ill-posed if the rank of the matrices in their objective functions is not high enough. To address these issues, we extend the traditional graph embedding framework and propose a novel regularized embedded multiple kernel dimensionality reduction method. Unlike the conventional convex relaxation technique, the proposed algorithm directly uses a binary search and an alternating optimization scheme to obtain optimal solutions efficiently. The experimental results demonstrate the effectiveness of the proposed method in supervised, unsupervised, and semisupervised scenarios.


Introduction
Dimensionality reduction (DR) methods in supervised, unsupervised, and semisupervised learning tasks have attracted much attention in computer vision and pattern recognition [1][2][3][4][5][6]. These methods are often used as feature extraction methods for high-dimensional signals from various application fields, such as transportation, communications, plants, and mines. Unsupervised dimensionality reduction, such as principal component analysis (PCA) [7], does not utilize any label information. Linear discriminant analysis (LDA) is a popular supervised dimensionality reduction method, which derives a projection by simultaneously maximizing the between-class scatter and minimizing the within-class scatter. Semisupervised dimensionality reduction, such as semisupervised discriminant analysis (SDA) [8], makes good use of labeled data while preserving the intrinsic geometric structures of unlabeled data.
In order to handle data sampled from a low-dimensional manifold, some nonlinear dimensionality reduction methods, such as isometric feature mapping (ISOMAP) [9], locally linear embedding (LLE) [10], and Laplacian Eigenmap (LE) [11], introduce the manifold assumption into dimensionality reduction and aim to maximally preserve certain interpoint relationships. However, these methods cannot address the out-of-sample extension problem. Thus, locality preserving projections (LPP), a linear approximation of LE [12], were proposed to both uncover the data manifold and provide out-of-sample extensions. These dimensionality reduction methods can be unified under a framework called graph embedding [13]. To achieve significant improvements, it is feasible to kernelize certain linear methods into nonlinear ones [14][15][16][17][18]. However, the performance of the kernelized versions relies heavily on the choice of kernel function, and an inappropriate kernel degrades performance.
Recently, the advantage of using multiple kernels instead of only one kernel for dimensionality reduction has been demonstrated [15,19]. Multiple kernel learning for dimensionality reduction (MKL-DR) was proposed to simultaneously learn an appropriate kernel from multiple base kernels and a transformation into a lower-dimensional space [20]. However, MKL-DR relaxes a nonconvex quadratically constrained quadratic programming (QCQP) problem into a semidefinite programming (SDP) problem, which is very time-consuming and has a negative effect on its performance. Recently, a multiple kernel learning method called MKL-TR was proposed to improve the performance of MKL-DR [21]. MKL-TR formulates multiple kernel learning for dimensionality reduction as a trace ratio maximization problem. However, both MKL-DR and MKL-TR need to iteratively compute the generalized eigendecomposition of dense matrices. Motivated by the efficiency of spectral regression, a fast multiple kernel dimensionality reduction method, termed MKL-SRTR, was presented to avoid the generalized eigendecomposition of dense matrices [22]. By virtue of spectral regression, it is more efficient than MKL-DR and MKL-TR. Since MKL-DR, MKL-TR, and MKL-SRTR are all based on graph embedding and the manifold assumption, they cannot cope with cases in which the manifold assumption is invalid. In addition, MKL-DR and MKL-SRTR may be ill-posed if the rank of the matrices in their objective functions is not high enough [21].
Computational Intelligence and Neuroscience
Since spectral clustering and multiple kernel dimensionality reduction share the same form of optimization based on the manifold assumption, and motivated by the spectral embedded clustering framework proposed in [26], we first extend the traditional graph embedding framework by incorporating linear regularization terms into its model, termed extended graph embedding (EGE). Second, we introduce multiple kernel learning into EGE (termed MKL-EGE) to improve the performance of single kernel DR. Compared with traditional multiple kernel dimensionality reduction methods, such as MKL-SRTR, the proposed method not only solves the ill-posed problem but is also more robust against high-dimensional or sparse data. Furthermore, our method directly uses a binary search and an alternating optimization scheme to obtain optimal solutions. The experimental results demonstrate that the proposed method achieves better or similar performance compared to other algorithms in supervised, unsupervised, and semisupervised settings.
The remainder of the paper is structured as follows. In Section 2, we briefly introduce the related work. We provide the MKL-EGE framework and the optimization process in Section 3. The experimental results are shown in Section 4. Finally, we give the related conclusions in Section 5. In order to avoid confusion, we give a list of the main notations used in this paper in Notations.

Graph Embedding and Its Extension
2.1. Graph Embedding. Specifically, denote an undirected weighted graph by $G = \{X, W\}$, where $X = [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ is the vertex set and $W \in \mathbb{R}^{n \times n}$ is the affinity matrix. Each entry $w_{ij}$ of the symmetric matrix $W$ is the edge weight that characterizes the similarity between a pair of vertices of $G$. A dimensionality reduction scheme aims at finding a low-dimensional representation $F = [f_1, f_2, \ldots, f_n]^T$ ($f_i \in \mathbb{R}^{p}$, $p \ll d$) of a complete graph $G$ whose vertices are over $X$. The purpose of graph embedding is to represent each vertex of the graph as a low-dimensional vector that preserves the similarities between vertex pairs. The optimal $F$ could be obtained by solving

$$\min_{F}\ \operatorname{tr}(F^T L F) \quad \text{s.t.}\quad \operatorname{tr}(F^T D F) = 1 \ \ \text{or}\ \ \operatorname{tr}(F^T L' F) = 1, \tag{1}$$

where $L = D - W$ is the graph Laplacian matrix of $G$ and $D = \operatorname{diag}(D_{11}, \ldots, D_{nn})$ is a diagonal matrix with the diagonal elements defined as $D_{ii} = \sum_{j=1}^{n} w_{ij}$. $L' = D' - W'$ is the graph Laplacian matrix of another weighted graph $G'$. By specifying $W$ and $D$ (or $W$ and $W'$), PCA, ISOMAP, LLE, LPP, LDA, local discriminant embedding (LDE), marginal Fisher analysis (MFA) [13], and spectral regression (SR) [23,24] can all be expressed as graph embedding. Since $L = D - W$ and the constraint $\operatorname{tr}(F^T D F) = 1$ are commonly used, in this paper we mainly discuss the following form of graph embedding:

$$\min_{\operatorname{tr}(F^T D F) = 1}\ \operatorname{tr}(F^T L F), \tag{2}$$

which usually relaxes to the following objective function:

$$\min_{F}\ \frac{\operatorname{tr}(F^T L F)}{\operatorname{tr}(F^T D F)}. \tag{3}$$
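As a concrete illustration of the graph embedding problem with the constraint tr(F^T D F) = 1, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a symmetric affinity matrix with positive degrees, solves the generalized eigenproblem L f = λ D f through the symmetric normalization D^{-1/2} L D^{-1/2}, and discards the trivial constant solution. The function name `graph_embedding` and the toy graph are hypothetical.

```python
import numpy as np

def graph_embedding(W, p):
    """Embed the n vertices of a graph in p dimensions from a symmetric affinity W."""
    deg = W.sum(axis=1)                       # degrees D_ii = sum_j w_ij
    L = np.diag(deg) - W                      # graph Laplacian L = D - W
    d_inv_sqrt = 1.0 / np.sqrt(deg)           # assumes every degree is positive
    # D^{-1/2} L D^{-1/2} turns L f = lam D f into a standard symmetric eigenproblem.
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    # Skip the trivial eigenvector (eigenvalue 0) and map back with f = D^{-1/2} u.
    F = d_inv_sqrt[:, None] * evecs[:, 1:p + 1]
    return F, evals[1:p + 1]

# Toy graph (hypothetical): two triangles joined by one weak edge.
W = np.array([[0, 1, 1, 0.0, 0, 0],
              [1, 0, 1, 0.0, 0, 0],
              [1, 1, 0, 0.1, 0, 0],
              [0, 0, 0.1, 0, 1, 1],
              [0, 0, 0.0, 1, 0, 1],
              [0, 0, 0.0, 1, 1, 0]], dtype=float)
F, evals = graph_embedding(W, 1)
print(F.ravel())
```

In this toy case the one-dimensional embedding assigns opposite signs to the two triangles, which is the similarity-preserving behavior described above. Each column of `F` satisfies f^T D f = 1 by construction.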

Extended Graph Embedding.
The term $\operatorname{tr}(F^T L F)$ in problem (4) is derived from the manifold assumption [25]. However, for high-dimensional or sparse data, this assumption may not hold because of the bias caused by the curse of dimensionality. In that case the similarity matrix is inaccurate and cannot capture the low-dimensional manifold structure, which degrades the performance of graph embedding.
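To address this issue, we improve the traditional graph embedding framework. Because the term $\operatorname{tr}(F^T L F)$ can be regarded as the objective function of spectral clustering, we use the spectral embedded clustering method proposed in [26] to extend the graph embedding framework. Specifically, we minimize the following objective function:

$$\min_{F, \widehat{W}, b}\ \operatorname{tr}(F^T L F) + \mu \left( \left\| X^T \widehat{W} + 1_n b^T - F \right\|_F^2 + \gamma \operatorname{tr}(\widehat{W}^T \widehat{W}) \right), \tag{4}$$

where $\mu$ and $\gamma$ are two regularization parameters, $1_n$ denotes the $n \times 1$ vector of all 1s, $\widehat{W} \in \mathbb{R}^{d \times p}$ and $b \in \mathbb{R}^{p}$ are the projection matrix and the bias vector of the linear regularization term (the hat distinguishes $\widehat{W}$ from the affinity matrix $W$), and the second term characterizes the mismatch between the low-dimensional feature matrix $F$ and the low-dimensional representation $X^T \widehat{W} + 1_n b^T$ of the data.

Proof. Assume that the data are centered, that is, $X 1_n = 0$. By setting the derivatives of the objective function (4) with respect to $\widehat{W}$ and $b$ to zero, we have

$$b = \frac{1}{n} F^T 1_n, \qquad \widehat{W} = (X X^T + \gamma I_d)^{-1} X F. \tag{6}$$

By substituting $\widehat{W}$ and $b$ in (4) by (6), the optimization problem (4) becomes

$$\min_{F}\ \operatorname{tr}\left( F^T (L + \mu L_g) F \right),$$

where $L_g = I_n - (1/n) 1_n 1_n^T - X^T (X X^T + \gamma I_d)^{-1} X$. This completes the proof of Theorem 1.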
From problem (5), we can see that the form of EGE is similar to that of GE and that GE is a special case of EGE when $\mu = 0$. The matrix $\tilde{L} = L + \mu L_g$ can be regarded as a correction of the graph Laplacian matrix $L$ for high-dimensional data.
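Since $L = D - W$, problem (5) can be transformed into the following form:

$$\max_{F}\ \frac{\operatorname{tr}\left( F^T (W - \mu L_g) F \right)}{\operatorname{tr}(F^T D F)}.$$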
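The correction of the graph Laplacian can be checked numerically. The sketch below is illustrative only: it writes `mu` and `gamma` for the two regularization parameters (names assumed here), centers the data explicitly, and builds the correction matrix L_g = H − Xc^T (Xc Xc^T + gamma I)^{-1} Xc with H the centering matrix. The Gaussian affinity and all sizes are hypothetical choices.

```python
import numpy as np

def ege_laplacian(X, L, mu, gamma):
    """Return (L + mu * L_g, L_g) for a d x n data matrix X whose columns are samples."""
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n           # centering matrix I - (1/n) 1 1^T
    Xc = X @ H                                    # centered samples, so Xc @ 1 = 0
    # G = (Xc Xc^T + gamma I_d)^{-1} Xc, computed without forming an explicit inverse
    G = np.linalg.solve(Xc @ Xc.T + gamma * np.eye(d), Xc)
    L_g = H - Xc.T @ G
    return L + mu * L_g, L_g

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 8))                      # d = 50 features, n = 8 samples
sq_dist = ((X[:, :, None] - X[:, None, :]) ** 2).sum(axis=0)
W = np.exp(-0.01 * sq_dist)                       # Gaussian affinity (hypothetical choice)
L = np.diag(W.sum(axis=1)) - W
L_tilde, L_g = ege_laplacian(X, L, mu=1.0, gamma=0.5)
print(np.linalg.eigvalsh(L_g).round(3))
```

For any F, tr(F^T L_g F) equals the minimum over the linear map and bias of the regression mismatch plus the ridge penalty, so L_g is positive semidefinite and annihilates the all-ones vector, like a graph Laplacian.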

Multiple Kernel Learning Based on EGE and Trace Ratio Maximization
Since MKL-DR, MKL-TR, and MKL-SRTR can be viewed as multiple kernel versions of graph embedding, it is natural to establish a multiple kernel learning framework for dimensionality reduction based on EGE.
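3.1. Formulation. Suppose the ensemble kernel $K$ is generated by linearly combining the base kernels $\{K_m\}_{m=1}^{M}$; that is,

$$K = \sum_{m=1}^{M} \beta_m K_m, \qquad \beta_m \ge 0.$$

We can find a sample coefficient matrix $A$ and a kernel weight vector $\beta$ by the following trace ratio optimization problem based on extended graph embedding:

$$\max_{A, \beta}\ \frac{\operatorname{tr}\left( A^T K (W - \mu L_g) K A \right)}{\operatorname{tr}\left( A^T K D K A \right)}, \tag{9}$$

where $\beta = [\beta_1, \ldots, \beta_M]^T$. It should be noted that dimensionality reduction based on trace ratio optimization tends to overfit [27,28]. To address this issue, a regularization term $\operatorname{tr}(A^T I A)$ is added to the denominator of problem (9), which ensures that $K D K + I$ is of full rank. Hence, the objective function could be expressed as follows:

$$\max_{A^T A = I,\ \beta}\ \frac{\operatorname{tr}\left( A^T K (W - \mu L_g) K A \right)}{\operatorname{tr}\left( A^T (K D K + I) A \right)}.$$

Compared with MKL-SRTR, the proposed method is based on the extended graph embedding framework. Thus, it is more robust against high-dimensional or sparse data. In addition, our method avoids ill-posed problems.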
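The ensemble kernel and the regularized trace ratio can be sketched in a few lines of NumPy. This is a sketch under stated assumptions rather than the paper's code: the RBF base kernels, their bandwidths, the weights `beta`, the affinity `W`, and the stand-in for the correction matrix `L_g` are all hypothetical, and `mu` denotes the weight of the correction term.

```python
import numpy as np

def rbf_kernel(X, sigma):
    """RBF kernel matrix between the columns of a d x n data matrix X."""
    sq = (X ** 2).sum(axis=0)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def ensemble_kernel(base_kernels, beta):
    """K = sum_m beta_m K_m for base kernels stacked as an (M, n, n) array."""
    return np.tensordot(beta, base_kernels, axes=1)

def regularized_trace_ratio(A, K, W, D, L_g, mu):
    """tr(A^T K (W - mu L_g) K A) / tr(A^T (K D K + I) A)."""
    S1 = K @ (W - mu * L_g) @ K
    S2 = K @ D @ K + np.eye(K.shape[0])       # the added identity keeps S2 full rank
    return np.trace(A.T @ S1 @ A) / np.trace(A.T @ S2 @ A)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 10))                  # d = 5, n = 10
base_kernels = np.stack([rbf_kernel(X, s) for s in (0.5, 1.0, 2.0)])
beta = np.array([0.2, 0.3, 0.5])              # nonnegative weights summing to 1
K = ensemble_kernel(base_kernels, beta)
W = base_kernels[1]                           # affinity matrix, chosen arbitrarily here
D = np.diag(W.sum(axis=1))
n = X.shape[1]
L_g = np.eye(n) - np.ones((n, n)) / n         # stand-in: the large-gamma limit of L_g
A = np.eye(n)[:, :2]                          # any matrix with orthonormal columns
ratio = regularized_trace_ratio(A, K, W, D, L_g, mu=0.1)
print(ratio)
```

Because K D K is positive semidefinite, every eigenvalue of the regularized denominator matrix is at least 1, which is the full-rank property the regularization term is meant to guarantee.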

Method.
To optimize our objective function, the following function that satisfies constraints (13)-(15) is defined: The optimal value $\lambda^*$ of the objective function in (15) is the root of the function $g(\lambda) = \max_{A^T A = I} \operatorname{tr}\left( A^T (K (W - \mu L_g - \lambda D) K - \lambda I) A \right)$ [27,28]. Based on (15), we update $\lambda$, $A$, and $\beta$ alternately.
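On Optimizing $A$ and $\lambda$. By fixing $\beta$, optimization problem (11) is simplified to

$$\max_{A^T A = I}\ \frac{\operatorname{tr}(A^T S_1 A)}{\operatorname{tr}(A^T S_2 A)},$$

where $S_1 = K (W - \mu L_g) K$ and $S_2 = K D K + I$, so that $g(\lambda) = \max_{A^T A = I} \operatorname{tr}(A^T (S_1 - \lambda S_2) A)$. Since $S_2$ is positive definite, $g(\lambda)$ is strictly decreasing in $\lambda$ and has a unique root. Thus, a binary search (given a lower bound and an upper bound) is used to seek $\lambda^*$ such that $g(\lambda^*) = 0$. The value of $g(\lambda)$ can be easily calculated as the sum of the $p$ largest eigenvalues of $S_1 - \lambda S_2$. The optimal $A^*$ is finally obtained by performing the eigenvalue decomposition of $S_1 - \lambda^* S_2$.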
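The binary search over the trace ratio value can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 1: `S1` is assumed symmetric and `S2` symmetric positive definite, `p` is the number of retained dimensions, and the choice of the two initial bounds is an assumption made here. The lower bound is the ratio attained by an arbitrary orthonormal matrix, and the upper bound is a value at which S1 − hi·S2 is negative semidefinite.

```python
import numpy as np

def g(lam, S1, S2, p):
    """g(lam) = max_{A^T A = I} tr(A^T (S1 - lam S2) A): sum of the p largest eigenvalues."""
    return np.linalg.eigvalsh(S1 - lam * S2)[-p:].sum()

def trace_ratio_bisection(S1, S2, p, tol=1e-10, max_iter=200):
    n = S1.shape[0]
    A0 = np.eye(n)[:, :p]
    # Lower bound: the ratio of any feasible A0 satisfies g(lo) >= 0.
    lo = np.trace(A0.T @ S1 @ A0) / np.trace(A0.T @ S2 @ A0)
    # Upper bound: S1 - hi * S2 is negative semidefinite, hence g(hi) <= 0.
    hi = max(np.linalg.eigvalsh(S1)[-1], 0.0) / np.linalg.eigvalsh(S2)[0]
    for _ in range(max_iter):
        lam = 0.5 * (lo + hi)
        if g(lam, S1, S2, p) > 0:
            lo = lam                           # root lies to the right
        else:
            hi = lam                           # root lies to the left
        if hi - lo < tol:
            break
    lam = 0.5 * (lo + hi)
    _, evecs = np.linalg.eigh(S1 - lam * S2)
    return lam, evecs[:, -p:]                  # eigenvectors of the p largest eigenvalues

rng = np.random.default_rng(0)
n, p = 12, 3
B = rng.normal(size=(n, n))
S1 = (B + B.T) / 2                             # symmetric test matrix (hypothetical)
C = rng.normal(size=(n, n))
S2 = C @ C.T + np.eye(n)                       # positive definite, like K D K + I
lam_star, A_star = trace_ratio_bisection(S1, S2, p)
print(lam_star, np.trace(A_star.T @ S1 @ A_star) / np.trace(A_star.T @ S2 @ A_star))
```

At the root, the trace ratio attained by the returned matrix coincides with the searched value, which gives a simple convergence check. Each bisection step costs one symmetric eigenvalue computation.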
On Optimizing $\beta$. By fixing $A$ and $\lambda$, $\beta$ can be obtained by solving the following optimization problem: We define a function $Q$ with given $A$ and $\lambda$ as follows: and we have Thus, $\beta$ can be determined by updating the projections of $\beta$ in the direction of $\partial Q / \partial \beta$. Finally, we define a quadratic programming problem to satisfy the constraint $\sum_{m=1}^{M} \beta_m = 1$ as where $1_M$ denotes the $M \times 1$ vector of all 1s.
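One way to realize a gradient-direction update of the kernel weights followed by a projection onto the sum-to-one constraint is sketched below. This is an assumption-laden illustration, not the paper's exact update: it takes Q(β) = tr(A^T (K M K − λI) A) with K = Σ_m β_m K_m and a fixed symmetric matrix `M_mat` (standing for the graph terms at the current λ), and it replaces the quadratic program by the standard Euclidean projection onto the probability simplex. The names `q_value`, `grad_beta`, `project_simplex`, `update_beta`, and the step size are hypothetical.

```python
import numpy as np

def q_value(base_kernels, beta, A, M_mat, lam):
    """Q(beta) = tr(A^T (K M K - lam I) A) with K = sum_m beta_m K_m."""
    K = np.tensordot(beta, base_kernels, axes=1)
    return np.trace(A.T @ (K @ M_mat @ K - lam * np.eye(K.shape[0])) @ A)

def grad_beta(base_kernels, beta, A, M_mat):
    """dQ/dbeta_m = 2 tr(A^T K_m M K A), valid for symmetric K_m and M."""
    KA = np.tensordot(beta, base_kernels, axes=1) @ A
    return np.array([2.0 * np.trace(A.T @ Km @ M_mat @ KA) for Km in base_kernels])

def project_simplex(v):
    """Euclidean projection of v onto {beta : beta >= 0, sum(beta) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def update_beta(base_kernels, beta, A, M_mat, step):
    """One projected gradient ascent step on Q."""
    return project_simplex(beta + step * grad_beta(base_kernels, beta, A, M_mat))

rng = np.random.default_rng(1)
n, p, M = 8, 2, 3
base_kernels = np.stack([B @ B.T for B in rng.normal(size=(M, n, n))])  # symmetric PSD
S = rng.normal(size=(n, n))
M_mat = (S + S.T) / 2                          # symmetric stand-in for the graph terms
A, _ = np.linalg.qr(rng.normal(size=(n, p)))   # orthonormal columns
beta = np.full(M, 1.0 / M)
beta_new = update_beta(base_kernels, beta, A, M_mat, step=1e-3)
print(beta_new, beta_new.sum())
```

The simplex projection is itself the solution of a small quadratic program (minimize the squared distance subject to nonnegativity and the sum constraint), which is why it is used here as a stand-in.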

Algorithms.
The proposed algorithm based on EGE and the regularized trace ratio, termed MKL-EGE, is described in Algorithm 1. As can be seen from Algorithm 1, MKL-EGE uses a binary search in the inner iterations to speed up convergence and updates $A$ and $\beta$ alternately in the outer iterations to seek optimal solutions. Since the proposed algorithm cannot guarantee obtaining the optimal solution $\lambda^*$ exactly, we terminate it after a maximum number of iterations and choose the best result.
where iter 1 is maximum number of outer iterations. MKL-DR needs to solve the SDP problem in each iteration, which is as high as ( 6.5 ) [20].
The computational complexity of MKL-TR decreases to (iter 1 (iter 2 ( 2 + 3 )+ 3 )) [21]. Since MKL-EGE only needs a small number of iterations to converge, the computational complexity of our method is much lower than that of MKL-DR and MKL-TR.

Unseen Sample Embedding.
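After the training procedure of MKL-EGE is complete, we can project a new sample $v$ into the learned subspace by

$$v \mapsto A^T \sum_{m=1}^{M} \beta_m k^{(m)}(v),$$

where $k^{(m)}(v) = [k_m(x_1, v), \ldots, k_m(x_n, v)]^T$ collects the values of the $m$-th base kernel between $v$ and the $n$ training samples.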
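The out-of-sample projection amounts to evaluating each base kernel between the new sample and the training samples, combining the results with the learned weights, and multiplying by the learned coefficient matrix. The sketch below assumes RBF base kernels with hypothetical bandwidths `sigmas`; `A` and `beta` are random placeholders for the learned quantities, not trained values.

```python
import numpy as np

def rbf(X, V, sigma):
    """Kernel values k(x_i, v_j) between the columns of X (d x n) and V (d x q)."""
    d2 = (X ** 2).sum(axis=0)[:, None] + (V ** 2).sum(axis=0)[None, :] - 2.0 * X.T @ V
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma ** 2))

def embed_unseen(V, X, A, beta, sigmas):
    """Project the columns of V into the learned p-dimensional subspace."""
    k_v = sum(b * rbf(X, V, s) for b, s in zip(beta, sigmas))   # n x q ensemble kernel values
    return A.T @ k_v                                            # p x q embedding

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 9))                   # training data: d = 4, n = 9
V = rng.normal(size=(4, 3))                   # three unseen samples
sigmas = (0.5, 1.0, 2.0)
beta = np.array([0.2, 0.3, 0.5])              # placeholder kernel weights
A = rng.normal(size=(9, 2))                   # placeholder sample coefficient matrix
Z_new = embed_unseen(V, X, A, beta, sigmas)
Z_train = embed_unseen(X, X, A, beta, sigmas)
print(Z_new.shape, Z_train.shape)
```

Applying the same function to the training samples reproduces the rows of K A, so training and unseen samples are embedded consistently.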

Experiments
We compared the proposed MKL-EGE algorithm with MKL-DR [20], MKL-TR [21], and MKL-SRTR [22]. We used libSVM [29] with a linear kernel to classify the embedded data. Each experiment was repeated independently 20 times. The mean classification accuracies and the standard deviations of the different algorithms are displayed in Table 2. As can be seen from Table 2, MKL-EGE outperforms MKL-DR, MKL-TR, and MKL-SRTR on most datasets, achieving the best recognition rate on 11 of the 13 datasets. In particular, the performance of MKL-EGE is much better than that of the other algorithms on high-dimensional datasets such as Yale, PIE, ORL, and COIL-20. This is because MKL-EGE incorporates EGE and linear regularization terms into its model, which is effective for handling high-dimensional data and can avoid overfitting. Consequently, MKL-EGE is more robust than the other algorithms, which are based on traditional graph embedding. In addition, the performance of MKL-EGE is very close to that of MKL-TR and MKL-SRTR on low-dimensional datasets, such as Ionosphere, which shows that the proposed method is effective for both low-dimensional and high-dimensional data. The performance of MKL-DR is the worst among all algorithms, which indicates that the SDP relaxation technique applied in MKL-DR has a negative influence on the performance of dimensionality reduction. The performance of MKL-TR is similar to that of MKL-SRTR, since MKL-SRTR only uses spectral regression to improve the speed of MKL-TR.
We used all samples from each class of ORL as training data and used the different algorithms to obtain the corresponding two-dimensional embedding results. To further validate and compare the final results of the different algorithms, we also tested them on PIE, which has the largest number of samples. The final embedding results are shown in Figures 1 and 2, respectively.
As can be seen from Figures 1 and 2, the embedded data obtained by MKL-DR, MKL-TR, and MKL-SRTR overlap more severely than those obtained by MKL-EGE. The embedded data obtained by MKL-EGE have the best separability, which demonstrates that MKL-EGE is more effective than the other algorithms for high-dimensional face data. Consequently, SVM classification based on MKL-EGE performs best among the compared algorithms.
To compare the computational time of the different algorithms, we used all data samples of each dataset as training data for the different multiple kernel dimensionality reduction methods. The results are displayed in Figure 3. From Figure 3, we can see that MKL-SRTR and MKL-EGE are much faster than MKL-DR and MKL-TR. Since MKL-EGE uses a binary search in the inner iterations to speed up convergence, it is only a little slower than MKL-SRTR, the difference being due to the eigenvalue decomposition of dense matrices. The convergence curves of MKL-EGE and MKL-SRTR are displayed in Figure 4. As can be seen from Figure 4, MKL-EGE converges faster than MKL-SRTR, because MKL-SRTR needs a predefined step length and does not adjust it adaptively in each iteration. To compare the approximation performance of the different algorithms, Figure 5 shows the histograms of the $g(\lambda)$ values obtained by all algorithms in 100 runs. As can be seen from Figure 5, compared with the other algorithms, the approximate solutions of MKL-EGE are more concentrated near zero, which indicates that our algorithm finds the root $\lambda^*$ more accurately. Overall, the proposed method is the most cost-efficient among all algorithms.

Experiments on Unsupervised Learning.
To evaluate the performance of MKL-EGE in unsupervised settings, we first used all algorithms to project the original data onto a subspace, where the normalized cut spectral clustering (NC) [30] algorithm was performed to evaluate the clustering performance. For MKL-TR, we set M = W and N = diag(W1), where W is the affinity matrix for MKL-EGE, MKL-SRTR, and MKL-DR. In the unsupervised case, we set the number of clusters as the number of classes in each dataset. In order to evaluate the clustering performance, the normalized mutual information (NMI) and Rand index (RI) [31] were adopted.
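The two clustering measures can be computed from the contingency table between the true classes and the predicted clusters. The NumPy sketch below is illustrative only. Several normalizations of mutual information are in use, and the geometric mean of the two entropies adopted here is an assumption, since the variant used in the experiments is not stated. The function names are hypothetical.

```python
import numpy as np

def contingency(labels_true, labels_pred):
    """Count matrix C[i, j] = number of samples in true class i and predicted cluster j."""
    _, ti = np.unique(np.asarray(labels_true).ravel(), return_inverse=True)
    _, pi = np.unique(np.asarray(labels_pred).ravel(), return_inverse=True)
    C = np.zeros((ti.max() + 1, pi.max() + 1))
    np.add.at(C, (ti, pi), 1)
    return C

def rand_index(labels_true, labels_pred):
    """Fraction of sample pairs on which the two partitions agree."""
    C = contingency(labels_true, labels_pred)
    n = C.sum()
    pairs = n * (n - 1) / 2
    same_both = (C * (C - 1) / 2).sum()
    same_true = (C.sum(axis=1) * (C.sum(axis=1) - 1) / 2).sum()
    same_pred = (C.sum(axis=0) * (C.sum(axis=0) - 1) / 2).sum()
    return (pairs + 2 * same_both - same_true - same_pred) / pairs

def nmi(labels_true, labels_pred):
    """Mutual information divided by sqrt(H(true) * H(pred)); 0.0 if either entropy is zero."""
    P = contingency(labels_true, labels_pred)
    P = P / P.sum()
    pu, pv = P.sum(axis=1), P.sum(axis=0)
    nz = P > 0
    mi = (P[nz] * np.log(P[nz] / np.outer(pu, pv)[nz])).sum()
    hu = -(pu * np.log(pu)).sum()
    hv = -(pv * np.log(pv)).sum()
    denom = np.sqrt(hu * hv)
    return mi / denom if denom > 0 else 0.0

y_true = [0, 0, 1, 1]
y_pred = [0, 0, 1, 2]
print(rand_index(y_true, y_pred), nmi(y_true, y_pred))
```

Both measures are invariant to a relabeling of the clusters, which is what makes them suitable for comparing a clustering against class labels.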
We used the same datasets and the same preprocessing procedure as in supervised learning experiments. For unsupervised MKL-DR, initializing A first obtained more stable performances. Thus, this strategy was adopted in the experiments. To obtain stable results, for each dataset, we computed the average results of each algorithm over 20 runs.
The values of NMI and RI obtained by these algorithms are reported in Tables 3 and 4, respectively. From Tables 3 and 4, we can see that MKL-EGE performs better than the other algorithms on most datasets, which demonstrates that EGE and the regularization terms improve the performance of dimensionality reduction. Consequently, MKL-EGE is able to find a more effective combination of base kernels in unsupervised settings. MKL-TR and MKL-SRTR evidently outperform MKL-DR, which indicates that the SDP relaxation used in MKL-DR also has a negative effect on the performance of dimensionality reduction in this case.
and 0 < ≤ 1 is the parameter to adjust the weight between the label information and unsupervised neighbor information. For MKL-TR, we set M = D − W and N = D − W . is set as 0.1 for all algorithms.
In semisupervised settings, the same datasets and parameter initialization were used. We randomly selected one-half of the training data as labeled data for each dataset. Each algorithm was run independently 20 times. The average classification accuracies as well as the standard deviations are reported in Table 5. As can be seen from Table 5, the proposed MKL-EGE algorithm performs better than MKL-SRTR, MKL-TR, and MKL-DR. Our proposed algorithm, which takes advantage of EGE and regularized trace ratio optimization, can automatically learn the weights of the base kernels and combine them to improve the performance of dimensionality reduction. Given the same prior information, the proposed algorithm achieves the best result on 10 of the 13 datasets compared with these state-of-the-art methods.
To visualize the semisupervised dimensionality reduction results, we used all samples from the first 10 classes of PIE and projected them into a two-dimensional subspace to generate a graphical representation, shown in Figure 6. From Figure 6, we can observe that the classes in the embedded data obtained by MKL-EGE and MKL-SRTR are separated more clearly than those obtained by MKL-DR and MKL-TR. The embedded data obtained by MKL-EGE have the best separability, which further indicates that MKL-EGE performs much better than the other algorithms in the semisupervised case.

Experiments on Real World Datasets.
To evaluate the effectiveness of MKL-EGE on real world datasets, we used it as a feature extraction method for bearing vibration signals, which were provided by bearing accelerometer sensors under different operating loads and bearing conditions from mines. The vibration signals were collected with a 16-channel digital audio tape (DAT) recorder at a sampling frequency of 12 kHz. Similar to the experimental settings in [35], the experimental vibration data were divided into four datasets, named $D_{\mathrm{IRF}}$, $D_{\mathrm{ORF}}$, $D_{\mathrm{BF}}$, and $D_{\mathrm{MIX}}$ and shown in Table 6, where "07," "14," "21," and "28" mean that the fault diameter is 0.007, 0.014, 0.021, and 0.028 inches, respectively. We used one-half of the vibration data as training samples and the other half as testing samples.
Similar to the experimental settings in [35], we first transformed the obtained vibration signals into 10 time domain features, 3 frequency domain features, and 16 time-frequency domain features. Second, low-dimensional features were extracted for bearing fault diagnosis or prognosis. Finally, SVM was used to evaluate the performance of the different DR methods. The first three extracted features, corresponding to the largest eigenvalues, were employed as the input features of SVM. The classification accuracy rates are reported in Table 7. It can be observed that MKL-EGE achieves much better results than the other algorithms on all datasets, which further demonstrates the effectiveness of our method for feature extraction of vibration signals in real applications.
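To make the first step concrete, the sketch below computes a handful of time domain statistics that are commonly used for vibration signals. The particular features shown (mean, RMS, peak, crest factor, kurtosis) are an assumption chosen for illustration. They are not claimed to be the 10 time domain features of [35], and the function name and the synthetic test signal are hypothetical.

```python
import numpy as np

def time_domain_features(x):
    """A few common time domain statistics of a 1-D vibration signal."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    rms = np.sqrt(np.mean(x ** 2))             # root mean square amplitude
    peak = np.max(np.abs(x))                   # largest absolute amplitude
    centred = x - mean
    var = np.mean(centred ** 2)
    kurtosis = np.mean(centred ** 4) / var ** 2  # non-excess kurtosis (3 for a Gaussian)
    return {"mean": mean, "rms": rms, "peak": peak,
            "crest_factor": peak / rms, "kurtosis": kurtosis}

fs = 12_000                                    # sampling frequency in Hz, as in the text
t = np.arange(1200) / fs                       # 0.1 s of samples
x = np.sin(2 * np.pi * 100 * t)                # synthetic 100 Hz sine, not real bearing data
features = time_domain_features(x)
print(features)
```

Impulsive bearing faults typically raise the kurtosis and crest factor relative to a smooth signal such as this sine, which is why statistics of this kind are often stacked into the feature vector before dimensionality reduction.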

Conclusion
In this paper, we propose a new multiple kernel dimensionality reduction method called MKL-EGE. By means of EGE and regularized trace ratio maximization, the proposed method not only avoids the SDP relaxation of MKL-DR but also further improves the performance of multiple kernel dimensionality reduction. Moreover, the proposed algorithm makes good use of the binary search and the alternating optimization scheme to efficiently find optimal solutions. Experimental results validate the effectiveness of this method. In the future, we plan to incorporate pairwise constraints into our framework and to explore multiple kernel dimensionality reduction via convex optimization.

Notations
$\mathbb{R}^d$: The input $d$-dimensional Euclidean space
$n$: The number of total data points
$c$: The number of classes that the samples belong to
$X$: $X = [x_1, \ldots, x_n] \in \mathbb{R}^{d \times n}$ is the training data matrix
$Y$: $Y = (y_1, \ldots, y_n) \in \mathbb{R}^{c \times n}$ is the 0-1 label matrix, where $y_i$ is the label vector of $x_i$
$k(x, y)$: Kernel function of data vectors $x$ and $y$
$K$: Kernel matrix $K = \{k(x_i, x_j)\} \in \mathbb{R}^{n \times n}$
$\{K_m\}_{m=1}^{M}$: Base kernels
$\beta$: $\beta = [\beta_1, \ldots, \beta_M]^T \in \mathbb{R}^{M}$, representing nonnegative coefficients of base kernels
$K$: The ensemble kernel $K = \sum_{m=1}^{M} \beta_m K_m$
$\operatorname{tr}(M)$: The trace of the matrix $M$, that is, the sum of the diagonal elements of the matrix $M$.