We introduce multiscale wavelet kernels to kernel principal component analysis (KPCA) to narrow down the search for the parameters required in the calculation of a kernel matrix. This new methodology incorporates multiscale methods into KPCA for transforming multiscale data. To illustrate the application of our proposed method, and to investigate the robustness of the wavelet kernel in KPCA under different levels of the signal-to-noise ratio and different types of wavelet kernel, we study a set of two-class clustered simulation data. We show that wavelet kernel PCA (WKPCA) is an effective feature extraction method for transforming a variety of multidimensional clustered data into data with a higher level of linearity among the data attributes, which improves the accuracy of simple linear classifiers. Based on the analysis of the simulation data sets, we observe that multiscale translation invariant wavelet kernels for KPCA have enhanced performance in feature extraction. The application of the proposed method to real data is also addressed.

The majority of the techniques developed in the field of computational mathematics and statistics for modeling multivariate data have focused on detecting or explaining linear relationships among the variables, such as in principal component analysis (PCA) [

PCA is often used for feature extraction in high dimensional data classification problems. The objective of PCA is to map the data attributes into a new feature space that contains better, that is, more linearly separable, features than those in the original input space. As the standard PCA is linear in nature, the projections in the principal component space do not always yield meaningful results for classification purposes. To address this problem, various kernel-based methods have been applied successfully in machine learning and data analysis (e.g., [

Recently, wavelet kernels have been successfully used in support vector machines (SVM) learning to classify data because of their high flexibility [

Much current research has been focused on the development of multiscale kernel methods, for example, [

Our work differs from those discussed above in that we focus on the construction of multiscale wavelet kernels for KPCA in data classification. We propose to use the multiscale kernels in the feature extraction step rather than in the classification step. This innovation aims at extracting a set of more linearly separable features, so that a simple classifier can be applied for classification. Our method incorporates multiscale methods into KPCA, making wavelet kernel PCA (WKPCA) perform well in extracting data features. We do not search for optimal values of the kernel parameters of a given kernel, which are often obtained by cross-validation methods. Instead, we focus on constructing multiscale wavelet kernels that are parameter free, and we investigate how each of these kernels performs in multiscale data classification.

This paper is organized as follows. In Section

The KPCA aims, for a given data set

In the above derivation we assumed that
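The derivation above is not reproduced here, but the standard KPCA projection step it leads to can be sketched in numpy. This is a minimal illustration, assuming the usual double-centering of the Gram matrix; the function and variable names are ours, not the paper's:

```python
import numpy as np

def kpca_train(K, n_components=2):
    """Project training data onto leading kernel principal components,
    given a precomputed n x n kernel (Gram) matrix K."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)
    # Center the kernel matrix; this corresponds to centering the
    # implicitly mapped data in feature space.
    Kc = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # eigh returns eigenvalues in ascending order; reverse to descending.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    # Keep components with numerically positive eigenvalues and scale the
    # eigenvectors so the feature-space eigenvectors have unit norm.
    keep = eigvals[:n_components] > 1e-12
    alphas = eigvecs[:, :n_components][:, keep] / np.sqrt(eigvals[:n_components][keep])
    return Kc @ alphas  # component scores of the training points
```

Because the Gram matrix is double-centered, the returned scores are mean-zero in each component.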

In this section a method of constructing a dot-product type wavelet kernel using multiple mapping functions is proposed. For a single-scale kernel (i.e., one with only one translation factor), the performance of KPCA in data classification may be affected both by the choice of kernel and by the choice of values for the kernel's parameters. The practical solution is first to investigate which kernels are appropriate for the data, and then to search for suitable parameter values of the selected kernel. When the data are multiscale, for example exhibiting nonstationarity in the mean or the variance, single-scale KPCA may not be a good choice of feature extraction method for data classification, due to the complex structure of the data.

The construction of kernels based on multiple mapping functions provides a framework for extending a single-scale kernel to a multiscale kernel in KPCA. Let

Using a mother wavelet function

Let

The proof of this theorem is provided in the Appendix.

As a special case, we obtain the single-scale dot-product wavelet kernel (SDWK)
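The paper's SDWK formula is not reproduced in this excerpt. As an illustration only, a single-scale dot-product wavelet kernel is commonly built as a product of mother wavelet evaluations over the coordinates, which makes it a rank-one (and hence trivially positive semidefinite) kernel. The Mexican hat wavelet, the scale parameter `a`, and the product form below are assumptions:

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet, up to a normalizing constant.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def sdwk(x, y, a=1.0):
    # Single-scale dot-product wavelet kernel (illustrative form):
    # K(x, y) = prod_i psi(x_i / a) * prod_i psi(y_i / a).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.prod(mexican_hat(x / a)) * np.prod(mexican_hat(y / a))
```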

Another type of single-scale kernel is a distance-based kernel known as the translation invariant (TI) kernel [
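A TI kernel depends only on the difference x - y. A minimal sketch of a single-scale TI wavelet kernel, again assuming the Mexican hat mother wavelet and a coordinate-wise product form (these are illustrative choices, not the paper's exact definition):

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet, up to a normalizing constant.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def ti_wavelet_kernel(x, y, a=1.0):
    # Translation invariant wavelet kernel: depends on x - y only,
    # K(x, y) = prod_i psi((x_i - y_i) / a).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.prod(mexican_hat((x - y) / a))
```

Since the Mexican hat wavelet is even and psi(0) = 1, this kernel is symmetric and K(x, x) = 1 for any x.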

The Morlet mother wavelet has recently been used in signal classification and compression [

The Morlet mother wavelet function is

The proof of this theorem is provided in the Appendix.
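A sketch of a TI kernel built from the Morlet mother wavelet. The real-valued form cos(omega0 t) exp(-t^2/2) and the value omega0 = 1.75 are common choices in the wavelet-kernel literature, assumed here for illustration:

```python
import numpy as np

def morlet(t, omega0=1.75):
    # Real Morlet mother wavelet: cos(omega0 * t) * exp(-t^2 / 2);
    # omega0 = 1.75 is a common choice, assumed here.
    return np.cos(omega0 * t) * np.exp(-t ** 2 / 2.0)

def morlet_ti_kernel(x, y, a=1.0):
    # Translation invariant Morlet wavelet kernel (coordinate-wise product).
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.prod(morlet((x - y) / a))
```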

In general, a single-scale kernel, for example, the Gaussian kernel, is a smooth kernel and thus may not be able to capture some local behaviors of data. Wavelet kernels are more flexible than other types of kernels, for example, polynomial kernels or the Gaussian kernel. This is why the mother wavelet functions are adopted as kernels. Moreover, multiscale wavelet kernels combine multiple single-scale wavelet kernels at different scales. They are more flexible than single-scale wavelet kernels because both large and small scales are used in the kernel functions.
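One simple way to combine single-scale kernels into a multiscale one is to sum TI wavelet kernels over a set of dyadic scales; a sum of positive semidefinite kernels remains positive semidefinite. The particular scale set and the Mexican hat wavelet below are illustrative assumptions, not the paper's exact construction:

```python
import numpy as np

def mexican_hat(t):
    # Mexican hat (Ricker) mother wavelet, up to a normalizing constant.
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def multiscale_ti_kernel(x, y, scales=(0.5, 1.0, 2.0, 4.0)):
    # Multiscale TI wavelet kernel: a sum of single-scale TI kernels,
    # so both fine and coarse scales contribute to the similarity measure.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return sum(np.prod(mexican_hat((x - y) / a)) for a in scales)
```

Because each single-scale term equals 1 on the diagonal, K(x, x) equals the number of scales used.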

In this section, we discuss the computation of the kernel matrix for a multiscale wavelet kernel, which needs to be addressed for KPCA. For a given data set,

For each

As noted earlier, in kernel-based methods it is important that the constructed kernel matrix is positive semidefinite. The kernel matrix
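In practice, positive semidefiniteness of a computed kernel matrix can be verified numerically from its eigenvalues; a small tolerance absorbs floating-point round-off:

```python
import numpy as np

def is_psd(K, tol=1e-10):
    # A symmetric matrix is positive semidefinite iff its smallest
    # eigenvalue is (numerically) nonnegative.
    Ks = (K + K.T) / 2.0  # symmetrize against round-off error
    return bool(np.linalg.eigvalsh(Ks).min() >= -tol)
```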

The purpose of the simulation experiments is to explore the performance of our proposed method on noisy multiscale data under different levels of the signal-to-noise ratio.

We consider two-class two-dimensional clustered data, denoted by

In this section, we discuss how the WKPCA method performs in data classification. As we aim for linearly separable features, we apply a linear classifier, the Fisher linear discriminant (FLD), to our classification problems to see whether the linearity of the data is improved after feature extraction. The feature extraction methods by PCA, by single-scale WKPCA with respect to different values of kernel parameter
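A minimal sketch of the two-class FLD classifier used here: the projection direction is w = Sw^{-1}(m1 - m0), with the threshold placed at the midpoint of the projected class means (the midpoint rule is an assumption; the paper does not restate its threshold choice):

```python
import numpy as np

def fld_fit(X0, X1):
    # Two-class Fisher linear discriminant: w = Sw^{-1} (m1 - m0),
    # where Sw is the pooled within-class scatter matrix.
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = (X0 - m0).T @ (X0 - m0)
    S1 = (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(S0 + S1, m1 - m0)
    b = -0.5 * (m0 + m1) @ w  # threshold at midpoint of projected means
    return w, b

def fld_predict(X, w, b):
    # Returns 1 for points assigned to the class of X1, else 0.
    return (X @ w + b > 0).astype(int)
```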

The training data and the test data are simulated using the simulation model described in Section

Figures

Classification accuracy rates of the different types of feature extraction methods: PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots), for clustered data. The green plots correspond to the mean value of the classification accuracy rates of STIWK, which are calculated over all values of the parameter

Mexican hat kernel and homogeneous clustered data

Mexican hat kernel and heterogeneous clustered data

Morlet kernel and homogeneous clustered data

Morlet kernel and heterogeneous clustered data

From the discussion in Section

In order to demonstrate the application of WKPCA as a feature extraction method in the classification of multiscale data, we simulate the training data and the test data using the simulation model described in Section

In Figure

The multiscale WKPCA with TI kernels outperforms the conventional PCA, STIWK PCA, and MDWK PCA. However, the classification accuracy rates are based on only one training data set and one test data set for each pair of (

The values of

Besides the choice of kernel and the determination of kernel parameters, the feature dimension is also an important issue. The classification accuracy for a given data set may depend on the choice of feature dimension, so we investigate how classification accuracy is related to the feature dimension. Estimates of the average classification accuracy rate and its sample standard deviation are obtained by the Monte Carlo method. The average classification accuracy rates for different numbers of retained features are reported in Figures
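The Monte Carlo estimates above can be sketched generically: each trial simulates a training/test pair, extracts features, classifies, and returns an accuracy; the mean and sample standard deviation are then taken over trials. The callback interface below is illustrative:

```python
import numpy as np

def monte_carlo_accuracy(run_trial, n_runs=100, seed=0):
    # run_trial(rng) should simulate one training/test pair, extract
    # features, classify the test set, and return an accuracy in [0, 1].
    rng = np.random.RandomState(seed)
    accs = np.array([run_trial(rng) for _ in range(n_runs)])
    # Average accuracy and its sample standard deviation (ddof=1).
    return accs.mean(), accs.std(ddof=1)
```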

Average classification accuracy rates when FLD classifier is used only (cyan plots) and when a feature extraction method plus FLD classifier is used for 100 Monte Carlo simulations. The considered different types of feature extraction methods are PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots) for the data simulated with different values of

Gaussian kernel and 2 features

Mexican hat kernel and 2 features

Gaussian kernel and 4 features

Mexican hat kernel and 4 features

Gaussian kernel and 20 features

Mexican hat kernel and 20 features

Sample standard deviations of average classification accuracy rates when FLD classifier is used only (cyan plots) and when a feature extraction method plus FLD classifier is used for 100 Monte Carlo simulations. The considered different types of feature extraction methods are PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots) for the data simulated with different values of

Gaussian kernel and 2 features

Mexican hat kernel and 2 features

Gaussian kernel and 4 features

Mexican hat kernel and 4 features

Gaussian kernel and 20 features

Mexican hat kernel and 20 features

In order to demonstrate how the proposed methods perform when applied to real data, we use a set of EEG signals recorded from healthy volunteers and from patients during seizure-free intervals. EEG signals are typically multiscale and nonstationary in nature. The database is from the University of Bonn, Germany (

The problem we consider is the classification of normal signals (i.e., set A) and epileptic signals (i.e., set E). Since we deal with extremely high dimensional data (i.e.,
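The dimension reduction step keeps only the approximation coefficients of a wavelet decomposition at a chosen level. As a minimal sketch, a Haar-wavelet approximation cascade is shown below; the Haar choice, the input normalization, and the truncation to even length are our assumptions for illustration, not necessarily the paper's settings:

```python
import numpy as np

def haar_approx(signal, level):
    # Approximation coefficients of a level-`level` Haar decomposition:
    # repeated pairwise averaging, scaled by sqrt(2) to preserve energy.
    c = np.asarray(signal, dtype=float)
    c = (c - c.mean()) / c.std()      # normalize the input signal first
    for _ in range(level):
        c = c[: (len(c) // 2) * 2]    # truncate to even length if needed
        c = (c[0::2] + c[1::2]) / np.sqrt(2.0)
    return c
```

Each decomposition level halves the length, so a high level (e.g., 8 to 10) reduces a long EEG trace to a short feature vector.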

The plots of the coefficients of the wavelet approximation of sample EEG signals at various wavelet decomposition levels, that is, 10, 9, and 8.

Set A, at level 10

Set E, at level 10

Set A, at level 9

Set E, at level 9

Set A, at level 8

Set E, at level 8

As the input signals are normalized before the wavelet analysis, we eliminate the differences in signal energy between the groups. The high variability of the extracted features reflects the high signal variability in the original time domain. As we can see, the extracted features of the normal signals fluctuate more than those of the epileptic ones. This coincides with the clinical findings about the rhythms of epileptic signals, which fluctuate more regularly, that is, tend to be more deterministic. For all three cases considered, WKPCA coupled with different kernels is applied to the wavelet approximation coefficients of the signals, and up to 20 principal components are extracted. The obtained classification accuracy results, using different types of wavelet kernel and simple classifiers, are reported in Figure

The classification accuracy with respect to different numbers of principal components retained under the Gaussian and Mexican hat TI wavelet kernels, using the coefficients of the wavelet approximation at various decomposition levels. The classifiers used are FLD and 1-NN.

Gaussian kernel, at level 10

Mexican hat kernel, at level 10

Gaussian kernel, at level 9

Mexican hat kernel, at level 9

Gaussian kernel, at level 8

Mexican hat kernel, at level 8

This paper introduced wavelet kernel PCA in order to better capture data similarity measures in the kernel matrix. Multiscale wavelet kernels were constructed from a given mother wavelet function to improve the performance of KPCA as a feature extraction method in multiscale data classification. Based on the analysis of the simulation data sets and the real data, we observed that the multiscale translation invariant wavelet kernel in KPCA has enhanced performance in feature extraction. The multiscale method for constructing a wavelet kernel in KPCA improves the robustness of KPCA in data classification, as it tends to smooth out the locally modulated behavior caused by some types of mother wavelet. The application to real data was demonstrated through an EEG classification problem, and the obtained results show the improvement of linearity after applying the multiscale WKPCA. Therefore, a simple linear classifier becomes suitable for classifying the extracted features. This work focused on two important aspects: the construction of Mercer-type wavelet kernels for kernel PCA, and the investigation of the applicability of the proposed method.

The multiscale wavelet kernels proposed for KPCA may also be useful for other kernel-based methods in pattern recognition, such as support vector regression, kernel discriminant analysis, kernel density estimation, or curve fitting. Many kernel-based statistical methods require the optimization of kernel parameters, which is usually computationally expensive for high dimensional data. For such methods, the use of generic multiscale kernels can be impractical, as the computational cost increases dramatically with the number of kernel parameters. The multiscale wavelet kernels, by contrast, narrow down the search for parameter values, because a linear combination of multiple kernel functions constructed from a mother wavelet function is considered in this approach; it aims at capturing the multiscale components of the data. However, since the multiscale wavelet kernels are nonparametric, kernel-based methods using them may not lead to an optimal solution to the problem.

Let

By the Fourier condition theorem in [

S. Xie acknowledges the financial support from MITACS and Ryerson University, under MITACS Elevate Strategic Post-doctoral Award. P. Lio is supported by RECOGNITION: Relevance and Cognition for Self-Awareness in a Content-Centric Internet (257756), which is funded by the European Commission within the 7th Framework Programme (FP7).