Wavelet Kernel Principal Component Analysis in Noisy Multiscale Data Classification

We introduce multiscale wavelet kernels to kernel principal component analysis (KPCA) to narrow down the search for the parameters required in the calculation of a kernel matrix. This new methodology incorporates multiscale methods into KPCA for transforming multiscale data. To illustrate the application of the proposed method and to investigate the robustness of the wavelet kernel in KPCA under different levels of the signal-to-noise ratio and different types of wavelet kernel, we study a set of two-class clustered simulation data. We show that wavelet kernel PCA (WKPCA) is an effective feature extraction method for transforming a variety of multidimensional clustered data into data with a higher level of linearity among the data attributes, which improves the accuracy of simple linear classifiers. Based on the analysis of the simulation data sets, we observe that multiscale translation invariant wavelet kernels for KPCA have enhanced performance in feature extraction. The application of the proposed method to real data is also addressed.


Introduction
The majority of the techniques developed in computational mathematics and statistics for modeling multivariate data have focused on detecting or explaining linear relationships among the variables, as in principal component analysis (PCA) [1]. However, in real-world applications linearity is a rather special case, and most of the captured behaviors of data are nonlinear. In data classification, a possible way to handle nonlinearly separable problems is to use a nonlinear classifier [2,3]. In this approach, a classifier constructs an underlying objective function using some selected components of the original input data. An alternative approach, presented in this paper, is to map the data from the original input space into a feature space through kernel-based methods [4,5].
PCA is often used for feature extraction in high-dimensional data classification problems. The objective of PCA is to map the data attributes into a new feature space that contains better, that is, more linearly separable, features than those in the original input space. As standard PCA is linear in nature, the projections in the principal component space do not always yield meaningful results for classification purposes. To solve this problem, various kernel-based methods have been applied successfully in machine learning and data analysis (e.g., [6-10]). The introduction of the kernel allows working implicitly in some extended feature space while doing all computations in the original input space.
Recently, wavelet kernels have been successfully used in support vector machine (SVM) learning to classify data because of their high flexibility [9,11]. The Gaussian wavelet kernel, one of the most common kernels used in practice, has been used as either a dot-product kernel or a translation invariant kernel. Many other wavelet kernels are also commonly used, including the cubic B-spline wavelet kernel, the Mexican hat wavelet kernel, and the Morlet wavelet kernel. Although kernel-based classification methods make it possible to capture the nonlinearity of the data attributes in the feature space, they are usually sensitive to the choice of the parameters of a given kernel [6]. Similarly, in kernel PCA (KPCA) [12,13], optimization of kernel parameters is difficult. The search

Kernel PCA
KPCA aims, for a given data set $\{x_1, \ldots, x_n : x_j \in \mathbb{R}^d \text{ for all } j\}$, to capture the nonlinear relationships among the data by mapping the original observations $x_1, \ldots, x_n \in \mathbb{R}^d$ into a feature space $F$ that is spanned by the column vectors $\Phi(x_1), \ldots, \Phi(x_n)$, where the function $\Phi(\cdot)$ maps $x_i$ into the feature space, for each $i = 1, \ldots, n$ [2,3,13]. The map $\Phi(\cdot)$ is usually determined by the Gaussian function, by a polynomial function, or by a reproducing kernel in Hilbert space. Here, we focus on the wavelet kernels that result in positive semidefinite kernel matrices. Assuming that the data $\Phi(x_1), \ldots, \Phi(x_n)$ in the feature space are centered (this assumption will be relaxed later), and viewing $\Phi(x_1), \ldots, \Phi(x_n)$ as independent random vectors, the sample covariance matrix of these random vectors can be written as follows (see [2]):

$$C = \frac{1}{n} \sum_{j=1}^{n} \Phi(x_j)\,\Phi(x_j)^{T}.$$

The aim of PCA applied to the covariance matrix $C$ is to find the eigenvalues $\lambda$ and eigenvectors $V$ of $C$. Because the eigenvectors $V$ lie in the span of $\Phi(x_1), \ldots, \Phi(x_n)$, the calculation of eigenvalues and eigenvectors can be done through the so-called kernel matrix $K$, which is defined by

$$K = (K_{ij})_{i,j=1}^{n}, \qquad K_{ij} = k(x_i, x_j) = \langle \Phi(x_i), \Phi(x_j) \rangle.$$

The objective of the principal component extraction is to project the transformed observation $\Phi(x)$ into the linear space spanned by the normalized eigenvectors $c_l$, for $l = 1, \ldots, n$. As we focus only on Mercer kernels, $C$ is a positive semidefinite matrix and all its eigenvalues are nonnegative. Thus, the coefficients of the projected vector $\Phi(x)$ are given by

$$\langle V_l, \Phi(x) \rangle = \sum_{i=1}^{n} c_l^{i}\, k(x_i, x),$$

where $V_l = \sum_{i=1}^{n} c_l^{i}\, \Phi(x_i)$ is the $l$th eigenvector of $C$, $c_l = (c_l^{1}, \ldots, c_l^{n})^{T}$, and $l = 1, \ldots, n$.

In the above derivation we assumed that $\Phi(x_1), \ldots, \Phi(x_n)$ are centered. In practice, one needs to relax this condition. Therefore, instead of using the kernel matrix $K$, one should work with the centered version of $K$, which is given by the following expression:

$$K^{*} = K - 1_n K - K 1_n + 1_n K 1_n,$$

where $1_n$ is the $n \times n$ matrix with all entries equal to $1/n$. The details of the derivation of $K^{*}$ can be found in [2].
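The KPCA steps described in this section (centering the kernel matrix, solving the eigenvalue problem, and projecting the data) can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `center_kernel`, `kpca_fit`, and `kpca_transform_train` are hypothetical, the toy data and the Gaussian kernel are placeholders, and the scaling of the coefficient vectors by the square root of the eigenvalues of the centered kernel matrix is one common normalization convention.

```python
import numpy as np

def center_kernel(K):
    """Centered kernel matrix K* = K - 1n K - K 1n + 1n K 1n."""
    n = K.shape[0]
    one_n = np.full((n, n), 1.0 / n)  # n x n matrix with all entries 1/n
    return K - one_n @ K - K @ one_n + one_n @ K @ one_n

def kpca_fit(K, n_components):
    """Eigen-decompose the centered kernel matrix and scale the coefficients."""
    K_c = center_kernel(K)
    eigvals, eigvecs = np.linalg.eigh(K_c)            # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Scale so that the corresponding feature-space eigenvectors have unit norm.
    coeffs = eigvecs / np.sqrt(np.maximum(eigvals, 1e-12))
    return K_c, coeffs, eigvals

def kpca_transform_train(K_c, coeffs):
    """Project the training points onto the retained principal components."""
    return K_c @ coeffs

# Toy usage with a Gaussian kernel (placeholder data, not from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dist / 2.0)
K_c, coeffs, eigvals = kpca_fit(K, n_components=3)
Z = kpca_transform_train(K_c, coeffs)  # extracted features, one row per point
```

With this normalization the extracted features are uncorrelated and the squared norm of each feature column equals the corresponding eigenvalue of the centered kernel matrix.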

Multiscale Dot-Product Wavelet Kernel Construction
In this section, a method of constructing a dot-product type wavelet kernel using multiple mapping functions is proposed. For a single-scale kernel (i.e., only one translation factor), the performance of KPCA in data classification may be affected by both the choice of kernel and the choice of values of the kernel parameters. The practical solution is first to investigate which kernels are appropriate for the data, and then to search for suitable parameter values for the selected kernel. When data appear to be multiscale, for example, when they exhibit nonstationarity in mean or in variance, single-scale KPCA may not be a good choice as the feature extraction method in data classification because of the complex structure of the data. The construction of kernels based on multiple mapping functions provides a framework for extending a single-scale kernel to a multiscale kernel in KPCA.

Let $\varphi_i : \mathbb{R}^d \to F_i$, $i \in \{1, \ldots, g\}$, be a nonlinear map and $F_i$ be the respective feature space, where $x$ is a column vector and $g$ is the total number of mapping functions. From $\varphi_i$, we construct another mapping function $\tilde{\varphi}_i : x \in \mathbb{R}^d \to H$ defined as

$$\tilde{\varphi}_i(x) = \bigl(0, \ldots, 0, \varphi_i(x)^{T}, 0, \ldots, 0\bigr)^{T},$$

with $\varphi_i(x)$ in the $i$th block and zeros elsewhere, where $H$ is a Hilbert feature space being the direct sum of the $F_i$ and $\tilde{\varphi}_i(x)$ is a column vector with $dg$ entries for a given $x$. Define a new map $\Phi^{*}$ based on $\tilde{\varphi}_i(x)$ as $\Phi^{*}(x) := (\tilde{\varphi}_1(x), \ldots, \tilde{\varphi}_g(x))$. In this case, $\Phi^{*}$ maps $x$ into a $dg \times g$ two-dimensional feature space. Using the map $\Phi^{*}$ as a feature map in KPCA, the original data set $\{x_1, \ldots, x_n\}$ is mapped to $\{\Phi^{*}(x_1), \ldots, \Phi^{*}(x_n)\}$. As a result, $\Phi^{*}$ has $ng$ columns, a number that is usually very large. The high dimension of the feature map makes KPCA computationally intensive. One solution to this problem is to reduce the dimension of $\Phi^{*}$. Instead of arranging $\Phi^{*}(x)$ in a matrix, we arrange its columns into a vector

$$\Phi(x) = \sum_{i=1}^{g} \sqrt{\alpha_i}\, \tilde{\varphi}_i(x),$$

which replaces $\Phi^{*}(x)$, where $\sqrt{\alpha_i}$ is a weight coefficient applied to the map $\tilde{\varphi}_i(x)$ and $\alpha_i$ is a positive real value. For simplicity, $\alpha_i$ can be chosen as $1/g$. Without loss of generality, we assume that $\Phi(x)$ has zero mean. For $x, y \in \mathbb{R}^d$, using $\Phi(x)$ as the feature map, the kernel function in KPCA becomes

$$k(x, y) = \langle \Phi(x), \Phi(y) \rangle = \sum_{i=1}^{g} \sum_{j=1}^{g} \sqrt{\alpha_i \alpha_j}\, \langle \tilde{\varphi}_i(x), \tilde{\varphi}_j(y) \rangle = \sum_{i=1}^{g} \alpha_i\, \langle \tilde{\varphi}_i(x), \tilde{\varphi}_i(y) \rangle,$$

due to the fact that $\langle \tilde{\varphi}_i(x), \tilde{\varphi}_j(y) \rangle = 0$ for $i \neq j$. If we denote $k_i(x, y) = \langle \tilde{\varphi}_i(x), \tilde{\varphi}_i(y) \rangle$, then

$$k(x, y) = \sum_{i=1}^{g} \alpha_i\, k_i(x, y). \tag{6}$$

Therefore, a single-scale kernel is just a special case of a kernel with multiple mapping functions that takes $g = 1$ and $\alpha_i = 1$.
Using a mother wavelet function $\psi_{jk}(\cdot)$ with dilation factor $a_j$ and translation factor $b_k$, for $j = 0, \ldots, J - 1$ and $0 \le k \le N$, as a set of basis functions of the mapping function $\varphi_i(x)$, and taking $\alpha_i = 1/a_j$, the kernel function in (6) can be rewritten as

$$k(x, y) = \sum_{j=0}^{J-1} \frac{1}{a_j} \sum_{k=0}^{N} \prod_{i=1}^{d} \psi\!\left(\frac{x_i - b_k}{a_j}\right) \psi\!\left(\frac{y_i - b_k}{a_j}\right). \tag{7}$$

We call the kernel function in (7) the multiscale dot-product wavelet kernel (MDWK). The MDWK is a special case of the dot-product wavelet kernel in which the dilated and translated versions of a mother wavelet function are chosen as the multiple mapping functions. In kernel-based methods, the constructed kernel must be a Mercer kernel, that is, it must have a positive semidefinite kernel matrix [3].
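Theorem 1. Let $\psi(x)$ be a mother wavelet function, and let $a_j, b_k \in \mathbb{R}^{+}$ denote the dilation and translation factors, respectively. Then, for any $x, y \in \mathbb{R}^d$ and a finest resolution level $J$, the dot-product wavelet kernel function

$$k(x, y) = \sum_{j=0}^{J-1} \frac{1}{a_j} \sum_{k=0}^{N} \prod_{i=1}^{d} \psi\!\left(\frac{x_i - b_k}{a_j}\right) \psi\!\left(\frac{y_i - b_k}{a_j}\right)$$

is a Mercer kernel.

The proof of this theorem is provided in the Appendix. As a special case, we obtain the single-scale dot-product wavelet kernel (SDWK)

$$k(x, y) = \prod_{i=1}^{d} \psi\!\left(\frac{x_i - b}{a}\right) \psi\!\left(\frac{y_i - b}{a}\right), \tag{9}$$

where $a \in \mathbb{R}^{+}$ and $b \in \mathbb{R}$.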
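A kernel matrix of the dot-product type can be assembled as a weighted sum of outer products, one per pair of dilation and translation factors. The sketch below assumes that each mapping function is the product over coordinates of the dilated and translated mother wavelet and that the weight is 1/a_j, as stated above; the function names `mexican_hat` and `mdwk_matrix`, the toy data, and the default `u0` are illustrative choices, not the authors' code.

```python
import numpy as np

def mexican_hat(t):
    """Mexican hat mother wavelet, psi(t) = (1 - t^2) exp(-t^2 / 2)."""
    return (1.0 - t ** 2) * np.exp(-t ** 2 / 2.0)

def mdwk_matrix(X, psi, a_values, k_values, u0=0.5):
    """Kernel matrix of an assumed multiscale dot-product wavelet kernel.

    For each scale a_j and shift b_k = k * u0 * a_j, the feature
    f(x) = prod_i psi((x_i - b_k) / a_j) contributes (1 / a_j) * f(x) * f(y).
    """
    n = X.shape[0]
    K = np.zeros((n, n))
    for a in a_values:
        for k in k_values:
            b = k * u0 * a
            f = np.prod(psi((X - b) / a), axis=1)  # one feature value per data point
            K += np.outer(f, f) / a
    return K

# Toy usage (placeholder data): six scales in powers of 2^0.25, shifts k = 0..10.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
a_values = 2.0 ** (0.25 * np.arange(6))
K = mdwk_matrix(X, mexican_hat, a_values, range(11))
min_eig = np.linalg.eigvalsh(K).min()  # nonnegative up to rounding error
```

Because the matrix is a positively weighted sum of outer products, its smallest eigenvalue is nonnegative up to rounding error, which matches the Mercer property claimed in Theorem 1.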

Multiscale Translation Invariant Wavelet Kernel Construction
Another type of single-scale kernel is a distance-based function called the translation invariant (TI) kernel [11]. The TI kernel is defined as $k(x, y) = \Phi(x - y)$, where $x, y \in \mathbb{R}^d$. However, for a TI kernel to be used as a kernel in KPCA, one again has to show that the kernel matrix constructed from the TI kernel is positive semidefinite. To show this, we notice that $k_j(x, y) = \prod_{i=1}^{d} \psi_j((x_i - y_i)/a)$, where $a$ is a single-scale parameter. A kernel defined as

$$k(x, y) = \sum_{j=1}^{g} k_j(x, y)$$

is also a Mercer kernel if the $k_j(x, y)$ are Mercer kernels for all $j = 1, \ldots, g$. In order for a multiscale wavelet kernel to be a Mercer kernel, the single-scale kernel based on a given mother wavelet function must be a Mercer kernel. A family of TI wavelet Mercer kernels often used in machine learning is the family of Gaussian wavelet kernels, described as follows.
Let $\psi(x) = (-1)^{p} C_{2p}(x) \exp(-x^2/2)$ be a Gaussian mother wavelet function, where $C_{2p}(x) \exp(-x^2/2)$ is the $2p$th derivative of the Gaussian function. Then the TI Mercer kernel using this Gaussian mother wavelet function is

$$k(x, y) = \prod_{i=1}^{d} (-1)^{p}\, C_{2p}\!\left(\frac{x_i - y_i}{a}\right) \exp\!\left(-\frac{(x_i - y_i)^2}{2a^2}\right).$$

Different values of $p$ give different Gaussian mother wavelet functions. In particular, when $p = 0$, $C_{2p}(x) = 1$ and this Gaussian wavelet function is the Gaussian function; when $p = 1$, $C_{2p}(x) = x^2 - 1$ and this Gaussian wavelet function is the so-called Mexican hat mother wavelet [25].
The Morlet mother wavelet has recently been used in signal classification and compression [26]. We present a TI wavelet Mercer kernel based on the Morlet mother wavelet function because this mother wavelet has not been used as a kernel in either support vector regression or SVM. Using this type of wavelet kernel requires both a proof that it is a Mercer kernel and an investigation of its performance in KPCA.

Theorem 2. The Morlet mother wavelet function is $\psi(x) = \cos(5x) \exp(-x^2/2)$. The Mercer kernel using this Morlet mother wavelet function is

$$k(x, y) = \prod_{i=1}^{d} \cos\!\left(\frac{5 (x_i - y_i)}{a}\right) \exp\!\left(-\frac{(x_i - y_i)^2}{2a^2}\right).$$

The proof of this theorem is provided in the Appendix.
In general, a single-scale kernel, for example, the Gaussian kernel, is a smooth kernel and thus may not be able to capture some local behaviors of data. Wavelet kernels are more flexible than other types of kernels, for example, polynomial kernels or the Gaussian kernel. This is why the mother wavelet functions are adopted as kernels. Moreover, multiscale wavelet kernels combine multiple single-scale wavelet kernels at different scales. They are more flexible than single-scale wavelet kernels because both large and small scales are used in the kernel functions.
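The translation invariant kernels of this section can be sketched in a few lines of NumPy. The sketch assumes that the single-scale kernel is the product over coordinates of the mother wavelet evaluated at (x_i - y_i)/a, and that the multiscale kernel is an unweighted sum of single-scale kernels; the equal weights and all function names are assumptions made for illustration.

```python
import numpy as np

def gaussian_wavelet(u):
    """Gaussian function (the case p = 0)."""
    return np.exp(-u ** 2 / 2.0)

def mexican_hat_wavelet(u):
    """Mexican hat mother wavelet (the case p = 1)."""
    return (1.0 - u ** 2) * np.exp(-u ** 2 / 2.0)

def morlet_wavelet(u):
    """Morlet mother wavelet, cos(5u) exp(-u^2 / 2)."""
    return np.cos(5.0 * u) * np.exp(-u ** 2 / 2.0)

def ti_kernel_matrix(X, psi, a):
    """Single-scale TI kernel: k(x, y) = prod_i psi((x_i - y_i) / a)."""
    diff = (X[:, None, :] - X[None, :, :]) / a  # pairwise scaled differences
    return np.prod(psi(diff), axis=-1)

def mtiwk_matrix(X, psi, a_values):
    """Multiscale TI kernel as an unweighted sum over scales (assumed weights)."""
    return sum(ti_kernel_matrix(X, psi, a) for a in a_values)

# Toy usage (placeholder data).
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 2))
a_values = 2.0 ** (0.25 * np.arange(6))
K_single = ti_kernel_matrix(X, morlet_wavelet, a=1.0)
K_multi = mtiwk_matrix(X, mexican_hat_wavelet, a_values)
```

All three mother wavelets equal 1 at the origin, so the diagonal of a single-scale kernel matrix is 1 and the diagonal of the multiscale matrix equals the number of scales.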

Computation of Multiscale Wavelet Kernels
In this section, we discuss the computation of the kernel matrix of a multiscale wavelet kernel, which needs to be addressed for KPCA. For a given data set $\{x_1, \ldots, x_n : x_j \in \mathbb{R}^d \text{ for all } j\}$, we first calculate the sample standard deviation of the data in coordinate $l$, denoted by $\sigma_l$, for $l = 1, \ldots, d$. The data in coordinate $l$ are then divided by $\sigma_l$ to remove the potential effect of different scales of the observations. Before PCA is applied, a kernel matrix $K$ is computed from either the dot-product type of function (7) or the translation invariant type of function. In the computation of the kernel matrix of the MDWK described in (7), the values of $a_j$ and $b_k$ and their indexes $j$ and $k$ are selected in this paper as follows. The values of $a_j$ are in powers of 2, that is, $a_j \in \{1, 2^{0.25}, \ldots, 2^{0.25 j}, \ldots, 2^{0.25 (J-1)}\}$ for a given level $J$, which is 6 in this paper.
For each $a_j$, the sequence $b_k$ is selected as $b_k = k u_0 a_j$, as suggested in [27]. Here, $u_0$ controls the resolution of $b_k$ and is set to 0.5. The range of $k$ is the set $\{0, 1, \ldots, 10\}$, which is determined by the border of the mother wavelet function used in this paper. For the multiscale translation invariant wavelet kernel (MTIWK), one does not need to specify the values of $b_k$, and the values of $a_j$ are chosen to be the same as those in the MDWK. The multiscale kernel functions are constructed via a semiparametric method because we do not calibrate the kernel parameters. Instead, we use the dilated and translated versions of a mother wavelet function, with the parameters in powers of 2. In this paper we use the following mother wavelet functions: the Gaussian mother wavelet function, the Morlet mother wavelet function, and the Mexican hat mother wavelet function.
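The parameter choices above translate directly into a small grid. The following sketch builds the dilation factors, the translation factors, and the coordinate-wise scaling step; the variable names and the helper `standardize` are hypothetical, and the values J = 6, u0 = 0.5, and k = 0, ..., 10 are those quoted in the text.

```python
import numpy as np

J = 6        # number of scales used in the paper
u0 = 0.5     # resolution of the translation grid

a = 2.0 ** (0.25 * np.arange(J))  # a_j = 2^(0.25 j), j = 0, ..., J - 1
k = np.arange(0, 11)              # k = 0, 1, ..., 10
b = u0 * np.outer(a, k)           # b[j, k] = k * u0 * a_j

def standardize(X):
    """Divide each coordinate by its sample standard deviation."""
    return X / X.std(axis=0, ddof=1)

# Toy usage (placeholder data with very different coordinate scales).
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2)) * np.array([1.0, 10.0])
X_scaled = standardize(X)
```

Only the grid itself is fixed in advance; no kernel parameter is tuned against the data, which is the sense in which the search over kernel parameters is narrowed down.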
As noted earlier, in kernel-based methods it is important that the constructed kernel matrix is positive semidefinite. The kernel matrix $K$ based on the SDWK defined in (9) is always positive semidefinite [3]. The MDWK defined in (7) is also a Mercer kernel, as a linear combination of Mercer kernels with nonnegative weights is a Mercer kernel [19]. A single-scale TI kernel is a Mercer kernel if it satisfies the Fourier condition [3], which implies that the kernel matrix is positive semidefinite. For a multiscale TI wavelet kernel to be a Mercer kernel, the single-scale TI kernel based on a given mother wavelet function must be a Mercer kernel. The Gaussian kernel and the Mexican hat kernel are Mercer kernels. Therefore, the multiscale TI Gaussian kernel and the multiscale TI Mexican hat kernel are Mercer kernels.

Simulation Experiments
The purpose of the simulation experiments is to explore the performance of the proposed method in applications to noisy multiscale data under different levels of the signal-to-noise ratio.

Simulation Design
6.1.1. Clustered Data. We consider two-class two-dimensional clustered data, denoted by $D = \{x_i, y_i : i = 1, \ldots, n\}$, where $x_i = (x_{i,1}, x_{i,2})$ represents the data of Cluster 1, $y_i = (y_{i,1}, y_{i,2})$ represents the data of Cluster 2, and $n$ is the total number of data points of each cluster. The simulation model is given by the following expressions: where $(x_{0,1}, x_{0,2})$ and $(y_{0,1}, y_{0,2})$ are the coordinates of the centers of Cluster 1 and Cluster 2, respectively; $\sigma_x^r$ and $\sigma_y^s$ are the signal-to-noise ratios of each dimension of Cluster 1 and Cluster 2, respectively. The added underlying noises $e_i \sim N(0, 1)$ are independent and identically distributed for both clusters.
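A data generator in the spirit of this design might look as follows. The sketch assumes that each coordinate equals the cluster center plus the cluster's σ multiplied by an independent N(0, 1) draw; this additive form, the function name `simulate_clusters`, and the 0/1 labels are assumptions. The centers (0, 5) and (4, 0) and n = 100 in the usage example are the values quoted for the homogeneous experiments.

```python
import numpy as np

def simulate_clusters(center_x, center_y, sigma_x, sigma_y, n, rng):
    """Two-class, two-dimensional clustered data.

    Assumed model: coordinate = cluster center + sigma * N(0, 1) noise,
    with independent noise for every point and coordinate.
    """
    x = np.asarray(center_x) + sigma_x * rng.standard_normal((n, 2))  # Cluster 1
    y = np.asarray(center_y) + sigma_y * rng.standard_normal((n, 2))  # Cluster 2
    data = np.vstack([x, y])
    labels = np.concatenate([np.zeros(n, dtype=int), np.ones(n, dtype=int)])
    return data, labels

# Usage: a homogeneous setting with a small noise level.
rng = np.random.default_rng(4)
data, labels = simulate_clusters((0.0, 5.0), (4.0, 0.0), 0.1, 0.1, n=100, rng=rng)
```

Setting `sigma_x` equal to `sigma_y` gives the homogeneous case, and unequal values give the heterogeneous case studied later.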

Data Classification.
In this section, we discuss the results on how the WKPCA method performs in data classification.
As we aim for linearly separable features, we apply a linear classifier, namely the Fisher linear discriminant (FLD), to our classification problems to see whether the linearity of the data is improved after feature extraction. We consider feature extraction by PCA, by single-scale WKPCA with respect to different values of the kernel parameter $a$, and by multiscale WKPCA. The Gaussian function, the Mexican hat mother wavelet function, and the Morlet mother wavelet function are used for constructing kernels. The set of values of the kernel parameter $a \in \{1, 2^{0.25}, 2^{0.5}, 2^{0.75}, 2, 2^{1.25}\}$ is selected for the single-scale WKPCA. The multiscale wavelet kernels are all constructed using every value of $a_j$ in the same set $\{1, 2^{0.25}, 2^{0.5}, 2^{0.75}, 2, 2^{1.25}\}$. In order to evaluate the performance of the feature extraction methods, the average classification accuracy rate of the single-scale WKPCA, calculated over all values of the parameter $a$ used in the single-scale WKPCA, is compared to that of both the multiscale WKPCA and conventional PCA.
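For readers who want to reproduce the classification step, a two-class Fisher linear discriminant can be sketched as below. The midpoint threshold (which implicitly assumes equal class priors), the small ridge term added to the within-class scatter matrix, and the function names are choices made for this illustration; the paper does not specify these details.

```python
import numpy as np

def fld_fit(X, y, ridge=1e-8):
    """Fisher linear discriminant for two classes labeled 0 and 1."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix; the ridge term guards against singularity.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    w = np.linalg.solve(Sw + ridge * np.eye(X.shape[1]), m1 - m0)
    threshold = w @ (m0 + m1) / 2.0  # midpoint of the projected class means
    return w, threshold

def fld_predict(X, w, threshold):
    """Assign class 1 when the projection exceeds the threshold."""
    return (X @ w > threshold).astype(int)

# Toy usage on two well-separated clusters (placeholder data).
rng = np.random.default_rng(5)
X = np.vstack([
    np.array([0.0, 5.0]) + 0.1 * rng.standard_normal((100, 2)),
    np.array([4.0, 0.0]) + 0.1 * rng.standard_normal((100, 2)),
])
y = np.repeat([0, 1], 100)
w, threshold = fld_fit(X, y)
accuracy = np.mean(fld_predict(X, w, threshold) == y)
```

In the experiments, the rows of `X` would be the features extracted by PCA or WKPCA rather than the raw coordinates.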

Homogeneous Clustered Data.
The training data and the test data are simulated using the simulation model described in Section 6.1.1. The values of the model parameters for simulating both the training data sets and the test data sets are as follows: $x_{0,1} = 0$, $x_{0,2} = 5$, $y_{0,1} = 4$, $y_{0,2} = 0$, and $n = 100$. In the case of $\sigma_x^r = \sigma_y^s$, the simulated clustered data are homogeneous between the clusters. We consider 25 different values of $\sigma_x^r$. Each pair of $\sigma_x^r$ and $\sigma_y^s$ used for simulating the training data and the test data is denoted by $(\sigma_x^r, \sigma_y^s)$, for $r = 1, 2, \ldots, 25$ and $s = 1, 2, \ldots, 25$. The values of $\sigma_x^r$ and $\sigma_y^s$ are taken as $\sigma_x^1 = \sigma_y^1 = 0.1$, $\sigma_x^2 = \sigma_y^2 = 0.3$, $\sigma_x^3 = \sigma_y^3 = 0.5$, $\ldots$, and $\sigma_x^{25} = 5$, respectively.

Figures 1(a) and 1(c) show the classification accuracy rates for the PCA method and for the WKPCA method with different types of wavelet kernel and with respect to the different values of $\sigma_x$. In Figure 1(a), feature extraction by conventional PCA in the classification of the simulated homogeneous clustered data performs similarly to feature extraction by the WKPCA method. Although 20 extended features are used for classification, feature extraction by the WKPCA methods does not improve the classification accuracy rates in homogeneous clustered data classification. This result implies that KPCA-based feature extraction does not enhance the accuracy of data classification when a kernel-based feature extraction method plus a linear classifier is applied to linearly separable data. PCA and WKPCA perform similarly in the case of the Mexican hat kernel. Figure 1(c) shows that feature extraction by PCA and by MTIWK PCA have similar performance in data classification; however, feature extraction by multiscale dot-product KPCA performs worse than either PCA or MTIWK PCA. The single-scale wavelet KPCA has the worst performance in data classification. With the increase of data variation, MTIWK PCA behaves more robustly as a feature extraction method because it has the best performance among all the methods.

ISRN Computational Mathematics

Heterogeneous Clustered Data.
From the discussion in Section 6.2.1, we notice that WKPCA as a feature extraction method does not outperform the conventional PCA method for homogeneous clustered data. For some wavelet kernels, for example, the SDWK or MDWK based on the Morlet mother wavelet function, WKPCA as the feature extraction method performs worse than conventional PCA. This is because (1) the data in each coordinate of the clusters appear to be approximately single-scale, so conventional PCA is an appropriate feature extraction method for this type of data; and (2) the homogeneous clustered data can be treated as linearly separable data with large data variation. Therefore, using a nonlinear method of feature extraction does not improve the performance of feature extraction. From our experiments, we observe that the performance of WKPCA with the Mexican hat kernel is approximately equal to that of PCA.
In order to demonstrate the application of WKPCA as a feature extraction method in the classification of multiscale data, we simulate the training data and the test data using the simulation model described in Section 6.1.1. To simplify the problem, we fix the value of $\sigma_y^s$ to be 5 for $s = 1, 2, \ldots, 30$ and take different values for $\sigma_x^r$. The remaining model parameters are the same as those of Section 6.2.1, except for the values of $\sigma_x^r$, which are taken as $\sigma_x^1 = 0.1$, $\sigma_x^2 = 0.2$, $\ldots$, and $\sigma_x^{30} = 3$, respectively. In Figure 1(b), one can see that feature extraction by conventional PCA has the worst performance for the heterogeneous clustered simulation data, and that feature extraction by multiscale WKPCA performs better for the same simulation data. Also, for the data sets with $\sigma_x$ larger than 1.5, MTIWK in KPCA performs better than MDWK. The average classification accuracy rates (in green) obtained when the single-scale translation invariant (STI) wavelet kernel with $a = 1, 2^{0.25}, 2^{0.5}, 2^{0.75}, 2, 2^{1.25}$, respectively, is used in WKPCA are all lower than those obtained when the multiscale wavelet kernel is used in WKPCA. In Figure 1(d), MTIWK in KPCA is more robust as a feature extraction method than MDWK PCA and SDWK PCA. Conventional PCA is even better than WKPCA with the Morlet dot-product kernel, that is, MDWK and SDWK.

Performance Evaluation Based on Monte Carlo Simulation.
The multiscale WKPCA with TI kernels outperforms conventional PCA, STIWK PCA, and MDWK PCA. However, these classification accuracy rates are based on only one training data set and one test data set for each pair $(\sigma_x, \sigma_y)$. In order to further evaluate the performance of WKPCA as the feature extraction method, we use the Monte Carlo simulation method to estimate the average classification accuracy rates and their sample standard deviations using the simulation model presented in Section 6.1.1.
The values of $\sigma_x$ are taken as $0.1, 0.3, \ldots, 2.9$, with the other model parameters remaining the same as those in Section 6.2.2. For each simulation model setup with a different value of $\sigma_x$, the average classification accuracy rate and its sample standard deviation are computed for different types of kernel. The feature extraction method is chosen from the following: PCA; the multiscale WKPCA with either the Gaussian kernel or the Mexican hat kernel; and the single-scale WKPCA with either the Gaussian kernel or the Mexican hat kernel and with different values of the kernel parameter $a$ (i.e., $a = 1, 2^{0.25}, 2^{0.5}, 2^{0.75}, 2, 2^{1.25}$). In the case of the multiscale WKPCA, both the dot-product kernel and the TI kernel are considered. Only the TI kernel is investigated for the single-scale WKPCA. For each simulation model setup, $m = 100$ simulations are run, each with a different value of the random seed, to produce $m$ training data sets and $m$ test data sets.
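The Monte Carlo procedure can be outlined as a loop over random seeds that returns the mean and the sample standard deviation of the accuracy. To keep the sketch self-contained, a nearest-class-mean rule on the raw coordinates stands in for the feature extraction plus FLD pipeline, and the additive noise model is assumed as in the simulation design; both are substitutions made for illustration only, and the function names are hypothetical.

```python
import numpy as np

def simulate(sigma_x, sigma_y, n, rng):
    """Assumed model: cluster center + sigma * N(0, 1) noise per coordinate."""
    x = np.array([0.0, 5.0]) + sigma_x * rng.standard_normal((n, 2))
    y = np.array([4.0, 0.0]) + sigma_y * rng.standard_normal((n, 2))
    return np.vstack([x, y]), np.repeat([0, 1], n)

def nearest_mean_accuracy(X_tr, y_tr, X_te, y_te):
    """Stand-in classifier: assign each test point to the closer class mean."""
    m0 = X_tr[y_tr == 0].mean(axis=0)
    m1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = np.linalg.norm(X_te - m0, axis=1)
    d1 = np.linalg.norm(X_te - m1, axis=1)
    return np.mean((d1 < d0).astype(int) == y_te)

def monte_carlo(sigma_x, sigma_y=5.0, n=100, m=100):
    """Average accuracy and its sample standard deviation over m seeds."""
    accs = np.empty(m)
    for seed in range(m):
        rng = np.random.default_rng(seed)  # a different seed for every run
        X_tr, y_tr = simulate(sigma_x, sigma_y, n, rng)
        X_te, y_te = simulate(sigma_x, sigma_y, n, rng)
        accs[seed] = nearest_mean_accuracy(X_tr, y_tr, X_te, y_te)
    return accs.mean(), accs.std(ddof=1)

# Grid of sigma_x values 0.1, 0.3, ..., 2.9 (fewer runs here to keep it fast).
sigma_grid = np.linspace(0.1, 2.9, 15)
results = {round(float(s), 1): monte_carlo(s, m=20) for s in sigma_grid}
```

Replacing `nearest_mean_accuracy` with a function that extracts features and applies FLD would give accuracy curves of the kind reported in the paper, one value of `results` per noise level.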
Besides the choice of the kernel and the determination of kernel parameters, feature dimension is also an important issue. The classification accuracy for a given data set may depend on the choice of feature dimension, requiring an investigation of how classification accuracy is related to the feature dimension. Estimates of the average classification accuracy rate and its sample standard deviation are obtained by applying the Monte Carlo method. The results of the average classification accuracy rates for a different number of retained features are reported in Figures 2(a

Application to Epileptic EEG Signal Classification
The problem we consider is the classification of normal signals (i.e., set A) and epileptic signals (i.e., set E). Since we deal with extremely high-dimensional data (i.e., $d = 4097$), in order to make our classification task computationally efficient, we first extract the signal features by calculating the wavelet approximation coefficients of each signal in data sets A and E. These signals are normalized before applying the wavelet transform using the Symlet 8 wavelet. We use high-level wavelet decompositions because of concerns about sparsity and the goal of obtaining high signal discrimination power. Samples of the extracted features are shown in Figure 4. Note that the coefficients of wavelet approximation around the two edges do not provide useful information for signal classification, as they are affected by the edges of the signals when the wavelet transform is applied. Only the coefficients of wavelet approximation within the central portion are considered, as they are not affected by signal boundaries and have higher discrimination power than those around the edges. For example, at decomposition level 10, we obtain a set of three-dimensional features as the input of kernel PCA. Such low-dimensional features in the wavelet domain may not be sufficient to capture signal time variability. Therefore, we consider two additional cases, that is, level 9 and level 8 wavelet decompositions. We select 7 and 11 features, which correspond to the wavelet approximation coefficients that range from the 8th to the 14th and from the 10th to the 20th (within the central portion), respectively, for the level 9 and level 8 wavelet decompositions. We did not try a level smaller than 8, as it gives a very high-dimensional feature set and the selection of features becomes difficult for such small levels.
As the input signals are normalized before the wavelet analysis, we eliminate the differences in signal energy between groups. The high variability of the extracted features reflects a high signal variability in the original time domain. As we can see, the extracted features of normal signals fluctuate more than those of the epileptic ones. This is consistent with the clinical findings about the rhythms of epileptic signals, which fluctuate more regularly, that is, tend to be more deterministic. For all three cases that we considered, the WKPCA coupled with different kernels is applied to the wavelet approximation coefficients of the signals, and up to 20 principal components are extracted from WKPCA. The obtained results of classification accuracy, using different types of wavelet kernel and simple classifiers, are reported in Figure 5. As a high level of wavelet decomposition can only capture a very coarse version of the signal, level 10, which gives only three-dimensional features, is not enough to capture signal time variability among groups. Although signal features are extended in PC space, it is important to retain the discriminative features from the original signals. Our study suggests that a level slightly smaller than the maximum allowed level is necessary to balance the trade-off between the classification performance and the sparsity of the input feature vector. The results shown in Figure 5 also suggest that classification performance for this data set does not obviously depend on the choice of kernel.

Among all three cases considered, the best performance is obtained by using TI WKPCA with the FLD classifier, which confirms our findings on the improvement of linearity of features using multiscale wavelet kernels. Thus, a nonlinear classifier such as 1-NN may not necessarily outperform a linear classifier like FLD when WKPCA is used as a feature extraction method. This is because the linearity was improved by using the WKPCA, and the 1-NN classifier performs better for clustered features than for linear features. The classification accuracy is less affected by the feature dimension in PC space when the FLD classifier is used. However, TI WKPCA with the 1-NN classifier achieves a higher accuracy when low-dimensional features are used for classification. This may suggest that it is beneficial to use a classification scheme that combines the multiscale wavelet KPCA with a simple classifier, either linear or nonlinear. The considered example demonstrates the applicability of the proposed method to multiscale data, and the proposed method could serve as an alternative approach to nonlinear signal classification problems.

Conclusion and Discussion
This paper introduced a wavelet kernel PCA in order to better capture data similarity measures in the kernel matrix. Multiscale wavelet kernels were constructed from a given mother wavelet function to improve the performance of KPCA as the feature extraction method in multiscale data classification. Based on the analysis of the simulation data sets and the real data, we observed that the multiscale translation invariant wavelet kernel in KPCA has enhanced performance in feature extraction. The multiscale method for constructing a wavelet kernel in KPCA improves the robustness of KPCA in data classification, as it tends to smooth out the locally modulated behavior caused by some types of mother wavelet. The application to real data was demonstrated through an EEG classification problem, and the obtained results show the improvement of linearity after applying the multiscale WKPCA. Therefore, a simple linear classifier becomes suitable for classifying the extracted features. This work focused on two important aspects: the first was the construction of Mercer-type wavelet kernels for kernel PCA, and the second was the investigation of the applicability of the proposed method.
The multiscale wavelet kernels proposed for application in KPCA may also be useful for other kernel-based methods in pattern recognition, such as support vector regression, kernel discriminant analysis, kernel density estimation, or curve fitting. Many kernel-based statistical methods require the optimization of kernel parameters, which is usually computationally expensive for high-dimensional data. Because of this, the use of multiscale kernels is impractical, as the computational cost increases dramatically with the number of kernel parameters of a multiscale kernel. Multiscale wavelet kernels, instead, make it possible to narrow down the search for the values of kernel parameters. This is because a linear combination of a set of multiple kernel functions constructed from a mother wavelet function is considered in this approach. It aims at capturing the multiscale components of the data. However, since the multiscale wavelet kernels are nonparametric, kernel-based methods using the multiscale wavelet kernels may not lead to an optimal solution to the problem.

A. Proof of Theorem 1
Let $x_1, \ldots, x_n \in \mathbb{R}^d$ and $r_1, \ldots, r_n \in \mathbb{R}$. It is sufficient to prove that the kernel matrix $K$ is positive semidefinite. Since Therefore, the kernel defined in (7) is a Mercer kernel. This completes the proof.
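The positive semidefiniteness claimed in Theorem 1 can be checked with a standard sum-of-squares argument. The sketch below assumes that the kernel in (7) is a weighted sum, over scales $a_j$ and shifts $b_k$, of products of the dilated and translated mother wavelet; the shorthand $f_{jk}$ is introduced here for illustration only and is not notation from the paper.

```latex
% Assumed form of (7): k(x, y) = \sum_j a_j^{-1} \sum_k f_{jk}(x) f_{jk}(y),
% with the illustrative shorthand f_{jk}(x) = \prod_{i=1}^{d} \psi((x_i - b_k)/a_j).
\begin{align*}
\sum_{p=1}^{n} \sum_{q=1}^{n} r_p r_q\, k(x_p, x_q)
  &= \sum_{j=0}^{J-1} \frac{1}{a_j} \sum_{k=0}^{N}
     \sum_{p=1}^{n} \sum_{q=1}^{n} r_p r_q\, f_{jk}(x_p)\, f_{jk}(x_q) \\
  &= \sum_{j=0}^{J-1} \frac{1}{a_j} \sum_{k=0}^{N}
     \Bigl( \sum_{p=1}^{n} r_p\, f_{jk}(x_p) \Bigr)^{2} \;\ge\; 0,
  \qquad \text{since } a_j > 0 .
\end{align*}
```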

B. Proof of Theorem 2
Proof. By the Fourier condition theorem in [3], it is sufficient to prove that Before we prove this fact, we first introduce the complex Morlet wavelet transform for a given signal $s(t)$. It is generally depicted as follows [28]: Therefore, the TI wavelet kernel constructed using the Morlet mother wavelet function is a Mercer kernel. This completes the proof.
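The Fourier condition can also be checked directly in one dimension by a short computation. This is a sketch under the convention $F[k](\omega) = (2\pi)^{-1/2} \int k(u)\, e^{-i\omega u}\, du$, which is an assumption about the normalization used in [3]; the $d$-dimensional kernel is a product of such one-dimensional terms, so its Fourier transform is the product of the one-dimensional transforms.

```latex
\begin{align*}
k(u) &= \cos\!\Bigl(\frac{5u}{a}\Bigr) \exp\!\Bigl(-\frac{u^{2}}{2a^{2}}\Bigr),
  \qquad a > 0, \\
F[k](\omega) &= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} k(u)\, e^{-i\omega u}\, du
  = \frac{a}{2} \Bigl[ e^{-(a\omega - 5)^{2}/2} + e^{-(a\omega + 5)^{2}/2} \Bigr]
  \;\ge\; 0 .
\end{align*}
```

The result follows by writing the cosine as a sum of two complex exponentials, each of which shifts the Gaussian transform $a\, e^{-a^{2}\omega^{2}/2}$ by $\pm 5/a$.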
Figure 1: Classification accuracy rates of the different types of feature extraction methods: PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots), for clustered data. The green plots correspond to the mean value of the classification accuracy rates of STIWK, calculated over all values of the parameter a used in the STIWK. (Panel: Morlet kernel and heterogeneous clustered data.)
)-2(f). The results of the change in behavior of the sample standard deviation for the average classification accuracy rate are presented in Figures 3(a)-3(f). Data classifications using both the FLD classifier alone and feature extraction by PCA plus the FLD classifier have the worst performance for the simulated heterogeneous clustered data. Data classification using the multiscale WKPCA as the feature extraction method shows the best performance. Feature extraction using the multiscale WKPCA is less affected by the data variances than feature extraction by PCA and by the single-scale WKPCA.
Figure 2: Average classification accuracy rates when the FLD classifier is used alone (cyan plots) and when a feature extraction method plus the FLD classifier is used, for 100 Monte Carlo simulations. The feature extraction methods considered are PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots), for the data simulated with different values of σ x . (Panel: Mexican hat kernel and 20 features.)
Figure 3: Sample standard deviations of average classification accuracy rates when the FLD classifier is used alone (cyan plots) and when a feature extraction method plus the FLD classifier is used, for 100 Monte Carlo simulations. The feature extraction methods considered are PCA (blue plots), STIWK PCA (green plots), MTIWK PCA (red plots), and MDWK PCA (black plots), for the data simulated with different values of σ x . (Panel (f): Mexican hat kernel and 20 features; vertical axis: standard deviation.)

Figure 4: Plots of the coefficients of wavelet approximation of sample EEG signals at wavelet decomposition levels 10, 9, and 8. (Panels (e) Set A, at level 8, and (f) Set E, at level 8; horizontal axis: the ith wavelet coefficients.)
Figure 5: Classification accuracy with respect to different numbers of retained principal components under Gaussian and Mexican hat TI wavelet kernels, using the coefficients of wavelet approximation at various decomposition levels. The classifiers used are FLD and 1-NN. (Panel (f): Mexican hat kernel, at level 8.)