EEG-Based Epilepsy Recognition via Multiple Kernel Learning

In the field of brain-computer interfaces, EEG signals are widely used for disease diagnosis. In this study, a style regularized least squares support vector machine based on multikernel learning is proposed and applied to the recognition of abnormal epileptic EEG signals. The algorithm uses style transformation matrices to represent the style information contained in the samples and regularizes them in the objective function; the objective function is optimized by the commonly used alternating optimization method, which updates the style transformation matrices and the classifier parameters simultaneously during the iterations. To use the learned style information during prediction, two new rules are added to the traditional prediction method, and the style transformation matrix is used to standardize the sample style before classification.


Introduction
Since the proposal of the support vector machine (SVM) [1] and the development of related theories, the kernel method has become an effective way to deal with nonlinearly separable data. Because the performance of a classification algorithm depends largely on the representation of the data, the kernel method uses relatively simple function evaluations to map samples into a higher-dimensional space, avoiding the explicit design of the feature space and complex inner product calculations within it. For example, in [2], a fast kernel ridge regression was proposed by using the kernel method. In recent decades, the kernel method has been applied in many fields of machine learning [3][4][5].
However, some data sets contain unevenly distributed samples, heterogeneous features, or irregular data, on which single-kernel methods using only one feature space perform poorly. Moreover, since each kernel function has its own characteristics, even in the same application the effects of different kernel functions may differ greatly, which makes the selection of kernel functions and their parameters critically important to the performance of the algorithm. Since a single kernel function often cannot meet the requirements of practical application scenarios, multikernel learning, which combines multiple kernel functions, has attracted increasing attention [6].
The combination generated by multikernel learning can be a combination of the same kernel function under different parameters or a combination of many different kernel functions [7]. After years of research, compared with a single kernel function, multikernel learning shows stronger flexibility, higher interpretability, and better performance in data dimension reduction [8], text classification [9], domain adaptation [10], and other fields.
Although the multikernel learning algorithm fully combines the mapping abilities of different kernel functions, essentially it only uses physical characteristics of the samples such as similarity and distance and fails to take into account the implicit information of the stylized data sets encountered in real situations. In practical applications, in addition to the representative content information, a data set often contains various style information, and samples with the same style often occur in groups. For example, there are two ways of dividing the letters shown in Figure 1(a): by content, as shown in Figure 1(b), or by font, as shown in Figure 1(c), where each font is regarded as a style; such data are regarded as stylized data.
To mine the style information of data, scholars have carried out much research. The second-order statistical model proposed in [11] is applied to the problem of digit recognition, but it works well only on data subject to a Gaussian distribution, which greatly limits its application scenarios. The bilinear discriminant model proposed in [12] has achieved good results on behavior recognition data, but its computational cost is relatively high. The domain Bayesian algorithm proposed in [13] improves the naive Bayesian algorithm to identify the style information in a sample group, but it requires a clear data distribution type to be specified in advance, whereas the distribution of real data is often complex and difficult to determine beforehand. The algorithms proposed in [14, 15] use a single mapping to mine the style information of samples and achieve excellent results in regression and classification problems, but they make limited use of the physical characteristics of samples. The time-series style model mining sample historical information proposed in [16] and the bilayer clustering model of users' age and gender information proposed in [17] effectively exploit the style information of data in unsupervised problems, but these algorithms target only specific fields, and their use of style information is limited. Inspired by the above work, we propose the style regularized least squares support vector machine based on multiple kernel learning (SR-MKL-SVM) to exploit both the physical similarities between sample points and the implicit style information in samples.
In addition to using the physical characteristics of each basic kernel function to map the data and express the similarity between samples, the algorithm uses style transformation matrices to represent and mine the style information contained in the data set and incorporates them into the objective function. During training, an alternating optimization strategy updates the style transformation matrices in addition to the classifier parameters, and the mined style information is used to synchronously update the kernel matrix. To use the sample style information obtained by training during prediction, two new prediction rules are added on top of the prediction method of the traditional multikernel least squares support vector machine. Because the style information contained in the samples is used effectively in both training and prediction, experiments on most stylized data sets show that SR-MKL-SVM is more effective than recent and classical multikernel support vector machine algorithms.

Related Works
2.1. Multikernel Learning. Let x and z be two sample vectors, and let Φ be a mapping function from the input space to the feature space. If there is a function k(·,·) that can be defined as
$$k(x, z) = \langle \Phi(x), \Phi(z) \rangle,$$
then we call k(·,·) a kernel function. Multikernel learning expects to achieve better mapping performance by combining different kernel functions. There are many ways to combine kernel functions [6]. In this study, we construct a final combined kernel function from M basic kernel functions k_i(·,·). If μ_i denotes the coefficient of the i-th kernel function, the combined kernel function is formulated as
$$k(x, z) = \sum_{i=1}^{M} \mu_i k_i(x, z),$$
where
$$\mu_i \ge 0, \qquad \sum_{i=1}^{M} \mu_i = 1.$$
According to Mercer's theory, the combined kernel function generated by the above method still satisfies the Mercer condition.
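The weighted combination above can be sketched directly; the choice of RBF base kernels and their widths below is illustrative, not prescribed by the paper:

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    """One basic kernel k_i(.,.): here a Gaussian (RBF) kernel matrix."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def combined_kernel(X, Z, gammas, mu):
    """Combined kernel k(x, z) = sum_i mu_i * k_i(x, z) with mu_i >= 0 and
    sum_i mu_i = 1, so the result still satisfies the Mercer condition."""
    mu = np.asarray(mu, float)
    assert np.all(mu >= 0) and np.isclose(mu.sum(), 1.0)
    return sum(m * rbf_kernel(X, Z, g) for m, g in zip(mu, gammas))
```

Since each weight is nonnegative and a nonnegative combination of positive semidefinite matrices is positive semidefinite, the combined kernel matrix remains a valid kernel.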

2.2. Least Squares Support Vector Machine Based on Multikernel Learning. Let D = {(x_1, y_1), (x_2, y_2), ⋯, (x_n, y_n)} be the training sample set, where x_j ∈ R^d and y_j ∈ {+1, −1} is the label corresponding to x_j. The objective function of the least squares support vector machine (LSSVM) proposed by Suykens [18] can be formulated as
$$\min_{w, b, e} \; \frac{1}{2}\lVert w \rVert^2 + \frac{\lambda}{2} \sum_{j=1}^{n} e_j^2 \quad \text{s.t.} \quad w^T \Phi(x_j) + b = y_j - e_j, \; j = 1, 2, \cdots, n,$$
where Φ(x_j) represents x_j mapped into the high-dimensional space, w and b are the classification hyperplane parameters, e_j (j = 1, 2, ⋯, n) is the error term, and λ is the regularization parameter. Introducing the Lagrange multipliers α into Equation (4), the dual form can be obtained via the Slater constraint qualification as the linear system
$$\begin{bmatrix} 0 & \mathbf{1}^T \\ \mathbf{1} & \tilde{K} \end{bmatrix} \begin{bmatrix} b \\ \alpha \end{bmatrix} = \begin{bmatrix} 0 \\ Y \end{bmatrix},$$
where K ∈ R^{n×n} is the kernel matrix, K̃ = K + I/λ, and Y = (y_1, y_2, ⋯, y_n)^T. By integrating K into (2) and (3), we obtain the multiple kernel least squares support vector machine (MK-LSSVM). The resulting kernel weight problem (8) is a semi-infinite linear program (SILP), which can be solved by many existing mature optimization toolkits. For an unseen sample x, MK-LSSVM predicts its label by
$$y(x) = \operatorname{sign}\left( \sum_{j=1}^{n} \alpha_j \sum_{i=1}^{M} \mu_i k_i(x_j, x) + b \right).$$
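The LSSVM dual with K̃ = K + I/λ is a single linear system, so fitting reduces to one solve; a minimal sketch, assuming a precomputed (possibly combined) kernel matrix:

```python
import numpy as np

def lssvm_fit(K, y, lam):
    """Solve the LSSVM dual linear system
        [[0, 1^T], [1, K~]] [b; alpha] = [0; Y],   K~ = K + I/lam,
    returning the bias b and the dual coefficients alpha."""
    n = len(y)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / lam
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_decision(K_new, alpha, b):
    """Decision values for new samples given their kernel rows
    K_new[s, j] = k(x_j, x_s) against the training set."""
    return K_new @ alpha + b
```

The first row of the system enforces the equality constraint sum_j alpha_j = 0, and the remaining rows enforce K̃ alpha + b·1 = Y; both can be checked directly on the returned solution.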

SR-MKL-SVM
3.1. Objective Function. Let D = {(x_k^j, y_k^j) : j = 1, 2, ⋯, N; k = 1, 2, ⋯, t_j} be a training set, where the set can be divided into N groups according to style. The samples in each group have the same style, the superscript j indexes the style group, and t_j is the number of samples in group j.
Under the above definition, the objective function of SR-MKL-SVM can be formulated as follows, where μ_i (i = 1, 2, ⋯, M) is the weight coefficient of the i-th kernel matrix, M is the number of predefined kernel matrices, {A_j ∈ R^{d×d}} is the style transformation matrix for the samples of style j, and I ∈ R^{d×d} is the identity matrix. The first two subformulas in J_{SR-MKL-SVM} are standard MK-LSSVM expressions, and the third subformula is a penalty term using the Frobenius norm, weighted by the parameter γ ∈ R (γ > 0), which controls the degree of style transformation applied to the samples. Obviously, the larger γ is, the smaller the deviation of the style-transformed sample A_j^T Φ(x_k^j) from its original style, and vice versa; in particular, when γ → +∞, the style transformation is suppressed entirely and SR-MKL-SVM reduces to the standard MK-LSSVM.
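Based on the terms just described, the objective function can be written out explicitly; the following is a reconstruction from the description (the exact constants and constraint form may differ from the original):

$$J_{SR\text{-}MKL\text{-}SVM} = \min_{w,\, b,\, e,\, \{A_j\}} \; \frac{1}{2}\lVert w \rVert^2 + \frac{\lambda}{2} \sum_{j=1}^{N} \sum_{k=1}^{t_j} \left( e_k^j \right)^2 + \frac{\gamma}{2} \sum_{j=1}^{N} \lVert A_j - I \rVert_F^2$$
$$\text{s.t.} \quad w^T A_j^T \Phi(x_k^j) + b = y_k^j - e_k^j, \quad j = 1, \cdots, N, \; k = 1, \cdots, t_j,$$

where Φ is the combined feature map. The first two terms are the MK-LSSVM objective over the style-transformed samples, and the Frobenius penalty keeps each A_j close to the identity, matching the limiting behavior at γ → +∞.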

Computational and Mathematical Methods in Medicine

3.2. Optimization. To optimize the objective function, we can use the alternating optimization method to obtain a sufficiently good local optimal solution. When A_j and {w_i, b, μ_i} are fixed in turn, the objective function becomes an optimization problem in {w_i, b, μ_i} and in A_j, respectively, and the two processes are repeated until convergence or until the maximum number of iterations is exceeded. Specifically:

(1) When fixing A_j (j = 1, 2, ⋯, N), the optimization problem of formula (10) is transformed into a standard MK-LSSVM problem over the style-transformed samples A_jΦ_i(x_k^j), and {μ_i, w_i, b} can be determined by Algorithm 1 in Section 2.2. The style-transformed sample A_jΦ_i(x_k^j), which lives in the high-dimensional space, cannot be computed directly, but the synthetic kernel matrix formed by the style-transformed samples can be updated via the kernel trick; the specific method is introduced in Section 3.3.

(2) When {μ_i, w_i, b} is fixed, the optimization problem of Equation (10) is transformed into a linearly constrained quadratic programming problem in A_j, which can be decomposed into N independent problems, one for each A_j. Since the synthetic kernel matrix and the classifier parameters are fixed at this point, similarly to the original LSSVM, the dual form can be obtained by introducing Lagrange multipliers into Equation (12). Setting ∂L/∂A_j = 0 yields the update formula (14) for A_j, and setting ∂L/∂e_k^j = 0 gives α_k^j = λe_k^j. It can be seen that this problem satisfies the same KKT conditions [18] as LSSVM.
From the alternating optimization process, it can be seen that the samples transformed by the style transformation matrices are used as the training data when training the classifier parameters. In the first iteration, the style transformation matrices are initialized to the identity matrix; the style-transformed samples are then identical to the original samples, and no style transformation takes place. Therefore, the classifier parameters obtained in the first round of SR-MKL-SVM training are the same as those of the original MK-LSSVM. In subsequent iterations, as the style transformation matrices are optimized, the samples in each style group undergo the corresponding transformation and gradually approach the standard style, and the classifier parameters trained at this stage fully account for the style information of the samples as a whole. At the same time, solving the style transformation matrices from Equation (14) uses not only the physical characteristics of the samples obtained by training but also the style information in the data, so the trained style transformation matrices contain the style information of each style group. According to the above analysis, the training of the classifier parameters and of the style transformation matrices both make full use of the style information contained in the samples, and the two processes promote each other.
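The alternating scheme described above can be sketched generically; the two solver callbacks below are placeholders for the SILP/linear-system step and the per-style quadratic programs, so this is a toy sketch of the control flow rather than the paper's implementation:

```python
import numpy as np

def alternating_optimize(fit_classifier, fit_styles, A_init,
                         max_iter=50, tol=1e-6):
    """Skeleton of the alternating scheme: step (1) fixes the style
    matrices {A_j} and refits the classifier; step (2) fixes the
    classifier and updates each A_j; stop when the objective stalls
    or after max_iter rounds."""
    A = [a.copy() for a in A_init]   # first round with A_j = I reproduces MK-LSSVM
    params, prev = None, np.inf
    for _ in range(max_iter):
        params, obj = fit_classifier(A)   # e.g. SILP + linear system
        A = fit_styles(params)            # e.g. N independent QPs
        if abs(prev - obj) < tol:
            break
        prev = obj
    return params, A
```

Any pair of callbacks that each decrease the shared objective drives the loop toward a local optimum, which is exactly the convergence argument used for the two subproblems above.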

3.3. Style Transformation. Since the dimension of a sample mapped into the high-dimensional space may be infinite, the style-transformed sample A_jΦ(x_k^j) cannot be computed directly. Instead, each element of the synthetic kernel matrix can be updated with the help of the kernel trick to obtain the style-transformed synthetic kernel matrix.
Because the combined kernel function still satisfies the Mercer condition, we have
$$k(x_{k_1}^{j_1}, x_{k_2}^{j_2}) = \langle \Phi(x_{k_1}^{j_1}), \Phi(x_{k_2}^{j_2}) \rangle = \sum_{i=1}^{M} \mu_i k_i(x_{k_1}^{j_1}, x_{k_2}^{j_2}),$$
where Φ is the combined feature map of the synthetic kernel matrix given by formula (9). Replacing each Φ(x_k^j) with its style-transformed counterpart A_jΦ(x_k^j), the elements of the style-transformed synthetic kernel matrix become
$$\hat{k}(x_{k_1}^{j_1}, x_{k_2}^{j_2}) = \langle A_{j_1}\Phi(x_{k_1}^{j_1}), A_{j_2}\Phi(x_{k_2}^{j_2}) \rangle,$$
where k̂(·,·) denotes a kernel matrix element after style transformation. Substituting w = ∑_{j=1}^{N} ∑_{k=1}^{t_j} α_k^j Φ(x_k^j) into formula (15), and using k(x_{k_1}^{j_1}, x_{k_2}^{j_2}) = ⟨Φ(x_{k_1}^{j_1}), Φ(x_{k_2}^{j_2})⟩, formula (16) can be rewritten so that every term is expressed through kernel evaluations alone.

3.4. Algorithm. The training algorithm of SR-MKL-SVM is listed below (Algorithm 2). SR-MKL-SVM uses the alternating optimization method, which can be divided into two steps. The first step, optimizing the kernel weight coefficients and the classifier parameters, consists of two subprocesses: solving the SILP problem for the kernel weights and solving the linear system for the classifier parameters on the synthetic kernel matrix, with time complexities O(M²n²) and O(n), respectively. Since M ≥ 1, the total time complexity of this step can be treated as O(M²n²). The second step optimizes the style transformation matrices, with time complexity O(N²n²). Therefore, the total time complexity of the training process is O(iter · (M²n³ + N²n²)), where M is the number of predefined basic kernel matrices, n is the total number of samples, N is the number of styles in the data set, and iter is the number of iterations.
Compared with a typical MKL-SVM, SR-MKL-SVM additionally regularizes the style of the samples via the style transformation matrices during training. However, when solving for the weight coefficients of the basic kernel functions, a typical multikernel SVM must repeatedly invoke the original SVM solver, whose training is essentially a quadratic programming problem, whereas the algorithm in this paper invokes the original LSSVM, whose training reduces to solving a linear system. Therefore, the computational cost of SR-MKL-SVM in this step is far less than that of the typical MKL-SVM algorithm. The proposed algorithm optimizes the weight coefficients by solving SILP problems, which is superior to multikernel SVM algorithms that optimize the weight coefficients by solving SDP or QCQP problems and comparable to those that also use SILP-type formulations. Therefore, SR-MKL-SVM has a complexity of the same order as typical multikernel support vector machine algorithms.
Algorithm 2: Training procedure of SR-MKL-SVM.
1. Initialize A_j = I (j = 1, 2, ⋯, N) and compute the synthetic kernel matrix
2. Repeat:
3. Solve the SILP problem to obtain the kernel weights μ_i
4. Use (6) to compute {α, b}
5. Calculate the value of the objective function V_iter
6. Update A_j by (14) and the synthetic kernel matrix by (11)
7. Until convergence or the iteration count reaches its maximum

3.5. Prediction. After training, we obtain the classifier parameters together with the style transformation matrices A_j (j = 1, 2, ⋯, N). Since, in practical applications, the style of a test sample may or may not have appeared during training, two new prediction rules, Rule 2 and Rule 3, are added to the traditional prediction method to handle these two cases, respectively. Let X_0 = {x_0^1, x_0^2, ⋯, x_0^{t_0}} be a subset of the entire testing data set in which every element has the same style, where x_0^k ∈ R^d (k = 1, 2, ⋯, t_0) is a sample.

Rule 1. Traditional prediction method.
Traditional prediction methods use only the weights μ_i and the classifier parameters w and b to predict a sample x_0^k in the testing data set and obtain the corresponding label y_0^k:
$$y_0^k = \operatorname{sign}\left( \sum_{j=1}^{N} \sum_{k'=1}^{t_j} \alpha_{k'}^{j} \sum_{i=1}^{M} \mu_i k_i(x_{k'}^{j}, x_0^k) + b \right).$$

Rule 2. Test sample style is known.
If the style of the test sample already exists in the training data set, the corresponding style transformation matrix A_0 acquired during training can be used directly to apply the style transformation to the sample, bringing it close to the standard style. The predicted label y_0^k is then obtained by applying the traditional prediction rule to the processed sample A_0^T Φ(x_0^k), where the required kernel evaluations k_i(x_k^j, x_0^k) after style transformation can be obtained as in Section 3.3.

Rule 3. Test sample style is unknown.
If the style of the sample group X_0 does not exist in the training data set, then, to make effective use of the style information obtained by training, we follow the direct extrapolation idea and treat the information contained in the same-style sample group as a new style. The detailed steps are as follows:

Step 1. Obtain the temporary labels Y_temp = {y_0^1, y_0^2, ⋯, y_0^{t_0}} of the testing data set X_0 by using Rule 1.
Step 2. Train X_0 with its temporary labels Y_temp together with the training data set to obtain the new weights μ̂_i, classifier parameters {ŵ, b̂}, and style transformation matrix A_0.
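The two steps (plus the implied final re-prediction with the refreshed model) form a small self-training wrapper; the learner interfaces below are hypothetical placeholders, not the paper's solvers:

```python
import numpy as np

def rule3_predict(rule1_predict, retrain, X_train, y_train, X_group):
    """Direct-extrapolation sketch for a group with an unseen style:
    Step 1: temporary labels for the group via the traditional rule;
    Step 2: retrain on the training data plus the temporarily labeled
            group (which, in SR-MKL-SVM, also fits a style matrix A_0);
    then the refreshed model labels the group (assumed final step)."""
    y_temp = rule1_predict(X_group)                       # Step 1
    model = retrain(np.vstack([X_train, X_group]),
                    np.concatenate([y_train, y_temp]))    # Step 2
    return model(X_group)
```

Any classifier exposing a fit-then-predict interface can stand in for the placeholders, which is what makes the rule generic.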
Since most data in real scenes contain implicit or explicit style characteristics, the new prediction rules added in SR-MKL-SVM cover both the known-style and the unknown-style cases. For samples with known styles, the corresponding style information is used directly during prediction; for samples with unknown styles, the direct extrapolation method is applied. In both cases, the trained style information is used effectively, so the algorithm has good generality.
3.6. Analysis of SR-MKL-SVM. Different from SVM, which searches for the optimal classification hyperplane only according to the physical distribution of the original data, SR-MKL-SVM not only considers the physical characteristics contained in the data but also mines its style characteristics. In this paper, the whole set of training samples is used to optimize the classifier parameters, and the data sets with different styles are processed separately. With the advantage of multikernel learning for data mapping, the proposed algorithm can represent and process data containing more complex styles and make full use of the trained style information to perform style regularization on the original samples in both the training and testing stages, so that the data distribution after style transformation becomes easier to separate. Comparing traditional SVM with SR-MKL-SVM, we find that SR-MKL-SVM can make full use of the information contained in stylized data to improve classification performance.

Experiments

4.1. Data.
In this section, we use the EEG data provided by Bonn University to evaluate the proposed method. The EEG data set consists of 5 groups of samples belonging to 2 classes, with detailed information shown in Table 1 and randomly selected samples from each group shown in Figure 2. As can be seen from Figure 2, the fluctuations of samples from different groups differ greatly. For example, the signal fluctuations of the healthy people in group A and the patients in group E are significantly different, and the signal fluctuations of groups C and E also differ greatly under different recording conditions. Studies [19] showed that extracting features from the original EEG data in advance can effectively improve classification performance. In this paper, kernel principal component analysis (KPCA) [5, 20] is used to extract features from the original data, and the dimension-reduced data are used in the experiments of this section. After preprocessing, the data set contains 500 samples of 2 categories with sample dimension 70. Samples from the same group are considered to have the same style.
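The KPCA preprocessing step can be sketched directly in NumPy; the RBF kernel and its width below are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def kpca(X, n_components, gamma):
    """Kernel PCA with an RBF kernel: double-center the kernel matrix
    in feature space, then project onto the leading eigenvectors."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)
    n = len(X)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one   # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)
    order = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[order], vecs[:, order]
    # embedded coordinates: eigenvector_i scaled by sqrt(eigenvalue_i)
    return vecs * np.sqrt(np.maximum(vals, 0.0))
```

Applying such a projection with 70 components would yield the reduced representation used in the experiments; the centering step guarantees the embedded coordinates have zero mean.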
In order to verify the validity of the algorithm, different groups of data are selected to form two types of data sets. In the first type, all styles contained in the test set also exist in the training set; in the second type, the test set contains a style not found in the training set. The details of the constructed data sets are shown in Table 2. All data were partitioned randomly, and 10 experiments were conducted under the same set of parameters, with the results averaged. Rule 2 and Rule 3 are used to predict the first and second types of data, respectively. The experimental results and parameters of all algorithms [21][22][23][24][25][26][27][28][29][30][31][32] are shown in Table 3.
From the experimental results in Table 3, it can be concluded that the decision tree (DT) algorithm achieves the best wave signal recognition on data set DS.1, and the NLMKL algorithm achieves the best classification accuracy on data set DS.2, ahead of all other algorithms including the proposed one. The results of the proposed algorithm on these two data sets are slightly worse than those of DT and NLMKL, but the differences are small.
From the above results, we can see the effectiveness and stability of the proposed algorithm in improving the accuracy of EEG signal recognition by mining and utilizing different fluctuation features contained in each group of samples.

Conclusion
In order to use the style information contained in the samples, this paper proposes a style regularized least squares support vector machine based on multikernel learning (SR-MKL-SVM). In addition to exploiting the advantage of multikernel learning in expressing the physical similarity between samples, the algorithm mines and uses the style information contained in the samples to improve classification accuracy. SR-MKL-SVM incorporates the style information into the objective function, uses the style transformation matrices to standardize the samples, limits the degree of style transformation with a regularization term, and optimizes both the classifier parameters and the style transformation matrices during training. In addition to the traditional prediction method, new prediction rules are added to handle test samples with known and unknown styles.

Data Availability
The original EEG data are available and can be downloaded from http://www.meb.unibonn.de/epileptologie/science/physik/eegdata.html.

Conflicts of Interest
The authors declare that they have no conflicts of interest.