Multi-Nyström Method Based on Multiple Kernel Learning for Large Scale Imbalanced Classification

Extensions of kernel methods for class imbalance problems have been extensively studied. Although they cope well with nonlinear problems, their high computation and memory costs severely limit their application to real-world imbalanced tasks. The Nyström method is an effective technique for scaling kernel methods. However, the standard Nyström method needs to sample a sufficiently large number of landmark points to ensure an accurate approximation, which seriously affects its efficiency. In this study, we propose a multi-Nyström method based on mixtures of Nyström approximations to avoid the explosion of the subkernel matrix, while the optimization of the mixture weights is embedded into the model training process by multiple kernel learning (MKL) algorithms to yield a more accurate low-rank approximation. Moreover, we select subsets of landmark points according to the imbalance distribution to reduce the model's sensitivity to skewness. We also provide a kernel stability analysis of our method and show that the model solution error is bounded by weighted approximation errors, which can help us improve the learning process. Extensive experiments on several large scale datasets show that our method can achieve higher classification accuracy and a dramatic speedup of MKL algorithms.


Introduction
Real-world problems in computer vision [1], natural language processing [2,3], and data mining [4,5] present imbalanced traits in their data, which may arise from the inherent properties of the data or from external factors such as sampling bias or measurement error. Unfortunately, most traditional learning algorithms are designed for balanced data and target the overall classification accuracy, so the minority class is overwhelmed by the majority class. However, the minority class in these real-world problems is usually more important and expensive than the majority class.
In the past few decades, many algorithms have been proposed to solve class imbalance problems [6][7][8]. The data-level methods artificially balance the skewed class distributions by data sampling [9,10]. The algorithm-level methods raise the importance of minority instances by modifying existing learners [11,12]. However, these real-world imbalanced data usually contain complex nonlinear structures. In this case, extensions of kernel methods for class imbalance problems have proven very effective [13][14][15]. In [16], Mathew et al. overcome the limitations of the synthetic minority oversampling technique (SMOTE) for nonlinear problems by oversampling in the feature space of the support vector machine. In [17], a kernel boundary alignment algorithm is proposed to adjust the class boundary by modifying the kernel matrix according to the imbalanced data distribution. The kernel-based adaptive synthetic data generation (KernelADASYN) method for imbalanced learning is proposed in [18]; it uses kernel density estimation (KDE) to estimate the adaptive oversampling density. However, with the development of data storage and data acquisition equipment, the scale of data continues to grow. The existing kernel-based class imbalanced learning (kernel CIL) methods face a serious challenge: the cost of calculating and storing a vast kernel matrix is very high.
A general technique for making kernel methods scalable is kernel approximation, of which the Nyström method is the most popular [19]. The Nyström method constructs a low-rank approximation of the original kernel matrix from a subset of $l \ll n$ landmark points, where $n$ is the data size. Computationally, it only needs to decompose a smaller matrix (denoted as $W \in \mathbb{R}^{l \times l}$). However, according to the approximation error bound $O(n/\sqrt{l})$ for the Nyström method in [20], there is a trade-off between accuracy and efficiency. Sampling more landmark points improves the approximation accuracy but requires more computing resources; as the data size increases, the subkernel matrix $W$ expands rapidly, which seriously affects the efficiency of the Nyström method.
Some works study the efficacy of a variety of fixed and adaptive sampling schemes for the Nyström method. For example, Musco et al. presented a Nyström algorithm based on recursive leverage score sampling, which runs in time linear in the number of training points [21]. An ensemble Nyström method has been proposed to yield more accurate low-rank approximations by mixing Nyström approximations built from several randomly sampled subsets of landmark points [22]. However, the mixture weights of the ensemble Nyström method are defined according to the approximation error of each Nyström approximation, which may lead to worse-than-expected performance in practical classification or regression applications. Recently, a fast and accurate refined Nyström-based kernel classifier was proposed to improve the performance of Nyström-based kernel classifiers [23]. Although the Nyström method has been studied extensively, there is still a potentially large gap between the performance of a learner trained with the Nyström approximation and that of one trained with the original kernel.
In this study, we propose a novel method, multi-Nyström, for large scale imbalanced classification. We combine the multi-Nyström method with multiple kernel learning to learn an improved low-rank approximate kernel that is superior to any single Nyström approximation in the mixture, where each approximation is defined by a different kernel function and subset of landmark points. Moreover, unlike existing sampling schemes for the Nyström method, our method selects subsets of landmark points according to the imbalance distribution to deal with the problem of skewed data. Because it never computes or stores the full kernel matrix, our method can scale to large scale scenarios. The main contributions of this study are summarized as follows:
(1) We propose a multi-Nyström method to overcome the computational constraints of the Nyström method. Because our method is easily parallelized, it can generate more accurate approximations in large scale scenarios.
(2) We optimize the mixture weights according to the data and the problem at hand, so that the combined approximate kernel matrix yields better performance. Moreover, the low-rank approximation can significantly speed up existing MKL algorithms.
(3) We provide a stability analysis of our method, which shows the impact of the kernel approximation error on the model solution and helps determine the acceptable error in the approximation of the kernel matrix.
The rest of this study is organized as follows. Section 2 introduces some related concepts. Section 3 then describes the proposed multi-Nyström approximation algorithm in detail. Experimental results and analysis compared with other algorithms are presented in Section 4. Finally, Section 5 summarizes the full work.

Kernel Methods.
Kernel methods such as support vector machines (SVMs) have become one of the most popular technologies in machine learning [24]. They extend linear learners to nonlinear cases by introducing the kernel trick. Consider a binary-class dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathcal{X} \subseteq \mathbb{R}^s$ denotes an $s$-dimensional vector and $y_i \in \{+1, -1\}$ denotes its label. Define a nonlinear descriptor as
$$\phi: x \in \mathcal{X} \mapsto \phi(x) \in \mathcal{H}. \tag{1}$$
The input data are mapped to a high-dimensional or even infinite-dimensional feature space, and the inner product in the feature space is calculated implicitly through the kernel function defined in the input space:
$$K(x_i, x_j) = \langle \phi(x_i), \phi(x_j) \rangle_{\mathcal{H}}, \tag{2}$$
where $K: \mathbb{R}^s \times \mathbb{R}^s \mapsto \mathbb{R}$ is the kernel function that satisfies Mercer's theorem [25], and $\mathcal{H}$ is the corresponding reproducing kernel Hilbert space (RKHS). $K$ can simply be a classical kernel such as the radial basis function (RBF) kernel. Unfortunately, the kernel matrix $K \in \mathbb{R}^{n \times n}$ grows quadratically with the data scale. This poor scalability limits the applicability of kernel methods in large scale scenarios.
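To make the quadratic growth concrete, the following sketch builds an RBF kernel matrix with NumPy and estimates the memory a dense kernel matrix would need. The function names `rbf_kernel` and `full_kernel_bytes`, the bandwidth `gamma`, and the 8-byte float size are illustrative assumptions, not part of the original method.

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """RBF kernel K[i, j] = exp(-gamma * ||X[i] - Z[j]||^2); gamma is an assumed bandwidth."""
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))  # clip tiny negative round-off

def full_kernel_bytes(n, itemsize=8):
    """Bytes needed to store a dense n x n kernel matrix of `itemsize`-byte floats."""
    return n * n * itemsize

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))            # synthetic data, for illustration only
K = rbf_kernel(X, X, gamma=0.5)
print(K.shape)                           # (200, 200)
print(full_kernel_bytes(100_000) / 1e9)  # 80.0, i.e. about 80 GB for n = 100,000
```

Even at a moderate $n = 100{,}000$ the dense matrix no longer fits in the memory of a typical workstation, which is the motivation for the low-rank approximations discussed next.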

Multiple Kernel Learning.
Because different kernels correspond to different notions of similarity or use features from different views, MKL can obtain more complete representations of the input data by combining multiple kernels. In MKL, each instance $(x_i, y_i)$ is mapped into different feature spaces by a series of descriptors [26]:
$$\phi(x_i) = \left[\sqrt{d_1}\,\phi_1(x_i^1), \dots, \sqrt{d_M}\,\phi_M(x_i^M)\right], \tag{3}$$
where $x_i^m$ represents the feature from the $m$th view of instance $x_i$, $d_m \ge 0$, $m = 1, \dots, M$, is the corresponding weight, and $M$ is the total number of predefined kernels. Then, substitute any dot product term with kernels:
$$K(x_i, x_j) = \sum_{m=1}^{M} d_m K_m(x_i^m, x_j^m), \tag{4}$$
where each base kernel function $K_m(\cdot, \cdot): \mathbb{R}^s \times \mathbb{R}^s \to \mathbb{R}$ is a positive definite kernel associated with an RKHS $\mathcal{H}_m$. The purpose of MKL is to learn a resulting discriminant function of the form $f(x) = \sum_{m=1}^{M} f_m(x) + b$ with $f_m \in \mathcal{H}_m$. Based on the aforementioned definition, the seminal work in MKL proposes the following structural risk minimization framework as the MKL primal problem with kernel weights on a simplex [27]:
$$\min_{\{f_m\}, b, \xi, d} \; \frac{1}{2}\sum_{m=1}^{M}\frac{1}{d_m}\|f_m\|_{\mathcal{H}_m}^2 + C\sum_{i=1}^{n}\xi_i \quad \text{s.t.} \quad y_i\Big(\sum_{m=1}^{M} f_m(x_i) + b\Big) \ge 1 - \xi_i,\; \xi_i \ge 0,\; \sum_{m=1}^{M} d_m = 1,\; d_m \ge 0, \tag{5}$$
where $C$ is the regularization parameter of the error term and $\xi$ is the slack variable. The L1-norm constraint on the weight vector $d$ enforces a sparse kernel combination. We assume $\|f_m\|_{\mathcal{H}_m}^2 = 0$ whenever $d_m = 0$ in order to reach a finite objective. That implies that if the weight of a certain kernel reaches $d_m = 0$, the optimization of $f_m$ can stop, since the solution $f_m = 0$ is known [28].
Computational Intelligence and Neuroscience
Although MKL is an ideal candidate for combining multiview data, scalability is a key issue for MKL: (1) the computation and memory costs for maintaining several kernel matrices are heavy and (2) the computational efficiency of MKL solvers is not high.

Standard Nyström
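Let $L = \{c_1, \dots, c_l\}$ denote a set of $l$ landmark points randomly selected from $D$ uniformly without replacement, let $C \in \mathbb{R}^{n \times l}$ denote the subkernel matrix between all instances and the landmark points, and let $W \in \mathbb{R}^{l \times l}$ be the symmetric positive semidefinite (SPSD) subkernel matrix among the points in $L$. Then, the Nyström method uses $W$ and $C$ to generate a rank-$k$ approximation $\tilde{K}_k$ of the kernel matrix $K$ for $k \le l$ [20]:
$$\tilde{K}_k = C W_k^{+} C^{T}, \tag{6}$$
where $W_k \in \mathbb{R}^{l \times l}$ is the best rank-$k$ approximation to $W$ with respect to the Frobenius norm, that is, $W_k = \arg\min_{\operatorname{rank}(V) = k} \|W - V\|_F$, and $W_k^{+}$ denotes the pseudoinverse of $W_k$. Given the matrix $W_k$, the feature of each instance $x_i$ can be evaluated as
$$z_i = (W_k^{+})^{1/2}\big(K(x_i, c_1), \dots, K(x_i, c_l)\big)^{T}, \tag{7}$$
so that $z_i^{T} z_j$ is the $(i, j)$ entry of $\tilde{K}_k$. Calculate the singular value decomposition (SVD) of $W$ as $W = U \Lambda U^{T}$, where $U$ is orthonormal and $\Lambda$ is the diagonal matrix of singular values sorted in nonincreasing order. Then, the final approximate decomposition of $K$ takes the following form:
$$\tilde{K}_k = \tilde{U}_k \tilde{\Lambda}_k \tilde{U}_k^{T}, \qquad \tilde{U}_k = \sqrt{\tfrac{l}{n}}\, C U_k \Lambda_k^{+}, \qquad \tilde{\Lambda}_k = \tfrac{n}{l}\, \Lambda_k, \tag{8}$$
where $\Lambda_k \in \mathbb{R}^{k \times k}$ is the diagonal matrix formed by the top $k$ singular values of $\Lambda$, and $U_k \in \mathbb{R}^{l \times k}$ is formed by the associated singular vectors. The total time complexity of the Nyström method is $O(l^3 + nlk)$ [29]. For $l \ll n$, it is much lower than the $O(n^3)$ complexity of an SVD on $K$.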
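The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the RBF base kernel, the bandwidth `gamma`, the relative threshold `1e-10` used for the pseudoinverse, and the helper names `rbf_kernel` and `nystrom_features` are all hypothetical choices. Because $W$ is SPSD, its SVD coincides with its eigendecomposition, so `np.linalg.eigh` is used.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, l, k, gamma=0.5, seed=0):
    """Return Z (n x k') with Z @ Z.T = C W_k^+ C^T, plus the landmark indices."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(X.shape[0], size=l, replace=False)  # uniform sampling without replacement
    C = rbf_kernel(X, X[idx], gamma)                     # n x l block between data and landmarks
    W = C[idx]                                           # l x l block among the landmarks
    lam, U = np.linalg.eigh(W)                           # eigenvalues in ascending order
    top = np.argsort(lam)[::-1][:k]                      # indices of the k largest eigenvalues
    lam_k, U_k = lam[top], U[:, top]
    keep = lam_k > 1e-10 * lam_k[0]                      # pseudoinverse: drop near-zero values
    Z = C @ (U_k[:, keep] / np.sqrt(lam_k[keep]))        # n x k' feature matrix
    return Z, idx

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))                            # synthetic data, for illustration only
K = rbf_kernel(X, X, gamma=0.5)
Z, idx = nystrom_features(X, l=60, k=30)
rel_err = np.linalg.norm(K - Z @ Z.T) / np.linalg.norm(K)
print(Z.shape, rel_err)
```

The full matrix `K` is formed here only to measure the approximation error; the approximation itself touches just the $n \times l$ block `C`, which is where the $O(l^3 + nlk)$ cost comes from.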

Proposed Algorithms
When there are irregularities in the imbalanced data (such as small disjuncts, overlapping, and noise [30]) and the data scale is large, applying a single kernel may make the model biased, skewed, or misleading. Inspired by the MKL algorithm [31], we construct a low-rank approximate multiple kernel framework as follows:
$$\tilde{K} = \sum_{m=1}^{M} d_m \tilde{K}_{m,k}, \tag{9}$$
where $\tilde{K}_{m,k}$ corresponds to the rank-$k$ approximation of each base kernel matrix $K_m$, and $d_m$ is the corresponding mixture weight. As for the Nyström method, a key aspect is the sampling scheme [32]. To reduce the sensitivity to skewness in the data, we adopt stratified undersampling of the majority class to select $M$ subsets of landmark points, written as $L = \{L_m\}_{m=1}^{M}$ with each $L_m = \{c_{m,1}, \dots, c_{m,l}\}$. The subkernel matrix between all instances and the landmark points can be expressed as
$$(C_m)_{ij} = K_m(x_i, c_{m,j}), \quad i = 1, \dots, n, \; j = 1, \dots, l, \tag{10}$$
where $C_m \in \mathbb{R}^{n \times l}$. Then, we perform the standard Nyström method on each $C_m$ independently to get a rank-$k$ approximation $\tilde{K}_{m,k} = C_m W_{m,k}^{+} C_m^{T}$ of each base kernel matrix $K_m$. Finally, by linearly combining these approximations, we get the general form of the approximate multiple kernel $\tilde{K}$:
$$\tilde{K} = \sum_{m=1}^{M} d_m C_m W_{m,k}^{+} C_m^{T}. \tag{11}$$
Given the mixture weights $d_m$, the feature of each instance $x_i$ can be evaluated as
$$z_i = \left[\sqrt{d_1}\,(W_{1,k}^{+})^{1/2} c_1(x_i); \; \dots; \; \sqrt{d_M}\,(W_{M,k}^{+})^{1/2} c_M(x_i)\right], \quad c_m(x_i) = \big(K_m(x_i, c_{m,1}), \dots, K_m(x_i, c_{m,l})\big)^{T}. \tag{12}$$
Similarly, for the convenience of subsequent calculations, formula (11) can be rewritten as
$$\tilde{K} = \sum_{m=1}^{M} d_m \tilde{U}_{m,k} \tilde{\Lambda}_{m,k} \tilde{U}_{m,k}^{T}, \tag{13}$$
where $\tilde{U}_{m,k} \in \mathbb{R}^{n \times k}$ and $\tilde{\Lambda}_{m,k} \in \mathbb{R}^{k \times k}$ denote the approximate decomposition of $K_m$ obtained by (8). Figure 1 shows the proposed multi-Nyström method, including the optimization of the mixture weights, which is detailed further in the next subsection. When the mixture weights $d_m$ are fixed or known, the total time complexity of the multi-Nyström method is $O(Ml^3 + Mnlk)$. Although our method requires $M$ times more CPU resources than the standard Nyström method, $M \ll n$ is typically $O(1)$ for large scale data, and our method can be computed in parallel in a distributed computing environment.
Moreover, replacing the SVD of one large subkernel matrix $W$ with SVDs of $M$ much smaller matrices also accelerates the computation.
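The construction above can be sketched as follows. This is one plausible reading, not the authors' code: it assumes that the minority class is labeled `+1`, that each landmark subset combines `l // 2` points drawn from a disjoint chunk of the shuffled majority class with minority points to form a balanced set, and that the base kernels are RBF kernels with different bandwidths `gammas`. The helper names are hypothetical.

```python
import numpy as np

def rbf_kernel(X, Z, gamma):
    sq = (X ** 2).sum(axis=1)[:, None] + (Z ** 2).sum(axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def balanced_landmarks(y, l, M, rng, minority_label=1):
    """Assumed scheme: M balanced landmark sets, each using its own majority-class chunk."""
    minority = np.flatnonzero(y == minority_label)
    majority = rng.permutation(np.flatnonzero(y != minority_label))
    sets = []
    for chunk in np.array_split(majority, M):            # disjoint strata of the majority class
        maj = rng.choice(chunk, size=min(l // 2, chunk.size), replace=False)
        mino = rng.choice(minority, size=min(l - maj.size, minority.size), replace=False)
        sets.append(np.concatenate([mino, maj]))
    return sets

def nystrom_block(X, idx, k, gamma):
    """Features Z_m with Z_m @ Z_m.T = C_m W_{m,k}^+ C_m^T for one landmark set."""
    C = rbf_kernel(X, X[idx], gamma)
    lam, U = np.linalg.eigh(C[idx])
    top = np.argsort(lam)[::-1][:k]
    lam_k, U_k = lam[top], U[:, top]
    keep = lam_k > 1e-10 * lam_k[0]
    return C @ (U_k[:, keep] / np.sqrt(lam_k[keep]))

def multi_nystrom(X, y, l, k, gammas, d, seed=0):
    rng = np.random.default_rng(seed)
    sets = balanced_landmarks(y, l, len(gammas), rng)
    Z_list = [nystrom_block(X, idx, k, g) for idx, g in zip(sets, gammas)]
    # Stacked features: Phi @ Phi.T equals sum_m d_m * Z_m @ Z_m.T
    Phi = np.hstack([np.sqrt(dm) * Zm for dm, Zm in zip(d, Z_list)])
    return Phi, Z_list, sets

rng = np.random.default_rng(0)
X = rng.normal(size=(220, 4))                            # synthetic data, for illustration only
y = np.array([1] * 20 + [-1] * 200)                      # 20 minority, 200 majority instances
d = [0.2, 0.3, 0.5]                                      # fixed mixture weights for this sketch
Phi, Z_list, sets = multi_nystrom(X, y, l=20, k=10, gammas=[0.1, 0.5, 1.0], d=d)
print(Phi.shape)
```

Each call to `nystrom_block` is independent of the others, so the $M$ blocks could be computed in parallel, and a linear classifier trained on `Phi` corresponds to a kernel classifier with the combined approximate kernel.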

Optimization to Mixture Weights.
The purpose of MKL is to learn an optimal convex combination of a series of kernels during training. Based on the aforementioned definition, we propose an approximate multiple kernel learning framework for large scale imbalanced classification by modifying the original MKL framework in [26]: $\min_d J(d)$ subject to a norm constraint on the weight vector $d$ (14), where $J(d)$ is the optimal value of an SVM-type dual problem in $\alpha$ with the approximate kernel $\tilde{K}$ (15), $\alpha$ is the vector of Lagrange multipliers, and $Y = \operatorname{diag}(y_1, \dots, y_n)$. To avoid numerical instability caused by ill-conditioning [19], we substitute $\tilde{K}_{m,k} \leftarrow \tilde{K}_{m,k} + \sigma I$, where $\sigma$ is a small positive constant called the jitter factor. Moreover, to calculate the inverse $\tilde{K}^{-1}$ of the approximate matrix without storing the complete $n \times n$ matrix $\tilde{K}$, we iteratively build a sequence of matrices $T_0, \dots, T_M$, where each $T_m^{-1}$ is calculated using the SMW formula (Lemma 1) from the previous result $T_{m-1}^{-1}$. After performing this series of $M + 1$ operations, we obtain $\tilde{K}^{-1}$.

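Lemma 1 (see [33]). Let $A$ and $C$ both be invertible; then, the Sherman-Morrison-Woodbury (SMW) formula gives an explicit formula for the inverse of the matrix $A + UCV$:
$$(A + UCV)^{-1} = A^{-1} - A^{-1}U\left(C^{-1} + VA^{-1}U\right)^{-1}VA^{-1}.$$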
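As a numerical illustration of the lemma, the sketch below solves a linear system with a jittered low-rank matrix $\sigma I + \Phi\Phi^{T}$ without ever forming an $n \times n$ matrix. It is a simplification under stated assumptions: it applies the SMW formula once to a stacked factor `Phi` (with $A = \sigma I$, $U = \Phi$, $C = I$, $V = \Phi^{T}$) instead of the $M$-step recursion described in the text, and the name `smw_solve` is hypothetical.

```python
import numpy as np

def smw_solve(Phi, sigma, v):
    """Solve (sigma * I + Phi @ Phi.T) x = v using only an r x r system, r = Phi.shape[1]."""
    r = Phi.shape[1]
    small = sigma * np.eye(r) + Phi.T @ Phi              # r x r instead of n x n
    return (v - Phi @ np.linalg.solve(small, Phi.T @ v)) / sigma

rng = np.random.default_rng(0)
n, r, sigma = 500, 20, 1e-2
Phi = rng.normal(size=(n, r))                            # stands in for the stacked low-rank factors
v = rng.normal(size=n)

x_fast = smw_solve(Phi, sigma, v)
# Reference solution; it forms the dense n x n matrix and is shown only for checking.
x_dense = np.linalg.solve(sigma * np.eye(n) + Phi @ Phi.T, v)
print(np.allclose(x_fast, x_dense))
```

The fast path costs $O(nr^2 + r^3)$ time and $O(nr)$ memory, compared with $O(n^3)$ time and $O(n^2)$ memory for the dense solve.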
We can see that when the mixture weights are known, formula (15) is the same as the dual problem of the SVM. Hence, $J(d)$ can be evaluated at $\alpha^*$, the optimal solution of (15). With $\alpha^*$ considered a constant in $J$, $J$ can be regarded as a function of $d$, and we calculate the gradient of the objective $J$ with respect to $d_m$.
We use the reduced gradient method in [27] to deal with problem (14). First, to satisfy the L1-norm constraint on the weight vector $d$ in (14), we calculate the reduced gradient $\nabla_{\mathrm{red}} J$ of $J(d)$ following [27]. In general, MKL uses a two-step training method that requires frequent calls to support vector machine solvers, which is prohibitive for large scale problems. Therefore, after each update of $d$, we do not immediately call the support vector machine solver to update $\alpha^*$; instead, we continue to search for the maximum allowable step length along the current descent direction until the objective function value stops decreasing. Finally, we obtain the optimal step length by line search.
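The reduced gradient step can be sketched as follows. The formulas are assumptions taken from the standard reduced gradient scheme for simplex-constrained kernel weights, not from the equations of this paper: the gradient is taken as $\partial J/\partial d_m = -\tfrac{1}{2}\alpha^{*T} Y \tilde{K}_{m,k} Y \alpha^*$, the reference index $\mu$ is the largest weight, and the direction is built so that the weights stay non-negative and keep summing to one. All function names are hypothetical.

```python
import numpy as np

def mkl_gradient(Z_list, y, alpha):
    """Assumed gradient dJ/dd_m = -0.5 * a^T Y K_m Y a, with K_m = Z_m @ Z_m.T kept in factored form."""
    v = y * alpha
    return np.array([-0.5 * np.sum((Z.T @ v) ** 2) for Z in Z_list])

def reduced_direction(d, grad, tol=1e-12):
    """Descent direction on the simplex {d >= 0, sum(d) = 1}."""
    d = np.asarray(d, dtype=float)
    mu = int(np.argmax(d))                   # reference component: the largest weight
    red = grad - grad[mu]                    # reduced gradient for the components m != mu
    D = -red
    D[(d <= tol) & (red > 0)] = 0.0          # a zero weight must not be pushed negative
    D[mu] = 0.0
    D[mu] = -D.sum()                         # entries sum to zero, so sum(d) is preserved
    return D

def max_step(d, D):
    """Largest step that keeps every weight non-negative along direction D."""
    d = np.asarray(d, dtype=float)
    neg = D < 0
    return float(np.min(-d[neg] / D[neg])) if neg.any() else float("inf")

d = np.array([0.5, 0.5, 0.0])
grad = np.array([-3.0, -1.0, 0.5])           # made-up gradient values, for illustration only
D = reduced_direction(d, grad)
step = max_step(d, D)
d_new = d + step * D
print(D, step, d_new)
```

A line search over step lengths between zero and `max_step(d, D)` would then pick the update, which matches the idea of moving along one descent direction for as long as the objective keeps decreasing before calling the SVM solver again.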
The complete algorithm of the multi-Nyström method with MKL is summarized in Algorithm 1.

Kernel Stability Analysis.
In previous related work, the Nyström method is usually treated as a preprocessing step, and most studies only analyze approximation error bounds without considering the impact of the approximation on the performance of the kernel machine. In the following, we analyze the kernel stability of our method, bounding the relative performance based on the weighted kernel approximation error. This provides performance guarantees for our multi-Nyström approximation method in the context of large scale imbalanced classification.
Proposition 1. Let $\alpha^*$ be the optimal solution of the kernel SVM with kernel $K$, and let $\tilde{\alpha}$ be the solution of the kernel SVM with kernel $\tilde{K}$ obtained by Nyström approximation. Then, $\|\alpha^* - \tilde{\alpha}\|_2$ is bounded above by a weighted sum of the Nyström approximation errors $\|K_m - \tilde{K}_m\|_2$, with constants involving $\lambda_{\min}$, the smallest eigenvalue of the kernel matrix, and $\theta$, the constant from Hoffman's bound, which is independent of $\alpha^*$ and $\tilde{\alpha}$.
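Proof. Let $\nabla^{+} f(x) \equiv x - [x - \nabla f(x)]^{+}_{X}$ be the projected gradient, where $X$ is the bounded constraint set and $[x]^{+}_{X} \equiv \arg\min_{y \in X} \|x - y\|$ is the convex projection operator. It can be used to define an error bound according to the following theorem.
Theorem 1 (see [34]). Let $\bar{x}$ be the optimal solution nearest to $x$ of a convex optimization problem over $X$ whose objective $f$ is built from a function $g(t)$ that is $\sigma_g$-strongly convex, with $\nabla f(x)$ being $\rho$-Lipschitz continuous and $X = \{x \mid Ax \le d\}$ a polyhedral set. Then, the optimization problem admits a global error bound on $\|x - \bar{x}\|$ in terms of the projected gradient $\nabla^{+} f(x)$, with a constant that depends on $\theta$, the constant from Hoffman's bound.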

Consider now the corresponding minimization problem under the Nyström approximation. Note that this problem is equivalent to problem (15) with the equality $\tilde{K} = C W_k^{+} C^{T}$ ($W_k^{+}$ is SPSD). Let $f$ be the dual objective function of the multiple kernel learning problem (5) with the original kernel $K = \sum_m d_m K_m$, and let $\tilde{f}$ be the objective function of the approximate multiple kernel learning problem (9) obtained by our multi-Nyström method (13). Consider now $\alpha^*$ and $\tilde{\alpha}$ as the optimal solutions of $f(\alpha)$ and $\tilde{f}(\alpha)$, respectively. Using the fact that $\nabla f(\alpha^*) = 0$ and $\nabla \tilde{f}(\tilde{\alpha}) = 0$, the gradient of $\tilde{f}$ at $\alpha^*$ can be bounded in terms of $\|K_m - \tilde{K}_m\|_2$, the spectral norm error of the $m$th Nyström approximation based on the $m$th subset of landmark points. Furthermore, we use the inequality $\|\nabla^{+} \tilde{f}(\alpha^*)\|_2 \le \|\nabla \tilde{f}(\alpha^*)\|_2$ of the kernel SVM given by [35] (proof of Theorem 2), along with Theorem 1, to upper bound the norm difference between the optimal solutions of $f(\alpha)$ and $\tilde{f}(\alpha)$. The proposition shows that the norm difference $\|\alpha^* - \tilde{\alpha}\|_2$ is controlled by a weighted Nyström approximation error. It guides us to focus on approximating the kernel matrices with greater weights to obtain better learning performance.

Figure 1: Architecture of the proposed multi-Nyström method. M subsets from the majority class are sampled to construct balanced landmark points; the Nyström method is then used to obtain the approximate base kernel matrices, and the multiple kernel learning (MKL) algorithm is applied to optimize the mixture weights and train the classifier. Finally, the trained kernel classifier based on multi-Nyström is obtained.

Experiments
In this section, to validate the efficiency of the proposed method in solving large scale imbalanced problems, we compare our method against kernel methods including SVM and MKSVM (multiple kernel SVM), as well as the standard Nyström approximation method. All experiments are run on a PC with an Intel quad-core i7-8565U CPU @ 1.80 GHz and 8 GB of memory.
We evaluate classification performance with three measures: the F1 score, G-mean, and AUC. The F1 score and G-mean are defined as
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN}, \qquad \text{G-mean} = \sqrt{\frac{TP}{TP + FN} \times \frac{TN}{TN + FP}},$$
where TP, TN, FP, and FN represent the numbers of true-positive, true-negative, false-positive, and false-negative instances, respectively. The F1 score measures the classification performance on the minority class. G-mean reflects the overall classification performance. AUC works well for comparing performance between algorithms [36]. Table 2 provides the average experimental results of the proposed method and the other three algorithms on the four imbalanced datasets using the above three measures. We first compare SVM and the standard Nyström method. The Nyström method uses uniform sampling without replacement to approximate the kernel matrix, which relieves the model's sensitivity to class imbalance to a certain extent. For example, on the Poker-8-9_vs_5 dataset, the G-mean of the Nyström method is nearly 7 times that of SVM. However, in terms of AUC and F1 score, there still exists a large gap in model accuracy compared with SVM.
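To illustrate why the F1 score and G-mean are preferred over plain accuracy on imbalanced data, the sketch below computes them from confusion-matrix counts using their standard definitions. The helper names, the convention that the minority class is labeled `+1`, and the toy labels are assumptions made for this example only.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN with `positive` as the minority label (assumed convention)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, tn, fp, fn

def f1_score(tp, fp, fn):
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(tp, tn, fp, fn):
    tpr = tp / (tp + fn) if tp + fn else 0.0   # recall on the minority class
    tnr = tn / (tn + fp) if tn + fp else 0.0   # recall on the majority class
    return math.sqrt(tpr * tnr)

# Toy case: 95 majority (-1) and 5 minority (+1) instances,
# scored for a classifier that always predicts the majority class.
y_true = [-1] * 95 + [1] * 5
y_pred = [-1] * 100
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
print(accuracy, f1_score(tp, fp, fn), g_mean(tp, tn, fp, fn))  # 0.95 0.0 0.0
```

The trivial classifier reaches 95% accuracy while both the F1 score and G-mean are zero, which is the behavior these measures are meant to expose.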

Experimental Results.
Next, we compare our multi-Nyström method with the standard Nyström method. The experimental results clearly demonstrate that our method outperforms the Nyström method, especially under extreme imbalance. This mainly benefits from the undersampling of the majority class, which effectively balances the class distribution. Moreover, multi-Nyström improves the accuracy of the model. For example, with the same number of landmark points, the F1 score and AUC of multi-Nyström are closer to those of SVM on the USPS dataset and even higher than those of SVM on the Poker-8-9_vs_5 and Page-blocks0 datasets.
Note that our method is also a type of approximation of MKL; therefore, we finally examine the performance of the MKL-based MKSVM. The results show the effect of using MKL to represent the input data, which also implicitly explains how our method achieves better accuracy at the expense of more computation.

Discussion.
In this part, we further discuss the impact of different parameters on performance. In the first experiment, to study the impact of the number of sampled landmark points on classification performance, we fix the approximation rank and successively increase the number of sampled landmark points, then train and test the SVM model on the four datasets; the results are shown in Figure 2. As the number of sampled landmark points increases, the performance of our method and of Nyström shows a rising trend, although with some fluctuations. Moreover, except for a few cases, our method yields a higher G-mean while using fewer landmark points.
In the second experiment, we study how performance varies with the rank parameter. Figure 3 shows the G-mean on the four datasets for different approximation ranks. The results show that with the same approximate kernel rank, our method achieves better classification performance than the others.
Finally, we compare the running time of our method and MKSVM. We report the results on two datasets, USPS and Page-blocks, in Figure 4. The results show that our method can significantly speed up the MKL process while maintaining performance. For example, on the USPS dataset, our method reduces the running time by more than one order of magnitude. The main reason is that the low-rank structure of the approximate kernel matrix speeds up the MKL algorithm.
For further analysis of the experimental results, we perform the Friedman test with respect to the F1 score. First, we calculate the average ranks of SVM, Nyström, multi-Nyström, and MKSVM, as shown in Figure 5. It can be seen that MKSVM gives the best performance, while SVM and the proposed multi-Nyström rank similarly. In a comparison of $k$ algorithms on $N$ datasets, with $r_i$ the average rank of the $i$th algorithm, the Friedman statistic $F_F$ is calculated as
$$F_F = \frac{(N - 1)\,\chi_F^2}{N(k - 1) - \chi_F^2},$$
with
$$\chi_F^2 = \frac{12N}{k(k + 1)}\left[\sum_{i=1}^{k} r_i^2 - \frac{k(k + 1)^2}{4}\right],$$
where $F_F$ follows the F-distribution with $(4 - 1)$ and $(4 - 1)(4 - 1)$ degrees of freedom. For our experiments, $F_F = 10.3333$. The critical value of $F(3, 9)$ is 3.8625 for $\alpha = 0.05$. Since $F_F > F(3, 9)$, we can reject the null hypothesis that all the algorithms have the same performance. Then, we perform the Nemenyi test to compare the algorithms pairwise. The critical difference is calculated as
$$CD = q_\alpha \sqrt{\frac{k(k + 1)}{6N}},$$
which gives $CD = 2.3452$ for $\alpha = 0.05$. The differences between the average ranks of SVM, Nyström, and multi-Nyström and that of MKSVM are 1.0, 2.75, and 1.25, respectively. Hence, we can state that the best method, MKSVM, is significantly better than Nyström at $\alpha = 0.05$. However, the difference between MKSVM and the proposed multi-Nyström is not significant, which indicates that the proposed method performs better than the standard Nyström kernel classifier and is more efficient than MKSVM.
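The reported statistics can be checked with a short script. Two inputs are assumptions rather than values quoted in the text: the average ranks are inferred from the reported rank differences (1.0, 2.75, 1.25) together with the fact that the average ranks of four algorithms sum to 10, and $q_\alpha = 2.569$ is the tabulated Nemenyi critical value for four algorithms at $\alpha = 0.05$. The function names are hypothetical.

```python
import math

def friedman_statistics(avg_ranks, n_datasets):
    """Return (chi2_F, F_F) for k algorithms compared on N datasets."""
    k, N = len(avg_ranks), n_datasets
    chi2 = 12 * N / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    ff = (N - 1) * chi2 / (N * (k - 1) - chi2)
    return chi2, ff

def nemenyi_cd(k, n_datasets, q_alpha):
    """Critical difference of the Nemenyi post-hoc test."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))

# Average ranks inferred from the reported differences (an assumption, not read from Figure 5).
ranks = {"SVM": 2.25, "Nystrom": 4.0, "multi-Nystrom": 2.5, "MKSVM": 1.25}
chi2, ff = friedman_statistics(list(ranks.values()), n_datasets=4)
cd = nemenyi_cd(k=4, n_datasets=4, q_alpha=2.569)
print(round(ff, 4), round(cd, 4))  # 10.3333 2.3452

# Pairwise comparison against the best-ranked method.
for name, r in ranks.items():
    if name != "MKSVM":
        print(name, r - ranks["MKSVM"], (r - ranks["MKSVM"]) > cd)
```

Under these inferred ranks the script reproduces $F_F = 10.3333$ and $CD = 2.3452$, and only the gap between Nyström and MKSVM (2.75) exceeds the critical difference.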

Conclusions
In this study, we propose a novel method to overcome the time and memory limitations of the standard Nyström method and extend it to large scale imbalanced classification. In general, kernel approximation and model training are carried out separately. To obtain more accurate results, our method mixes multiple Nyström approximations and embeds them in the model training process to learn the model parameters and mixture weights simultaneously. In particular, the approximate kernel matrix yielded by our method is low rank and balanced. We also provide an error bound on the model solution based on our approximation method to guide improvements to the learning process. Experimental results show that our method achieves higher classification accuracy and, at the same time, dramatically improves the efficiency of existing MKL algorithms.
Potential improvements: there are still some caveats in our current solution. For example, due to the curse of kernelization, the number of support vectors grows without bound whenever a nonzero loss is incurred. This significantly increases the computational cost and can be infeasible for large scale problems. Future work will chiefly focus on more efficient variants of multi-Nyström involving budget kernel learning to address this issue.

Data Availability
The data used to support the findings of this study have been deposited in the KEEL repository (http://keel.es/) and the LIBSVM archive (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/).

Conflicts of Interest
The authors declare that they have no conflicts of interest.