Localized Simple Multiple Kernel K-Means Clustering with Matrix-Induced Regularization

Multikernel clustering achieves clustering of linearly inseparable data by applying a kernel method to samples in multiple views. A localized SimpleMKKM (LI-SimpleMKKM) algorithm has recently been proposed to perform min-max optimization in multikernel clustering where each instance is only required to be aligned with a certain proportion of the relatively close samples. The method has improved the reliability of clustering by focusing on the more closely paired samples and dropping the more distant ones. Although LI-SimpleMKKM achieves remarkable success in a wide range of applications, the method keeps the sum of the kernel weights unchanged. Thus, it restricts kernel weights and does not consider the correlation between the kernel matrices, especially between paired instances. To overcome such limitations, we propose adding a matrix-induced regularization to localized SimpleMKKM (LI-SimpleMKKM-MR). Our approach addresses the kernel weight restrictions with the regularization term and enhances the complementarity between base kernels. Thus, it does not limit kernel weights and fully considers the correlation between paired instances. Extensive experiments on several publicly available multikernel datasets show that our method performs better than its counterparts.


Introduction
Clustering is a widely used machine learning technique [1][2][3][4]. Multikernel clustering is a family of clustering methods based on multiview clustering that performs clustering by implicitly mapping sample points of different views into high-dimensional spaces. Many studies have been carried out in recent years [5][6][7][8][9]. For example, early work [10] shows that kernel matrices can encode different views or sources of the data, and MKKM [11] extends the kernel combination by adapting the weights of the kernel matrices. Gönen and Margolin [12] improve the performance of MKKM by assigning sample-specific weights based on the correlations between neighboring samples, yielding localized MKKM. Du et al. [13] employed the l2,1 norm to reduce the uncertainty in the algorithm's results caused by unexpected factors such as outliers. To enhance the complementarity of the base kernels and reduce redundancy, Liu et al. [14] employed a regularization term containing a matrix that measures the correlation between base kernels to facilitate alignment. Other works [15][16][17][18][19] differ from the original MKKM method [11], which pre-fuses the multiple view kernels. These methods first obtain the clustering result of each kernel matrix and then fuse these results in a later stage to obtain a unified result.
More recently, a newly proposed optimization strategy, simple multiple kernel k-means (SimpleMKKM) [20], has emerged as a representative of multikernel clustering (MKC). Different from the usual MKKM algorithm, SimpleMKKM minimizes over the kernel weights while maximizing over the cluster partition, which leads to a min-max optimization that is somewhat difficult to solve. It converts the optimization into a minimization problem and cleverly solves it with a specially designed gradient descent method rather than a coordinate descent method. However, the strict alignment of the combined kernel matrix forces the alignment globally over all samples. Therefore, Liu et al. [21] proposed localized SimpleMKKM, which reduces the negative impact of distant samples on clustering by restricting the kernel alignment to the k-nearest neighbors of each sample rather than performing global alignment. In this way, LI-SimpleMKKM can sufficiently account for the variation between samples, improving clustering performance.
Although localized SimpleMKKM shows excellent performance on MKC problems, we find that the correlation between the given kernels is not sufficiently considered, providing an opportunity for improvement based on the following problems.
(i) The original method [21] keeps the optimization stable by setting a larger weight η_u in the gradient descent step and maintaining the summation and nonnegativity of the weights through the association with the other weights. However, this idea only strengthens the relation between the different view weights and η_u; it does not consider the relationship between the view kernel matrices, especially between pairs of them. (ii) The original method may select highly correlated kernels for clustering simultaneously. Repeatedly selecting similar information sources makes the combination redundant and low in diversity, lowering the effective contribution of the different kernel matrices and ultimately affecting the accuracy of the clustering results.
Motivated by these observations, we propose a localized SimpleMKKM with matrix-induced regularization (LI-SimpleMKKM-MR) that improves upon the LI-SimpleMKKM algorithm by adding a term containing a matrix that measures the correlation between every two base kernel matrices. The LI-SimpleMKKM-MR algorithm reduces the probability of simultaneously selecting highly correlated kernels, thereby enhancing the diversity of the combined kernel and the complementarity of low-correlation kernels. Moreover, it retains the advantages of localized SimpleMKKM: a better optimization effect is achieved by clustering with the neighbor index matrix formed by each sample and its nearest k neighbors, and the optimization strategy min η − max H is used instead of min η − min H.
Compared with the original multiple kernel clustering, the proposed method optimizes the kernel matrix weights by gradient descent rather than coordinate descent, combined with localized sample alignment and matrix-induced regularization. This reduces the negative effects of the forced alignment of long-distance samples and of the high redundancy and low complementarity among multiple kernel matrices.
We evaluated the algorithm on 6 benchmark datasets and compared it with nine other baseline algorithms that solve similar problems, using four indicators: clustering accuracy (ACC), normalized mutual information (NMI), purity, and rand index. We find that LI-SimpleMKKM-MR outperforms the other methods. To the best of our knowledge, this is the first work to fully consider and solve the correlation problem between the base kernels.
The contributions of this work are summarized as follows: (1) The proposed algorithm LI-SimpleMKKM-MR can productively deal with the alignment problem between kernel matrices using a regularization term, reducing redundancy and enhancing the complementarity and correlation between kernel matrices. (2) The proposed method can be transformed into SimpleMKKM or LI-SimpleMKKM by adjusting its hyperparameters, making LI-SimpleMKKM-MR an extension of the previous two methods. (3) We conducted extensive experiments on 6 public multiple kernel datasets using 4 metrics. The results show that our method achieves state-of-the-art performance compared with 9 existing baseline algorithms, which validates our analysis of the previous problems and the effectiveness of the proposed solution.

Related Works
Multiple Kernel K-Means.

Let $\{x_i\}_{i=1}^{n} \subseteq \chi$ be a set of $n$ samples, and let $\phi_p(\cdot): x \in \chi \mapsto \mathcal{H}_p$ map the features of a sample $x$ in the $p$th view into a high-dimensional Hilbert space $\mathcal{H}_p$ $(1 \le p \le m)$. Each sample can then be represented as $\phi_\eta(x) = [\eta_1 \phi_1(x)^\top, \ldots, \eta_m \phi_m(x)^\top]^\top$, where $\eta = [\eta_1, \ldots, \eta_m]^\top$ denotes the weights of the $m$ prespecified base kernels $\{K_p(\cdot, \cdot)\}_{p=1}^{m}$. The kernel weights are updated by the algorithm in the kernel learning step. According to the definition of $\phi_\eta(x)$ and the definition of a kernel function, the combined kernel function can be defined as follows:

$K_\eta(x_i, x_j) = \phi_\eta(x_i)^\top \phi_\eta(x_j) = \sum_{p=1}^{m} \eta_p^2 K_p(x_i, x_j).$  (1)

Using the training samples $\{x_i\}_{i=1}^{n}$ and (1), we can calculate a kernel matrix $K_\eta$. Based on $K_\eta$, the objective function of MKKM can be expressed as follows:

$\min_{H, \eta} \ \operatorname{Tr}\!\left(K_\eta (I_n - H H^\top)\right) \quad \text{s.t.}\ H \in \mathbb{R}^{n \times k},\ H^\top H = I_k,\ \eta^\top \mathbf{1}_m = 1,\ \eta_p \ge 0\ \forall p.$  (2)

Here, $H \in \mathbb{R}^{n \times k}$ is a soft label matrix, also called the partition matrix, which is used to avoid the NP-hardness caused by directly using a hard assignment, and $I_k$ denotes the $k \times k$ identity matrix.
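To make the setup concrete, here is a small Python sketch that builds several base kernels from one feature matrix and forms the combined kernel of equation (1). The kernel functions, the random data, and all variable names are illustrative assumptions rather than the authors' actual setup.

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel

# Illustrative construction of m base kernels from one feature matrix X (n x d);
# in practice each kernel may come from a different view or kernel function.
X = np.random.RandomState(0).randn(100, 20)
kernels = [linear_kernel(X), polynomial_kernel(X, degree=2), rbf_kernel(X, gamma=0.1)]

# Combined kernel of equation (1): K_eta = sum_p eta_p^2 K_p, with eta on the simplex.
eta = np.full(len(kernels), 1.0 / len(kernels))
K_eta = sum((e ** 2) * K for e, K in zip(eta, kernels))
```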
The optimization of (2) can be divided into two steps: optimizing H or η while fixing the other.
With η fixed, the optimization of H in (2) reduces to

$\max_{H} \ \operatorname{Tr}\!\left(H^\top K_\eta H\right) \quad \text{s.t.}\ H \in \mathbb{R}^{n \times k},\ H^\top H = I_k,$  (3)

which can be easily solved by taking the eigenvectors corresponding to the k largest eigenvalues of the matrix $K_\eta$.
With H fixed, the optimization of η under the constraints can be easily solved in closed form by the Lagrange multiplier method [10].
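The following sketch illustrates this two-step alternating scheme, assuming the closed-form weight update $\eta_p \propto 1/\operatorname{Tr}(K_p(I - HH^\top))$ that the Lagrange multiplier method yields for the quadratic weighting in (2); it is a minimal illustration under these assumptions, not the reference implementation.

```python
import numpy as np
from scipy.linalg import eigh

def mkkm_alternate(kernels, k, n_iter=50):
    """Two-step alternating optimization for MKKM (sketch).

    kernels : list of m precomputed base kernel matrices, each (n, n)
    k       : number of clusters
    """
    m, n = len(kernels), kernels[0].shape[0]
    eta = np.full(m, 1.0 / m)                          # uniform initial weights
    for _ in range(n_iter):
        K_eta = sum((e ** 2) * K for e, K in zip(eta, kernels))
        # Step 1: with eta fixed, H consists of the top-k eigenvectors of K_eta.
        _, H = eigh(K_eta, subset_by_index=[n - k, n - 1])
        # Step 2: with H fixed, minimize sum_p eta_p^2 * Tr(K_p (I - H H^T))
        # subject to sum_p eta_p = 1; the Lagrange multiplier method gives
        # eta_p proportional to 1 / Tr(K_p (I - H H^T)) (assumption of this sketch).
        proj = np.eye(n) - H @ H.T
        a = np.array([np.trace(K_p @ proj) for K_p in kernels])
        eta = (1.0 / a) / np.sum(1.0 / a)
    return H, eta
```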

MKKM with Matrix-Induced Regularization.
As (2) shows, η_p depends only on K_p and H; the interactions between different kernel matrices are not considered. Liu et al. [14] defined a criterion M(K_p, K_q) to measure the correlation between K_p and K_q: a larger M(K_p, K_q) means a high correlation between K_p and K_q, and a smaller one implies that their correlation is low. By introducing this criterion as a regularization term in (2), the following objective function is obtained:

$\min_{H, \eta} \ \operatorname{Tr}\!\left(K_\eta (I_n - H H^\top)\right) + \frac{\lambda}{2}\, \eta^\top M \eta \quad \text{s.t.}\ H^\top H = I_k,\ \eta^\top \mathbf{1}_m = 1,\ \eta_p \ge 0,$  (4)

where M is the matrix with entries $M_{pq} = M(K_p, K_q)$ and λ is a hyperparameter that balances the clustering loss and the regularization term.
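For concreteness, one possible way to build M, using the $M_{pq} = \operatorname{Tr}(K_p K_q)$ instantiation adopted later in this paper, is sketched below; the function name is ours.

```python
import numpy as np

def correlation_matrix(kernels):
    """Build M with M_pq = Tr(K_p K_q), measuring the correlation between base kernels."""
    m = len(kernels)
    M = np.empty((m, m))
    for p in range(m):
        for q in range(m):
            M[p, q] = np.trace(kernels[p] @ kernels[q])
    return M
```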

Localized SimpleMKKM.
Unlike the existing min η − min H paradigm, SimpleMKKM adopts min η − max H optimization [20]. It is further extended to make full use of the information between local sample neighbors together with the min η − max H optimization to enhance the clustering effect, yielding an algorithm called localized SimpleMKKM. With the simplex

$\Delta = \{\eta \in \mathbb{R}^{m} : \eta^\top \mathbf{1}_m = 1,\ \eta_p \ge 0\ \forall p\},$  (5)

the objective of LI-SimpleMKKM can be represented as follows:

$\min_{\eta \in \Delta} \ \max_{H^\top H = I_k} \ \sum_{i=1}^{n} \operatorname{Tr}\!\left(H^\top B^{(i)} K_\eta B^{(i)} H\right),$  (6)

where $K_\eta = \sum_{p=1}^{m} \eta_p^2 K_p$ and $B^{(i)} = N^{(i)} N^{(i)\top}$ with $N^{(i)} \in \{0, 1\}^{n \times \mathrm{round}(\tau \times n)}$ is the neighborhood mask matrix of the ith sample; that is, only the samples closest to the target sample are aligned. This new problem is hard to solve with a simple two-step alternating optimization. To solve it, LI-SimpleMKKM first optimizes H by a method similar to MKKM and then converts the problem into one of finding the minimum with respect to η. After proving the differentiability of the minimized formula, the gradient descent method can be used to optimize η [21].
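A hedged sketch of the neighborhood mask construction $B^{(i)} = N^{(i)} N^{(i)\top}$ follows. How neighbors are ranked is an assumption here (we rank by similarity under a supplied kernel, for example the average of the base kernels); the original paper may determine neighbors differently.

```python
import numpy as np

def neighborhood_masks(K, tau):
    """Build B^(i) = N^(i) N^(i)^T for every sample (sketch).

    K   : (n, n) kernel used to rank neighbors (e.g., the average of the base
          kernels); the exact ranking kernel is an assumption of this sketch.
    tau : fraction of samples kept as neighbors, 0 < tau <= 1
    """
    n = K.shape[0]
    size = int(round(tau * n))
    masks = []
    for i in range(n):
        # indices of the `size` samples most similar to sample i (including itself)
        neighbors = np.argsort(-K[i])[:size]
        N_i = np.zeros((n, size))
        N_i[neighbors, np.arange(size)] = 1.0      # 0/1 selection matrix N^(i)
        masks.append(N_i @ N_i.T)                  # B^(i), an (n, n) 0/1 mask
    return masks
```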

Localized Simple Multiple Kernel K-Means with Matrix-Induced Regularization
According to Liu et al. [21], the relative value of η_p depends only on K_p, H, and K_u, where u indexes the largest component of η. Only the weights of the different kernels are linked, indicating that LI-SimpleMKKM does not fully consider the interaction between the kernels when optimizing the kernel weights. This motivates us to derive a regularization term that measures the correlation between the base kernels to remedy this shortcoming.

Formulation.
Although the performance of clustering can be improved to some extent by aligning samples with closer samples, there is still room for further improvement of that algorithm.
To address this issue, we define a criterion M(K_p, K_q) to measure the correlation between K_p and K_q. A larger M(K_p, K_q) means a high correlation between K_p and K_q, and a smaller one implies that their correlation is low. We propose to add a matrix-induced regularization η^⊤Mη to LI-SimpleMKKM to overcome the shortcomings above, enhancing the kernel alignment between multiple kernels and reducing the redundancy of highly correlated kernels. By fusing the regularization term with (6), we obtain the following objective function:

$\min_{\eta \in \Delta} \ \max_{H^\top H = I_k} \ \sum_{i=1}^{n} \operatorname{Tr}\!\left(H^\top B^{(i)} K_\eta B^{(i)} H\right) + \frac{\lambda}{2}\, \eta^\top M \eta,$  (7)

where λ is a trade-off parameter that balances the loss of the clustering problem and the regularization term on the kernel weights. The regularization term can take many forms, such as the KL divergence or the Hilbert-Schmidt independence criterion. In our proposed algorithm, we set $M_{pq} = \operatorname{Tr}(K_p K_q)$ for each element of M to measure the correlation between K_p and K_q. This choice keeps the calculation simple and is essentially a form of the Hilbert-Schmidt independence criterion, so it can reflect the correlation between different base kernels to a certain extent.
The incorporation of η^⊤Mη makes better use of the base kernels, thus improving clustering performance. Moreover, we can clearly see that if we set λ = 0, equation (7) reduces to LI-SimpleMKKM, which is therefore a special case of our formulation.
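Under the same assumptions as the earlier sketches, the objective in (7) can be evaluated for a given η by using the fact that the inner maximum over H equals the sum of the k largest eigenvalues of $\sum_i B^{(i)} K_\eta B^{(i)}$; the code below is an illustrative sketch with our own names, not the authors' implementation.

```python
import numpy as np
from scipy.linalg import eigh

def li_simplemkkm_mr_objective(kernels, masks, eta, M, lam, k):
    """Evaluate J(eta) for the proposed objective (sketch).

    kernels : list of m base kernel matrices K_p, each (n, n)
    masks   : list of n localized mask matrices B^(i) (see earlier sketch)
    eta     : (m,) kernel weights on the simplex
    M       : (m, m) matrix with M_pq = Tr(K_p K_q)
    lam     : trade-off parameter lambda
    k       : number of clusters
    """
    n = kernels[0].shape[0]
    K_eta = sum((e ** 2) * K for e, K in zip(eta, kernels))
    # T = sum_i B^(i) K_eta B^(i); the inner max over H equals the sum of its k largest eigenvalues.
    T = sum(B @ K_eta @ B for B in masks)
    top_k_eigvals = eigh(T, eigvals_only=True, subset_by_index=[n - k, n - 1])
    return np.sum(top_k_eigvals) + 0.5 * lam * eta @ M @ eta
```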
Li et al. [22] use η^⊤M^(i)η instead of η^⊤Mη as the regularization term, where M^(i) denotes a matrix induced by the neighborhood of the ith sample. Although this method shows excellent performance, we find that the matrix-induced regularization should be global rather than local, because the kernel alignment should be performed for the global kernel matrix. The experimental results in Table 1 also show that the global matrix-induced regularization has a better effect.

Alternate Optimization.
We design a two-step alternating optimization to solve the problem in (7).
(i) Optimizing H with η fixed: with η fixed, the optimization with respect to H in (7) can be represented as follows:

$\max_{H} \ \operatorname{Tr}\!\left(H^\top \Big(\sum_{i=1}^{n} B^{(i)} K_\eta B^{(i)}\Big) H\right) \quad \text{s.t.}\ H \in \mathbb{R}^{n \times k},\ H^\top H = I_k.$  (8)

Treating the summation $\sum_{i=1}^{n} B^{(i)} K_\eta B^{(i)}$ as a whole, (8) can be solved through the eigendecomposition of this matrix.

(ii) Optimizing η with H fixed: the optimization with respect to η in (7) can be represented as follows:

$\min_{\eta \in \Delta} \ \max_{H^\top H = I_k} \ \sum_{i=1}^{n} \operatorname{Tr}\!\left(H^\top B^{(i)} K_\eta B^{(i)} H\right) + \frac{\lambda}{2}\, \eta^\top M \eta.$  (9)

We first prove the differentiability of (9), then calculate the gradient, and finally optimize η by the gradient descent method. The first part of the objective function in (9) is

$\max_{H^\top H = I_k} \ \sum_{i=1}^{n} \operatorname{Tr}\!\left(H^\top B^{(i)} K_\eta B^{(i)} H\right).$  (10)

With the hyperparameter τ fixed, we can regard $B^{(i)} K_\eta B^{(i)}$ as a whole, which corresponds to a global kernel alignment and is PSD [21]. For convenience, we let $T_\eta^{(i)} = B^{(i)} K_\eta B^{(i)}$. Thus, the problem in (9) can be represented as

$\min_{\eta \in \Delta} \ J(\eta),$  (11)

with

$J(\eta) = \max_{H^\top H = I_k} \ \operatorname{Tr}\!\left(H^\top \Big(\sum_{i=1}^{n} T_\eta^{(i)}\Big) H\right) + \frac{\lambda}{2}\, \eta^\top M \eta.$  (12)

Theorem 1. J(η) in (12) is differentiable on ∆.

Proof. For any given η ∈ ∆, the maximization problem $\max_{H^\top H = I_k} \operatorname{Tr}(H^\top (\sum_{i=1}^{n} T_\eta^{(i)}) H)$ attains a unique optimal value. According to the theorem in [23], the former part of J(η) is differentiable. Denoting the elements of η other than the pth by s and the latter part of J(η) by $J_2(\eta) = (\lambda/2)\,\eta^\top M \eta$, the differential of $J_2(\eta)$ can be expressed as

$\frac{\partial J_2(\eta)}{\partial \eta_p} = \lambda \sum_{q=1}^{m} \eta_q M_{pq},$  (13)

where p denotes one of the components of η and s denotes all the other components. Hence the whole J(η) in (12) is differentiable.
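The partial derivatives derived above can be assembled as in the following sketch: the alignment part uses the maximizing H for the current η (a Danskin-type argument), giving $2\eta_p \sum_i \operatorname{Tr}(H^\top B^{(i)} K_p B^{(i)} H)$, and the regularization part is $\lambda \sum_q \eta_q M_{pq}$ as in (13). Function and variable names are ours, not from a released implementation.

```python
import numpy as np
from scipy.linalg import eigh

def gradient_J(kernels, masks, eta, M, lam, k):
    """Partial derivatives dJ/d(eta_p) at the current eta (sketch).

    Uses the optimal H for the current eta (Danskin-type argument, as in the
    differentiability discussion above) plus the derivative of (lambda/2) eta^T M eta.
    """
    n = kernels[0].shape[0]
    K_eta = sum((e ** 2) * K for e, K in zip(eta, kernels))
    T = sum(B @ K_eta @ B for B in masks)
    _, H = eigh(T, subset_by_index=[n - k, n - 1])     # maximizing H for this eta
    grad = np.empty(len(kernels))
    for p, K_p in enumerate(kernels):
        # derivative of the localized alignment term with respect to eta_p
        align = sum(np.trace(H.T @ B @ K_p @ B @ H) for B in masks)
        grad[p] = 2.0 * eta[p] * align + lam * (M[p] @ eta)
    return grad
```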
We can solve this problem by designing a gradient descent method. After obtaining the gradient of J(η) under the equality constraint $\sum_{p=1}^{m} \eta_p = 1$ and the nonnegativity constraints $\eta_p \ge 0$, we update η by gradient descent [23]. To implement this method, we let η_u be a nonzero component of η and let ∇J(η) denote the reduced gradient of J(η). The pth (1 ≤ p ≤ m) element of ∇J(η) is

$[\nabla J(\eta)]_p = \frac{\partial J(\eta)}{\partial \eta_p} - \frac{\partial J(\eta)}{\partial \eta_u}, \quad \forall p \ne u,$  (14)

and

$[\nabla J(\eta)]_u = \sum_{p=1, p \ne u}^{m} \left(\frac{\partial J(\eta)}{\partial \eta_u} - \frac{\partial J(\eta)}{\partial \eta_p}\right) = -\sum_{p \ne u} [\nabla J(\eta)]_p.$  (15)

To improve numerical stability, we choose u as the index of the largest component of the vector η. The nonnegativity constraint on η also needs to be considered during gradient descent.
To minimize J(η), we use −∇J(η) as a descent direction. However, if there is an index p with η_p = 0 and [∇J(η)]_p ≥ 0, the situation η_p < 0 may occur when the gradient update is applied, violating the nonnegativity constraint. Under these circumstances, the descent direction for that component p is set to zero. The descent direction d is therefore

$d_p = \begin{cases} 0, & \text{if } \eta_p = 0 \text{ and } [\nabla J(\eta)]_p \ge 0, \\ -[\nabla J(\eta)]_p, & \text{otherwise.} \end{cases}$  (16)

The gradient update adopts the formula η ← η + c d, where c is the step size. We determine the step size c by a one-dimensional line search rather than setting it directly; to ensure global convergence, this search uses appropriate stopping criteria, for example, Armijo's rule [21]. □ The detailed steps for solving (11) are summarized in Algorithm 1.
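A minimal sketch of one reduced-gradient update, following equations (14)-(16), is given below. The final clipping and renormalization step is a simplification of ours (the original method maintains the constraints through the reduced gradient and the line search), so treat this as illustrative only.

```python
import numpy as np

def reduced_gradient_step(eta, grad, step):
    """One reduced-gradient update of eta on the simplex (sketch of eqs. (14)-(16)).

    eta  : (m,) current weights, summing to 1, nonnegative
    grad : (m,) partial derivatives dJ/d(eta_p)
    step : step size c (chosen by line search in the full algorithm)
    """
    u = int(np.argmax(eta))                        # largest component, for numerical stability
    reduced = grad - grad[u]                       # eq. (14): [grad J]_p = dJ/d eta_p - dJ/d eta_u
    reduced[u] = -(np.sum(reduced) - reduced[u])   # eq. (15): opposite of the other components' sum
    d = -reduced                                   # descent direction
    d[(eta == 0) & (reduced >= 0)] = 0.0           # eq. (16): respect the nonnegativity constraint
    eta_new = eta + step * d                       # gradient update eta <- eta + c d
    eta_new = np.clip(eta_new, 0.0, None)          # simplification: project back to the simplex
    return eta_new / eta_new.sum()
```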

Theorem 2. The proposed algorithm converges.
Proof. Note that the value of $\sum_{i=1}^{n} \operatorname{Tr}(H^\top T_\eta^{(i)} H)$ obtained at the kth iteration is larger than that at the (k+1)th iteration. In each iteration, the gradient component for η_p (p ≠ u) is negative by equation (14), because u indexes the component of η that contributes most to the maximization, so its partial derivative is larger than those of the other components. By equation (15), the gradient component of u is the opposite of the sum of the other components' gradients. According to equation (16), the components η_p (p ≠ u) therefore increase while the coefficient η_u decreases. Let ∆ denote the difference of the objective between the kth and the (k+1)th iterations; since η_u corresponds to the largest part of each $H^\top T^{(i)} H$ and the step size c is positive, it follows that ∆ < 0. Because of the nonnegativity of η and of the kernel matrices, the former term is lower-bounded by 0 and convex, so the convergence of the former term is proved.

A similar argument applies to the latter term $(\lambda/2)\,\eta^\top M \eta$: it also decreases monotonically because M is a PSD matrix, η is nonnegative, and λ > 0. The second derivative of $J_2(\eta)$ is easily shown to be positive (since each element of M is positive), so the latter term is convex and lower-bounded by 0. Consequently, the whole objective J(η) in (12) is monotonically decreasing and lower-bounded, and the algorithm converges. □
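To tie the pieces together and to mirror the convergence behaviour argued above, here is a hedged sketch of the overall loop that monitors the decrease of J(η). It reuses the helper functions from the earlier sketches (li_simplemkkm_mr_objective, gradient_J, reduced_gradient_step) and replaces the Armijo line search with a fixed step size, so it is only an approximation of Algorithm 1.

```python
import numpy as np

def li_simplemkkm_mr(kernels, masks, M, lam, k, step=0.01, tol=1e-6, max_iter=100):
    """Minimize J(eta) by reduced-gradient descent (sketch of the overall loop)."""
    m = len(kernels)
    eta = np.full(m, 1.0 / m)
    obj_prev = li_simplemkkm_mr_objective(kernels, masks, eta, M, lam, k)
    for _ in range(max_iter):
        grad = gradient_J(kernels, masks, eta, M, lam, k)
        eta = reduced_gradient_step(eta, grad, step)
        obj = li_simplemkkm_mr_objective(kernels, masks, eta, M, lam, k)
        if abs(obj_prev - obj) / max(abs(obj_prev), 1e-12) < tol:
            break                       # objective change small enough: treated as converged
        obj_prev = obj
    return eta, obj
```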

Computational Complexity Analysis.
We now theoretically analyze the time complexity of LI-SimpleMKKM-MR. Let n and m denote the number of samples and the number of base kernels, respectively. Following Algorithm 1, LI-SimpleMKKM-MR first computes the neighborhood mask matrices with computational complexity $O(n^2 \log_2 n)$ and then computes the regularization term with computational complexity $O(m^3)$. Therefore, the time complexity of LI-SimpleMKKM-MR is $O(n^3 + n^2 \log_2 n + m^3)$ per iteration.
Let us compare the complexity of LI-SimpleMKKM-MR and LI-SimpleMKKM. Since in most cases the number of base kernels is much smaller than the number of samples (m ≪ n), the time complexity of the proposed method does not increase significantly compared with that of LI-SimpleMKKM, which is $O(n^3)$ per iteration.
The implementations of the comparison algorithms are publicly available from the corresponding papers, and we apply them directly in our experiments without modification. Among these algorithms, ONKC, MKKM-MR, LKAM, LF-MVC, and LI-SimpleMKKM require hyperparameter tuning. Based on the published papers and our experimental results, we report the best clustering results of these methods obtained by tuning their hyperparameters on each dataset.

Experimental Settings.
In all experiments, to reduce the difference between different views, all the base kernels are first centered and then scaled so that for all i and p we have $K_p(x_i, x_i) = 1$. For all the datasets, we set the number of clusters k to the actual number of categories in the dataset.
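One common way to implement this preprocessing is sketched below: standard kernel centering followed by a cosine-style rescaling so that every diagonal entry equals 1. The exact normalization used by the authors may differ; this is an assumption of the sketch.

```python
import numpy as np

def center_and_scale(K):
    """Center a kernel matrix and scale it so every diagonal entry equals 1 (sketch)."""
    n = K.shape[0]
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one       # standard kernel centering
    d = np.sqrt(np.clip(np.diag(Kc), 1e-12, None))   # guard against tiny or negative diagonals
    return Kc / np.outer(d, d)                       # cosine-style scaling: diag(K) = 1
```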

We adopt 4 indicators to measure the clustering quality: clustering accuracy (ACC), normalized mutual information (NMI), purity, and rand index (a small sketch of the ACC computation is given after the observations below). To reduce the harmful effects of randomness, we initialized and executed all algorithms fifty times (50×) and report the mean and variance of the experimental indicators. Table 3 reports the ACC, NMI, purity, and rand index of the aforementioned algorithms on all 6 datasets. The following observations can be made from the results:
The proposed algorithm adopts the advanced formulation and uses matrix-induced regularization to improve the correlation between kernel matrices, reducing redundancy and increasing the diversity of the selected kernel matrices, making it superior to its counterparts.
Together, these factors make LI-SimpleMKKM-MR significantly better than the other algorithms on the same datasets. In addition, due to time complexity and memory constraints, the results of LMKKM on some datasets are not reported.
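Since clustering accuracy requires matching predicted clusters to ground-truth classes, the sketch below computes ACC via the Hungarian algorithm; NMI and the rand index can be obtained from scikit-learn (normalized_mutual_info_score, rand_score). The helper is ours, not part of the compared toolkits.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    """Clustering accuracy (ACC): best one-to-one mapping of predicted clusters to true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = np.unique(y_true)
    clusters = np.unique(y_pred)
    # count matrix: rows = predicted clusters, columns = true classes
    count = np.zeros((len(clusters), len(classes)), dtype=int)
    for i, c in enumerate(clusters):
        for j, t in enumerate(classes):
            count[i, j] = np.sum((y_pred == c) & (y_true == t))
    row, col = linear_sum_assignment(-count)          # maximize matched counts
    return count[row, col].sum() / y_true.size
```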

Parameter Sensitivity of LI-SimpleMKKM-MR.
We designed comparative experiments to study the influence of the two hyperparameters, which control the localized alignment and the matrix-induced regularization, on the clustering effect. According to equation (7), LI-SimpleMKKM-MR tunes the clustering performance through the two hyperparameters λ and τ, referring to the regularization balance factor and the nearest-neighbor ratio, respectively.
We experimentally show the difference in clustering performance as λ and τ vary on all benchmark datasets. Figure 1 shows the ACC and NMI of our algorithm obtained by varying one of τ or λ while keeping the other fixed. Based on these figures, we can conclude that (1) as the value of τ increases, the ACC and NMI on each dataset rise to their highest values and drop correspondingly when τ decreases, and (2) with τ unchanged, the ACC and NMI exceed those of SimpleMKKM and remain steady when λ is small.
Hence, we conclude that our proposed algorithm sets a new state-of-the-art for clustering compared with algorithms that only preserve the global kernel alignment, such as LI-MKKM, because it focuses on preserving the local structure of the data; specific results are displayed in Table 1.
On top of the min η − max H optimization, the clustering performance improves when matrix-induced regularization and local alignment are combined with appropriately set parameters.

Convergence of LI-SimpleMKKM-MR.
In addition to the theoretical analysis, we experimentally verify the convergence of the algorithm by tracking the objective value of the proposed algorithm on the benchmark datasets.

(Table 3: ACC, NMI, purity, and rand index of localized SimpleMKKM with matrix-induced regularization and the nine comparison methods on the six benchmark datasets.)
According to the results, the objective value of the proposed algorithm oscillates first, then decreases monotonically, and finally converges within a few iterations. Moreover, we observe that on most datasets the algorithm converges in fewer than 10 iterations. This is comparable to state-of-the-art methods.

Performance of LI-SimpleMKKM-MR with the Learned H.
We calculate the 4 clustering metrics at each iteration to show how the clustering performance of the learned H varies on the different datasets and plot them in Figure 3. As observed, the clustering performance first increases with the iterations and then remains stable after some oscillation.

Running Time of LI-SimpleMKKM-MR.
We report the running time of all the baseline algorithms and LI-SimpleMKKM-MR on the different datasets in Figure 4. Combining the time complexity analysis in Section 3.3 with the experimental results in Figure 4, we find that, despite the additional computational steps, LI-SimpleMKKM-MR does not significantly increase the computation time.

Conclusion
Although LI-SimpleMKKM can address the multiple kernel k-means task through min η − max H optimization and realize local alignment, it does not sufficiently account for the correlation between the base kernels. This work proposes the LI-SimpleMKKM-MR algorithm, which combines localized sample alignment and matrix-induced regularization to solve this problem. Theoretically and experimentally, our method demonstrates the best clustering performance and outperforms existing algorithms. In future research, we will apply this algorithm to incomplete MKKM problems.

Disclosure
A preprint has previously been published [26].

Conflicts of Interest
The authors declare that they have no conflicts of interest.