Local Similarity-Based Fuzzy Multiple Kernel One-Class Support Vector Machine

One-class support vector machine (OCSVM) is one of the most popular algorithms for one-class classification, but it has an obvious disadvantage: it is sensitive to noise. To address this problem, a fuzzy membership degree is introduced into OCSVM so that samples of different importance have different influences on the determination of the classification hyperplane, which enhances robustness. In this paper, a new method for calculating membership degrees is proposed and introduced into the fuzzy multiple kernel OCSVM (FMKOCSVM). The combined kernel is used to measure the local similarity between samples; the importance of each sample is then determined from its local similarity to the other training samples, which fixes its membership degree and reduces the impact of noise. The proposed membership requires only positive data in the calculation process, which is consistent with the training set of OCSVM. Under this scheme, noisy samples receive smaller membership values, reducing their negative impact on the classification boundary; at the same time, the membership calculation is efficient. Experimental results show that FMKOCSVM based on the proposed local-similarity membership is efficient and more robust to outliers than ordinary multiple kernel OCSVMs.


Introduction
Anomaly detection is an important aspect of data mining. It is used to find objects in a data set that differ significantly from the rest, so that abnormal events can be prevented. At present, anomaly detection is of great significance in medicine and biological systems, and it has been successfully applied to protein detection [1], cancer screening [2], and health monitoring [3]. In essence, anomaly detection is a classification problem with extremely imbalanced classes, a feature that complex biological systems usually exhibit. For example, the data of an infectious-disease model may include feature data of patients and of non-patients, but in real life there are far more healthy people than patients. Timely and effective detection of patients with infectious diseases is an effective way to prevent outbreaks. The support vector machine (SVM) [4,5] is a classical classification algorithm, but its performance deteriorates on one-class classification problems or distribution-imbalanced data. Solutions to one-class classification include density-estimation-based methods and support-vector-based methods. The support-vector-based approach is popular because of its simplicity and efficiency, and it comes in two models: (1) the one-class support vector machine (OCSVM) [6] and (2) support vector data description (SVDD) [7]. The goal of SVDD is to find a minimum hypersphere that contains all target samples. The main idea of OCSVM is to take the origin of the feature space as a representative of the abnormal data and then separate the target samples from the origin with maximum margin. This paper focuses on OCSVM.
Like SVM, OCSVM is sensitive to noise, because it assumes that every sample has the same importance or weight during training. Introducing fuzzy membership into SVM to construct the fuzzy support vector machine (FSVM) [8] is one effective remedy. Methods for calculating fuzzy membership have mainly focused on two-class problems [9-11]. For example, a heuristic function derived from centered kernel alignment measures the dependency between a data point and its label to calculate the fuzzy membership [9]. In [10], the membership of each sample point is determined by the lower approximation operator of a Gaussian-kernel-based fuzzy rough set. In [11], entropy is used to measure the class determinacy of samples, and samples with higher class determinacy are assigned larger fuzzy memberships. To improve the robustness of OCSVM, different weights are generally assigned to training samples, yielding the weighted one-class support vector machine (WOCSVM) [12-14]. WOCSVM reduces the impact of noise by assigning lower weights to noisy samples [12]. In [14], prior knowledge is used to assign weights: the weight of an instance depends only on the distribution of its k nearest neighbors. In recent years, many studies have addressed sample weighting in one-class classification [15-18]. In [19], membership based on fuzzy rough set theory [20,21] is used as a weight in OCSVM. These methods improve the robustness of OCSVM to some extent but have limitations. For example, the method of [14] is too inefficient at computing sample weights when the data set is large, and the authors of [19] require abnormal data to calculate sample memberships.
In this paper, a novel strategy is proposed to address the poor robustness of OCSVM: a membership degree is introduced into the model. Unlike the membership calculations above, this method uses only one class of data, which fully matches the setting of OCSVM. The proposed membership is related to the local density of the data, obtained from the local similarity of the training samples, and an S-type function based on the local density is taken as the membership function.
OCSVM uses the kernel trick to handle nonlinear separability, but this raises the problem of kernel selection. Multiple kernel learning [22-26] addresses this problem, yielding the multiple kernel one-class support vector machine (MKOCSVM) [27]. The main contributions of this paper are as follows: (1) Multiple kernel learning and membership degrees are introduced into OCSVM simultaneously, producing the fuzzy multiple kernel one-class support vector machine (FMKOCSVM), which addresses both kernel selection and noise sensitivity. (2) A novel membership calculation based on local similarity is proposed.
(3) The effectiveness of this membership degree is illustrated graphically. (4) The weight coefficients of the multiple kernels are determined by maximizing the similarity between the combined kernel and the ideal kernel. (5) Experiments show that the proposed method outperforms an existing fuzzy membership method and the model without membership.
The combined kernel characterizes the data more fully than a single kernel. Using multiple kernel functions simultaneously avoids the difficult problem of selecting a kernel function and its parameters, and the method adapts to different sample information. The rest of this paper is organized as follows: Section 2 introduces OCSVM, MKOCSVM, and FMKOCSVM.
The formulation and algorithm of our FMKOCSVM based on local similarity are detailed in Section 3; Section 4 reports the experimental results, followed by the conclusion in Section 5.

Related Information
2.1. One-Class Support Vector Machine. Compared with SVM, OCSVM is suitable for data with class imbalance or one-class classification. Its main idea is to map the data from the original space to a feature space through a nonlinear mapping and then, taking the origin of the feature space as the representative of outliers, find an optimal classification hyperplane in the feature space that separates the images of normal data from the origin with maximum margin. A graphical illustration is shown in Figure 1: Figure 1(a) depicts the classification in the original space, and Figure 1(b) the classification in the feature space.
Suppose the training samples are \(x_1, x_2, \ldots, x_l \in R^n\) (\(n\) is the dimension of \(x_i\)), where \(l\) is the number of training samples, and let \(\phi(\cdot)\) be the function that maps samples into the feature space. Let \(\omega\) and \(\rho\) denote the normal vector and bias term of the classification hyperplane in the feature space, so the hyperplane is expressed as

$$\omega^{T}\phi(x) - \rho = 0.\tag{1}$$

The goal is to maximize the distance between the classification hyperplane and the origin. OCSVM therefore solves the following convex program [6]:

$$\min_{\omega,\rho,\xi}\ \frac{1}{2}\|\omega\|^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \rho
\quad\text{s.t.}\quad \omega^{T}\phi(x_i) \ge \rho - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1, \ldots, l.$$

Here, \(\xi_i\) is a slack variable, which means that outliers are allowed to exist, and \(\nu \in (0, 1]\) is a parameter controlling the proportion of support vectors and error points. Using the Lagrange multiplier method, the dual of the above optimization problem can be written as

$$\min_{\alpha}\ \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j k(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{\nu l},\ \ \sum_{i=1}^{l}\alpha_i = 1.\tag{2}$$

The solution must satisfy the KKT (Karush-Kuhn-Tucker) conditions

$$\alpha_i\bigl(\rho - \xi_i - \omega^{T}\phi(x_i)\bigr) = 0,\qquad \gamma_i\xi_i = 0,\qquad \alpha_i + \gamma_i = \frac{1}{\nu l},\tag{3}$$

where \(\gamma_i \ge 0\) is the multiplier of the constraint \(\xi_i \ge 0\). By equation (3), for every sample \(x_i\) either \(\alpha_i = 0\) or \(\omega^{T}\phi(x_i) = \rho - \xi_i\). When \(\alpha_i = 0\), the sample \(x_i\) has no effect on the hyperplane. When \(\alpha_i > 0\), \(\omega^{T}\phi(x_i) = \rho - \xi_i\) must hold, and such a sample is called a support vector. If \(\alpha_i < 1/(\nu l)\), then \(\gamma_i > 0\) and hence \(\xi_i = 0\); that is, the sample lies on the maximum-margin boundary. If \(\alpha_i = 1/(\nu l)\), then \(\gamma_i = 0\); in this case, when \(\xi_i > 0\), the sample \(x_i\) is misclassified and is called a boundary support vector. As shown in Figure 2, samples with different values of \(\alpha_i\) occupy different positions.
Let \(N_{SV}\) and \(N_{BSV}\) denote the total numbers of support vectors and boundary support vectors, respectively. Since the maximum value of \(\alpha_i\) is \(1/(\nu l)\) and the dual imposes the constraint

$$\sum_{i=1}^{l}\alpha_i = 1,\qquad 0 \le \alpha_i \le \frac{1}{\nu l},\tag{4}$$

we obtain the inequality

$$\frac{N_{BSV}}{\nu l} \le 1 \le \frac{N_{SV}}{\nu l}.\tag{5}$$

Multiplying both sides of equation (5) by \(\nu\) gives

$$\frac{N_{BSV}}{l} \le \nu \le \frac{N_{SV}}{l}.\tag{6}$$

Equation (6) shows that the value of \(\nu\) is a lower bound on the fraction of support vectors and an upper bound on the fraction of boundary support vectors. The normal vector of the hyperplane is obtained from

$$\omega = \sum_{i=1}^{l}\alpha_i\phi(x_i).\tag{7}$$

Let \(x_{SV_t}\) denote the \(t\)-th support vector located on the maximum-margin plane (\(0 < \alpha_t < 1/(\nu l)\)); the bias term of the hyperplane is then

$$\rho = \sum_{j=1}^{l}\alpha_j k\bigl(x_j, x_{SV_t}\bigr).\tag{8}$$

Thus, the decision function can be written as

$$f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{l}\alpha_i k(x_i, x) - \rho\Bigr).\tag{9}$$

For a given test sample \(x\), substituting \(x\) into equation (9): when \(f(x)\) returns \(+1\), the sample is judged normal; when \(f(x)\) returns \(-1\), it is judged abnormal.
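The training and prediction pipeline of equations (1)-(9) can be sketched with scikit-learn's `OneClassSVM`, an independent off-the-shelf implementation of this model; the data, kernel width, and ν value below are illustrative choices, not the paper's:

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Train on target-class data only, as OCSVM requires.
rng = np.random.default_rng(0)
X_train = rng.normal(loc=0.0, scale=1.0, size=(200, 2))

# nu plays the role of v in equation (6): an upper bound on the fraction of
# boundary support vectors and a lower bound on the fraction of support vectors.
clf = OneClassSVM(kernel="rbf", gamma=0.5, nu=0.05).fit(X_train)

X_test = np.vstack([
    rng.normal(0.0, 1.0, size=(50, 2)),   # drawn from the target distribution
    rng.normal(8.0, 0.5, size=(10, 2)),   # obvious outliers, far from the targets
])
pred = clf.predict(X_test)  # +1 = normal, -1 = abnormal, as in equation (9)
```

`decision_function` exposes the quantity inside the sign of equation (9), so the ±1 predictions are simply its sign.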

Multiple Kernel One-Class Support Vector Machine.
MKOCSVM replaces the single kernel function in conventional OCSVM with a combined kernel, which effectively avoids the difficulty of selecting a kernel function and its parameters. Combined kernels may be linear or nonlinear combinations of basis kernels [28]; the linear combination is expressed as

$$k_{\eta}(x_i, x_j) = \sum_{m=1}^{P}\eta_m k_m(x_i, x_j),\qquad \eta_m \ge 0,\ \ \sum_{m=1}^{P}\eta_m = 1,\tag{10}$$

where \(\eta_m\) is the kernel weight of the \(m\)-th basis kernel. The MKOCSVM model can be formulated as

$$\min_{\omega_{\eta},\rho,\xi}\ \frac{1}{2}\|\omega_{\eta}\|^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}\xi_i - \rho
\quad\text{s.t.}\quad \omega_{\eta}^{T}\phi_{\eta}(x_i) \ge \rho - \xi_i,\ \ \xi_i \ge 0,\tag{11}$$

where \(\phi_{\eta}(\cdot)\) is the feature mapping induced by \(k_{\eta}\). Its dual is

$$\min_{\alpha}\ \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j k_{\eta}(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le \frac{1}{\nu l},\ \ \sum_{i=1}^{l}\alpha_i = 1,\tag{12}$$

with decision function

$$f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{l}\alpha_i k_{\eta}(x_i, x) - \rho\Bigr).\tag{13}$$

To seek the optimal combination weight for each basis kernel, the authors of [27] suggest maximizing the kernel-target alignment between the combined kernel and the ideal kernel \(K^{*}\), that is, solving the following objective function [29]:

$$\max_{\eta}\ A\bigl(K_{\eta}, K^{*}\bigr) = \frac{\langle K_{\eta}, K^{*}\rangle}{\sqrt{\langle K_{\eta}, K_{\eta}\rangle\,\langle K^{*}, K^{*}\rangle}}.\tag{14}$$

Here, \(K_m\) is the kernel matrix of the \(m\)-th basis kernel, and \(\langle K_1, K_2\rangle\) is the Frobenius inner product between two matrices, given by

$$\langle K_1, K_2\rangle = \sum_{i=1}^{l}\sum_{j=1}^{l} K_1(x_i, x_j)\,K_2(x_i, x_j).\tag{15}$$

Only equation (16) needs to be solved to obtain the optimal combination weights:

$$\eta = (M + \delta I)^{-1}a,\qquad M_{pq} = \langle K_p, K_q\rangle,\ \ a_m = \langle K_m, K^{*}\rangle,\tag{16}$$

followed by normalization of \(\eta\) so that its entries sum to one. Here, \(\delta\) is a regularization coefficient.
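The alignment-based weighting can be sketched in NumPy as follows. The ideal one-class kernel K* = 11ᵀ (all labels +1) and the regularized closed form η = (M + δI)⁻¹a are standard choices in the kernel-alignment literature; the paper's exact equation (16) may differ in detail, so treat this as an assumption:

```python
import numpy as np

def rbf_kernel(X, width):
    # K[i, j] = exp(-||x_i - x_j||^2 / (2 * width^2))
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * width ** 2))

def mkl_weights(kernels, delta=100.0):
    """Alignment-based combination weights: eta = (M + delta*I)^{-1} a,
    clipped to be nonnegative and normalized to sum to 1.
    M_pq = <K_p, K_q>_F, a_m = <K_m, K*>_F with the one-class ideal
    kernel K* = 1 1^T (all labels are +1)."""
    P = len(kernels)
    l = kernels[0].shape[0]
    K_ideal = np.ones((l, l))
    a = np.array([np.sum(K * K_ideal) for K in kernels])            # Frobenius products
    M = np.array([[np.sum(Kp * Kq) for Kq in kernels] for Kp in kernels])
    eta = np.linalg.solve(M + delta * np.eye(P), a)
    eta = np.clip(eta, 0.0, None)
    return eta / eta.sum()

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
widths = [2.0 ** p for p in range(-6, 1)]          # seven Gaussian widths, as in Section 4
Ks = [rbf_kernel(X, w) for w in widths]
eta = mkl_weights(Ks)
K_combined = sum(e * K for e, K in zip(eta, Ks))   # equation (10) in matrix form
```

The combined matrix `K_combined` is what later stages (membership calculation and the dual problem) consume.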

Fuzzy Multiple Kernel One-Class Support Vector Machine.
Let \(s_i\) denote the membership degree of the sample \(x_i\); the training set can then be expressed as \((x_1, s_1), (x_2, s_2), \ldots, (x_l, s_l)\) with \(s_i \in (0, 1]\). FMKOCSVM solves the following optimization problem:

$$\min_{\omega_{\eta},\rho,\xi}\ \frac{1}{2}\|\omega_{\eta}\|^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}s_i\xi_i - \rho
\quad\text{s.t.}\quad \omega_{\eta}^{T}\phi_{\eta}(x_i) \ge \rho - \xi_i,\ \ \xi_i \ge 0.\tag{17}$$

When \(s_i = 1\), \(i = 1, 2, \ldots, l\), FMKOCSVM degenerates to the ordinary MKOCSVM. Introducing Lagrange multipliers \(\alpha_i \ge 0\) and \(\gamma_i \ge 0\) for the inequality constraints, the Lagrangian of equation (17) is

$$L = \frac{1}{2}\|\omega_{\eta}\|^{2} + \frac{1}{\nu l}\sum_{i=1}^{l}s_i\xi_i - \rho
- \sum_{i=1}^{l}\alpha_i\bigl(\omega_{\eta}^{T}\phi_{\eta}(x_i) - \rho + \xi_i\bigr)
- \sum_{i=1}^{l}\gamma_i\xi_i.\tag{18}$$

Setting the derivatives with respect to \(\omega_{\eta}\), \(\xi\), and \(\rho\) to zero, we obtain

$$\omega_{\eta} = \sum_{i=1}^{l}\alpha_i\phi_{\eta}(x_i),\qquad \alpha_i = \frac{s_i}{\nu l} - \gamma_i,\qquad \sum_{i=1}^{l}\alpha_i = 1.\tag{19}$$

Substituting equation (19) into equation (18), the dual form of equation (17) can be written as

$$\min_{\alpha}\ \frac{1}{2}\sum_{i,j=1}^{l}\alpha_i\alpha_j k_{\eta}(x_i, x_j)
\quad\text{s.t.}\quad 0 \le \alpha_i \le \frac{s_i}{\nu l},\ \ \sum_{i=1}^{l}\alpha_i = 1.\tag{20}$$

Obviously, the only difference between the dual problems of MKOCSVM and FMKOCSVM is the upper bound on \(\alpha_i\), which becomes \(s_i/(\nu l)\) in equation (20). The decision function can be written as

$$f(x) = \operatorname{sgn}\Bigl(\sum_{i=1}^{l}\alpha_i k_{\eta}(x_i, x) - \rho\Bigr).\tag{21}$$

In FMKOCSVM, when noise receives a low membership during training, its negative effect on the classification hyperplane is reduced.
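Since the dual (20) is a small quadratic program, it can be solved with any off-the-shelf QP routine. The sketch below uses SciPy's SLSQP solver as a generic illustration of the membership-dependent box constraint; it is not the authors' implementation, and the toy data are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def fmkocsvm_dual(K, s, v):
    """Numerically solve the FMKOCSVM dual, equation (20):
        min_a 0.5 * a^T K a   s.t.  0 <= a_i <= s_i/(v*l),  sum_i a_i = 1.
    A generic QP sketch with an off-the-shelf solver."""
    l = len(s)
    ub = s / (v * l)                    # membership s_i shrinks the box constraint
    a0 = ub / ub.sum()                  # feasible start (requires sum(ub) >= 1)
    res = minimize(
        fun=lambda a: 0.5 * a @ K @ a,
        x0=a0,
        jac=lambda a: K @ a,
        bounds=[(0.0, float(u)) for u in ub],
        constraints=[{"type": "eq", "fun": lambda a: a.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x

# Toy problem: Gaussian kernel on random points, random memberships in (0.2, 1].
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0) + 1e-8 * np.eye(30)   # jitter keeps K positive definite
s = rng.uniform(0.2, 1.0, size=30)
alpha = fmkocsvm_dual(K, s, v=0.1)
```

A sample with a small membership therefore has a small ceiling on its α, which is exactly how low-weight (noisy) points are prevented from dominating the boundary.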

Training FMKOCSVM with Local Similarity-Based Membership
Noise in the training set may not belong to any class at all. Therefore, if such uncertain samples are distributed near the edges of the target data, the model will overfit. To alleviate this, we assign a membership to each training point so that samples play different roles in training, reducing the negative impact of noise. In this section, we first describe the local-similarity-based membership calculation in detail and then present the FMKOCSVM algorithm that uses it.

Local Similarity-Based Memberships.
Assume the target samples are \(x_1, x_2, \ldots, x_l \in R^n\). Let \(K_{\eta}\) denote the multiple kernel matrix defined by \(K_{\eta} = [k_{\eta}(x_i, x_j)]_{l\times l}\), where the multiple kernel function \(k_{\eta}\) is given by equation (10).
Sort all elements of the upper triangle of the multiple kernel matrix, i.e., \(\{k_{\eta}(x_i, x_j)\}_{i<j}\), from largest to smallest, and write them as a vector \(G = (g_1, g_2, \ldots, g_t)\), \(t = l(l-1)/2\).
Next, define a constant \(\varepsilon \in [0, 1]\) and let \(h = [\varepsilon t]\), where \([\varepsilon t]\) is the integer part of \(\varepsilon t\); then \(\theta = g_h\) is a threshold. For each sample \(x_i\), let \(u_i\) denote the number of indices \(j \in \{1, \ldots, l\}\) with \(k_{\eta}(x_i, x_j) \ge \theta\); in other words, \(u_i\) is the number of target samples whose similarity to \(x_i\) is greater than or equal to the threshold. The kernel \(k_{\eta}(x_i, x_j)\) measures the similarity between the target samples \(x_i\) and \(x_j\): a larger kernel value indicates greater similarity. If a sample \(x_i\) has a higher membership in the target class, then more input samples are similar to it, i.e., \(u_i\) is larger. In other words, a sample with a larger value of \(u_i\) should contribute more to the classification boundary and be penalized more heavily for misclassification, whereas noise will have a smaller value of \(u_i\).
Therefore, we take \(u_i\) as a measure function of the importance of a target sample to the classification hyperplane. Obviously, \(u_i\) cannot be used directly as a membership degree of FMKOCSVM; we use an S-type function to map this measure into a membership degree in the unit interval, which simultaneously amplifies the difference between memberships of samples of different importance. The membership function is written (in logistic form) as

$$s_i = \frac{1}{1 + \exp\bigl(-\tau\,(u_i/U - 1/2)\bigr)},\tag{22}$$

where \(U = \max_i u_i\) and \(\tau > 0\) is a constant; the value range is \((0, 1)\). Figure 3 depicts the distribution of membership values for different values of \(\tau\); according to Figure 3, \(\tau = 10\) gives the best distribution. Algorithm 1 lists the detailed calculation of the local-similarity-based membership.
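The membership calculation can be sketched as follows; the logistic S-type map in the last line is one concrete reading of equation (22), and the demo data (a tight cluster plus one far-away noise point) are illustrative assumptions:

```python
import numpy as np

def local_similarity_membership(K_eta, eps=0.5, tau=10.0):
    """Membership from local similarity (a sketch of Algorithm 1).
    K_eta is the combined kernel matrix; the logistic S-type map at the
    end is one plausible concrete form of the membership function."""
    l = K_eta.shape[0]
    G = np.sort(K_eta[np.triu_indices(l, k=1)])[::-1]  # upper triangle, descending
    h = int(eps * len(G))                              # h = [eps * t], t = l(l-1)/2
    theta = G[max(h - 1, 0)]                           # threshold g_h
    u = (K_eta >= theta).sum(axis=1)                   # u_i: samples >= theta-similar to x_i
    U = u.max()
    return 1.0 / (1.0 + np.exp(-tau * (u / U - 0.5)))  # S-type map into (0, 1)

# Demo: 20 target points in a tight cluster plus one far-away noise point.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.2, size=(20, 2)), [[5.0, 5.0]]])
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
K = np.exp(-sq / 2.0)                                  # single Gaussian kernel for the demo
s = local_similarity_membership(K)
```

On this toy data, the isolated point has few θ-similar neighbors, so it receives the smallest membership, which is the behavior the method aims for.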
Since noise has a low membership in the target class, it has few similar instances in the input data and thus receives a small membership value. Therefore, the proposed method lets noise exert less influence on the classification boundary. More importantly, because the training data of OCSVM include only target samples, traditional membership calculations are not suitable for OCSVM; our local-similarity membership uses only the features of the target data and no class information, which makes it well suited to the one-class classification problem. Furthermore, the proposed method is highly efficient.

Overall Procedure of FMKOCSVM Based on Local Similarity.
The detailed procedure of FMKOCSVM based on the local-similarity membership is listed in Algorithm 2. In the remainder of this paper, we use FMKOCSVM_LS to denote the proposed algorithm.
In Figure 4, the classification performance of MKOCSVM is shown in Figure 4(a) and that of the proposed FMKOCSVM_LS in Figure 4(b). The combined kernel consists of seven Gaussian kernels. Parameter \(\nu\) is set to 0.02, and the regularization coefficient \(\delta\) in equation (16) is set to 100. To reduce the cost of tuning extra parameters, we set \(\varepsilon\) to 0.2 and \(\tau\) to 10 directly. FMKOCSVM_LS clearly produces a tighter boundary than MKOCSVM: in Figure 4(b), some outliers are identified by FMKOCSVM_LS, whereas MKOCSVM identifies no outliers and leaves large gaps inside the boundary.
After adding 10% Gaussian noise to the training set, the results are shown in Figure 5, with the same parameter settings as in Figure 4. When the training set contains noise, the classification ability of FMKOCSVM_LS is much better than that of MKOCSVM. In Figure 5(a), MKOCSVM classifies all noise as target data, which degrades its performance badly. In Figure 5(b), most of the noise points receive very small membership values, so the negative effect of noise on the boundary is weak. Therefore, FMKOCSVM_LS improves the robustness of MKOCSVM.
In the next section, we further demonstrate experimentally that the proposed membership calculation outperforms previous methods.

Approaches.
We compared FMKOCSVM_LS with the following methods: (1) MKOCSVM: the ordinary multiple kernel one-class support vector machine [27]; (2) WMKOCSVM: the weighted one-class support vector machine formed by combining WOCSVM [14] with the multiple kernel function; (3) FMKOCSVM: the fuzzy multiple kernel one-class support vector machine, in which membership is calculated based on a rough set [19]. Because two classes of samples are needed to calculate that membership, its training set contains negative class samples; these negative samples are used only to calculate membership.

Algorithm 2: FMKOCSVM_LS.
Input: the training set x_1, x_2, ..., x_l; the kernel function set k_1, k_2, ..., k_P; the kernel parameter set σ_1, σ_2, ..., σ_P; the test set T
Output: the classification results y of T
(1) Preprocess the training set
(2) for m = 1 : P do
(3) Calculate the kernel matrix K_m corresponding to σ_m
(4) end for
(5) Substitute the kernel matrices K_m, m = 1, 2, ..., P into equation (16)
(6) Calculate the kernel weight vector η_m, m = 1, 2, ..., P according to equation (16)
(7) Calculate the combined kernel matrix K_η according to equation (10)
(8) for each x_i ∈ {x_1, x_2, ..., x_l} do
(9) Calculate the membership degree s_i of the sample x_i according to equation (22)
(10) end for
(11) Solve the dual problem (20) with K_η and the memberships s_i to obtain α and ρ
(12) Classify each sample in T with the decision function (21) to obtain y
In MKOCSVM, the parameter \(\nu\) is determined by 10-fold cross validation over the range \(\{0.01, 0.04, 0.07, 0.10, 0.13, 0.16, 0.19\}\). The basis kernels of the multiple kernel function are seven Gaussian kernels with widths \(\{2^{-6}, 2^{-5}, 2^{-4}, 2^{-3}, 2^{-2}, 2^{-1}, 2^{0}\}\). The parameter \(\delta\) in the multiple kernel learning algorithm is set to 100. WMKOCSVM, FMKOCSVM, and FMKOCSVM_LS use the same parameters during training. The number of nearest neighbors in WMKOCSVM is set to 10, as in [14]. To avoid extra parameter-tuning time when calculating the local-similarity membership in FMKOCSVM_LS, we directly set \(\varepsilon = 0.5\) and \(\tau = 10\).

Metrics.
In this paper, the performance of the different approaches is evaluated by three popular metrics, namely, g-mean, AUC, and training time. From the confusion matrix in Table 1, we obtain the true positive rate (TPR) and the false positive rate (FPR). In one-class classification problems, g-mean and AUC are more informative measures than accuracy:

$$\text{g-mean} = \sqrt{\mathrm{TPR}\times(1 - \mathrm{FPR})},$$

and AUC is the area under the ROC curve traced out by \((\mathrm{FPR}, \mathrm{TPR})\).
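Both metrics are straightforward to compute; the helpers below follow the standard definitions (g-mean as the geometric mean of TPR and TNR, AUC via the Mann-Whitney rank statistic) and are a sketch, not the paper's evaluation code:

```python
import numpy as np

def g_mean(y_true, y_pred):
    """g-mean = sqrt(TPR * TNR), with labels in {+1, -1} (+1 = target class)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == -1))
    tn = np.sum((y_true == -1) & (y_pred == -1))
    fp = np.sum((y_true == -1) & (y_pred == 1))
    tpr = tp / (tp + fn)       # true positive rate
    tnr = tn / (tn + fp)       # true negative rate, i.e., 1 - FPR
    return np.sqrt(tpr * tnr)

def auc(y_true, score):
    """AUC via the Mann-Whitney statistic: the probability that a random
    target sample scores higher than a random outlier (ties count 1/2)."""
    pos, neg = score[y_true == 1], score[y_true == -1]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (len(pos) * len(neg))

y_true = np.array([1, 1, 1, -1, -1])
y_pred = np.array([1, 1, -1, -1, 1])
score = np.array([0.9, 0.8, 0.4, 0.3, 0.6])
gm = g_mean(y_true, y_pred)    # sqrt((2/3) * (1/2))
a = auc(y_true, score)         # 5 of 6 positive-negative pairs correctly ordered
```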

Data Sets.
In this section, we selected 14 benchmark data sets, 13 of which come from the UCI machine learning repository. Three of the experiments concern biological systems.
The Heart data set is used for heart-disease diagnosis. The Breast data set is used to diagnose whether a patient's breast cancer is benign or malignant. The Biomed data set is used to screen for carriers. Creditcard_cut is part of the credit card fraud detection data set on Kaggle; because the original Creditcard data set is too large, we randomly selected 729 transactions (483 normal and 249 fraudulent) for the experiment. Table 2 lists the details of these data sets.
For each data set, we use 70% of the positive data as the training set. We then randomly select negative samples amounting to 10% of the training set as noise and add them to it. The remaining data form the test set. The training set is normalized before training, and the test set is processed with the training set's statistics.
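The split-and-normalize protocol above can be sketched as follows; `build_sets` is a hypothetical helper name, and the array shapes are illustrative:

```python
import numpy as np

def build_sets(X_pos, X_neg, noise_ratio=0.10, train_frac=0.70, seed=0):
    """Split as described above: 70% of the positives form the training set,
    negatives equal to 10% of its size are injected as noise, and the rest
    is the test set.  Normalization uses training-set statistics only.
    (A hypothetical helper; names and defaults are illustrative.)"""
    rng = np.random.default_rng(seed)
    p_idx = rng.permutation(len(X_pos))
    n_idx = rng.permutation(len(X_neg))
    n_tr = int(train_frac * len(X_pos))
    n_noise = int(noise_ratio * n_tr)
    X_train = np.vstack([X_pos[p_idx[:n_tr]], X_neg[n_idx[:n_noise]]])
    X_test = np.vstack([X_pos[p_idx[n_tr:]], X_neg[n_idx[n_noise:]]])
    y_test = np.r_[np.ones(len(X_pos) - n_tr), -np.ones(len(X_neg) - n_noise)]
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0) + 1e-12       # avoid division by zero
    return (X_train - mu) / sd, (X_test - mu) / sd, y_test

rng = np.random.default_rng(4)
X_pos = rng.normal(0.0, 1.0, size=(100, 3))
X_neg = rng.normal(3.0, 1.0, size=(30, 3))
X_train, X_test, y_test = build_sets(X_pos, X_neg)
```

Fitting the normalization on the training set only, then reusing its statistics on the test set, avoids information leakage from test data into training.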

Results.
To obtain stable results, each method was run in 20 independent experiments on each data set, and the reported result is the average of the 20 runs. Table 3 shows the optimal value of \(\nu\) obtained through 10-fold cross validation; to obtain the best results of the four algorithms, the \(\nu\) values in Table 3 are used in every experiment. Table 4 compares g-mean values, and Table 5 compares AUC values. Table 6 shows the average training time of MKOCSVM, WMKOCSVM, FMKOCSVM, and FMKOCSVM_LS on each data set, and Figure 6 shows each method's total training time over the 14 data sets.

Complexity
From Tables 4 and 5, we find that FMKOCSVM_LS performs best among the four algorithms, which shows that our membership method improves the robustness of MKOCSVM. More importantly, WMKOCSVM and FMKOCSVM each achieve the best result on only one of the 14 data sets, whereas the proposed method is optimal on twelve.
On the Iris, Breast, and Wdbc data sets, FMKOCSVM_LS shows clear advantages: its g-mean is 27%-32% higher than that of MKOCSVM, 10%-18% higher than that of FMKOCSVM, and 23%-31% higher than that of WMKOCSVM, and its AUC values on these data sets also increase significantly. On the Japan data set, although the g-mean of FMKOCSVM_LS is lower than that of FMKOCSVM, it is still 10% higher than that of MKOCSVM and 4% higher than that of WMKOCSVM; the AUC values behave the same way. On the Glass data set, WMKOCSVM has the best result, only about 2% higher than that of FMKOCSVM_LS; however, FMKOCSVM_LS is still 5% higher than MKOCSVM there, which confirms that our membership calculation reduces the impact of noise on classification ability. On the remaining nine data sets, FMKOCSVM_LS is the best with clear margins; for example, on the Waveform data set, its g-mean is 10% higher than that of WMKOCSVM and 4% higher than that of FMKOCSVM.

[Fragment of Table 2 recovered from the text (No., data set, #positive, #negative, #total, #features): 1 Australia 383/307/690/14; 2 Balancescaleleft 288/337/625/4; 3 Biomed 127/67/194/5; 4 Glass 70/144/214/9; 5 Heart 160/137/297/13; 6 Vowel 48/480/528/10; 7 Wine 48/130/178/13; 8 Creditcard_cut 483/249/729/30; 9 Japan 357/294/651/15; 10 Iris 50/100/150/4; 11 Breast ...]
In terms of training time, although our method is not the fastest, FMKOCSVM_LS is still faster than WMKOCSVM: on average, the training time of WMKOCSVM is 1.5 times that of FMKOCSVM_LS. Compared with MKOCSVM, the extra training time of FMKOCSVM_LS is within an acceptable range.
All of the above shows that MKOCSVM with membership is more robust and that the proposed local-similarity membership performs best.

Conclusions
To address the poor robustness of MKOCSVM, this paper proposes a fuzzy multiple kernel one-class support vector machine based on local similarity, in which membership is based on the local similarity of the training data. First, the similarity between samples is measured by the combined kernel matrix.
Then, according to the selected threshold, the local similarity of each sample is determined. Finally, an S-type function maps the local similarity to the unit interval, and the function value is taken as the membership value. Experiments show that the proposed membership method improves the robustness of MKOCSVM and, compared with the other two membership methods, performs best. The difficulty in a fuzzy multiple kernel one-class support vector machine lies in determining an effective membership. Compared with previous membership calculations, the local-similarity membership needs only the target data, which is consistent with the OCSVM training set. In this method, noise and outliers are assigned small membership values, so noise has the weakest impact on the classification boundary. Therefore, the membership method in this paper helps improve the robustness of MKOCSVM. In future work, we will study how to optimize the parameters in the local-similarity membership calculation.

Data Availability
The data underlying this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.