
There already exist various approaches to outlier detection, among which semisupervised methods achieve encouraging superiority owing to the introduction of prior knowledge. In this paper, an adaptive feature weighted clustering-based semisupervised outlier detection strategy is proposed. The method maximizes the membership degree of a labeled normal object to the cluster it belongs to and minimizes the membership degrees of a labeled outlier to all clusters. Since the features or components of a dataset differ in their significance for determining whether an object is an inlier or an outlier, each feature is adaptively assigned a different weight according to the degree of deviation between that feature of all objects and the corresponding feature of a given cluster prototype. A series of experiments on a synthetic dataset and several real-world datasets verifies the effectiveness and efficiency of the proposed method.

Outlier detection is an important topic in the data mining community; in contrast to other data mining techniques, it aims at finding patterns that occur infrequently [

Recently, research on outlier detection has been very active, and many approaches have been proposed. In general, existing work on outlier detection can be broadly classified into three categories, depending on whether label information is available to build outlier detection models: unsupervised, supervised, and semisupervised methods.

Supervised outlier detection concerns the situation where the training dataset contains prior information about whether each instance is normal or abnormal. One-class support vector machine (OCSVM) [

Unsupervised outlier detection, without prior information about the class distribution, is generally classified into distribution-based [

Clustering-based approaches [

In many real-world applications, one may encounter cases where a small set of objects is labeled as outliers or as belonging to a certain class, while most of the data are unlabeled. Studies indicate that the introduction of a small amount of prior knowledge can significantly improve the effectiveness of outlier detection [

Most of the previous research treats different features of objects equally in the outlier detection process, which does not conform to the intrinsic characteristics of a dataset. In fact, it is more reasonable for different features to have different importance in each cluster, especially for high-dimensional sparse datasets, where the structure of each cluster is often confined to a subset of features rather than the entire feature set. Several works on feature weighted clustering have been reported. Huang et al. [

To make full use of prior knowledge to facilitate clustering-based outlier detection, we develop a semisupervised outlier detection algorithm based on adaptive feature weighted clustering (SSOD-AFW), in which the feature weights are obtained iteratively. The proposed algorithm emphasizes the diversity of features within each cluster and assigns lower weights to irrelevant features to reduce their negative influence on the outlier decision. Furthermore, based on the observation that outliers usually have low membership to every cluster, we relax the constraint of fuzzy c-means (FCM) clustering that the membership degrees of a sample to all clusters must sum to one and propose an adaptive feature weighted semisupervised possibilistic clustering-based outlier detection algorithm. The interaction between optimal clustering and outlier detection is addressed in the proposed method. Label information is introduced into the possibilistic clustering method according to the following principles: (1) maximizing the membership degree of a labeled normal object to the cluster it belongs to; (2) minimizing the membership degrees of a labeled normal object to the clusters it does not belong to; and (3) minimizing the membership degrees of a labeled outlier to all clusters. In addition, the new clustering objective function simultaneously minimizes the dispersion within clusters to achieve a proper cluster structure. Finally, the resulting optimal membership degrees are used to indicate the outlying degree of each sample in the dataset. In comparison with typical outlier detection methods, the proposed algorithm proves promising in terms of accuracy, running time, and other evaluation metrics.

The remainder of this paper is organized as follows. Section

Let

FCM is a well-known clustering algorithm [
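For reference, the standard FCM iteration that the proposed method departs from can be sketched as follows (textbook formulation with Euclidean distances; this is not the weighted, semisupervised objective developed later in this paper):

```python
import numpy as np

def fcm(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
    """Minimal standard fuzzy c-means: each row of U sums to one."""
    rng = np.random.default_rng(seed)
    n = len(X)
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)              # fuzzy partition constraint
    for _ in range(n_iter):
        Um = U ** m
        V = (Um.T @ X) / Um.sum(axis=0)[:, None]   # prototype update
        d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(-1) + 1e-12
        inv = d2 ** (-1.0 / (m - 1))
        U_new = inv / inv.sum(axis=1, keepdims=True)  # membership update
        if np.abs(U_new - U).max() < tol:
            U = U_new
            break
        U = U_new
    return U, V
```

The constraint that each row of `U` sums to one is exactly what makes plain FCM ill-suited to outlier scoring: even a point far from every prototype is forced to distribute a full unit of membership across the clusters.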

Later, another unsupervised possibilistic clustering algorithm (PCA) was proposed by Yang and Wu [

In this section, we introduce prior knowledge into possibilistic c-means clustering method to improve the performance of outlier detection. First, a small subset of samples in a given dataset

If an object

If

If

Data often contain a number of redundant features. The cluster structure of a given dataset is often confined to a subset of features rather than the entire feature set, and irrelevant features merely obscure the discovery of that structure by a clustering algorithm. A genuine outlier is then easily overlooked because of the resulting vagueness of the cluster structure. Figure

Three-dimensional synthetic example.

Plot of the space (

Plot of the subspace (

Plot of the subspace (
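The masking effect illustrated above can be reproduced numerically. The following sketch uses made-up numbers and a crude global inverse-variance weighting (not the paper's cluster-wise adaptive weight update) to show how an irrelevant high-variance feature hides an outlier under plain Euclidean distance:

```python
import numpy as np

rng = np.random.default_rng(1)
# two informative features forming a tight cluster, one irrelevant noisy feature
inliers = np.hstack([rng.normal(0, 0.1, (100, 2)), rng.normal(0, 5.0, (100, 1))])
outlier = np.array([[2.0, 2.0, 0.0]])     # deviates only in the informative subspace
X = np.vstack([inliers, outlier])
center = X.mean(axis=0)

d_plain = np.linalg.norm(X - center, axis=1)               # equal feature weights
w = 1.0 / X.var(axis=0)                                    # crude inverse-variance weights
w /= w.sum()
d_weighted = np.sqrt(((X - center) ** 2 * w).sum(axis=1))  # weighted distance

# rank of the outlier (0 = farthest from the center)
rank_plain = (d_plain > d_plain[-1]).sum()
rank_weighted = (d_weighted > d_weighted[-1]).sum()
```

Under the plain distance, many inliers appear farther from the center than the true outlier because the noisy third feature dominates; once that feature is down-weighted, the outlier ranks farthest.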

In our research, let

Points within a cluster usually exhibit strong correlation, whereas outliers are only weakly correlated with other points. That is, normal points belong to one of the

The first term in (

The virtue of semisupervised indicator matrix

In this subsection, an iterative algorithm for minimizing

First, in order to minimize

By taking the gradient of

Then

Substituting (

It follows that

The updating criteria of feature weight

The updating way of

To find the optimal cluster prototype

The updating formula of cluster prototype

To solve the optimal fuzzy partition matrix

The updating formula of

Formula (

Based on the above analysis, outliers should have low membership degrees to all clusters. Therefore, the sum of the memberships of an object to all clusters can be used to evaluate its outlying degree. For a certain object

Thus, a small value of
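Once the optimal membership matrix is available, the OD computation and ranking described above reduce to a few lines. A sketch, where `U` is the n×c membership matrix (rows are not constrained to sum to one):

```python
import numpy as np

def outlying_degree(U):
    """OD of each object = sum of its possibilistic memberships to all clusters.
    A small OD means the object fits no cluster well, i.e. it is outlying."""
    return U.sum(axis=1)

def top_outliers(U, k):
    """Indices of the k objects with the smallest OD (most outlying first)."""
    od = outlying_degree(U)
    return np.argsort(od)[:k]

# toy membership matrix: object 2 has uniformly low memberships
U = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.05, 0.1]])
print(top_outliers(U, 1))   # -> [2]
```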

In summary, the description of the SSOD-AFW algorithm is shown in Algorithm

Calculate the parameter

Compute the matrix of cluster prototype

Update the feature weight matrix

Update the feature weighted distance

Update parameter

Update the membership degree matrix

If

The outlying degree of each object is computed, and the OD values are sorted in ascending order. Finally, output the top
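The overall flow of the steps above can be sketched as follows. This is a deliberately simplified, unsupervised sketch: hard assignments drive the prototype and weight updates, a one-shot exponential (possibilistic) membership yields the OD, and the weight exponent `beta` is a hypothetical parameter; the paper's joint semisupervised update formulas are not reproduced here:

```python
import numpy as np

def ssod_afw_sketch(X, c, V0, n_iter=20, beta=2.0):
    """Simplified sketch only: hard assignments for the prototype/weight steps,
    then a one-shot possibilistic membership for the outlying degree."""
    n, p = X.shape
    V = np.array(V0, dtype=float)       # initial cluster prototypes
    W = np.full((c, p), 1.0 / p)        # per-cluster feature weights
    for _ in range(n_iter):
        # feature-weighted squared distances to each prototype
        d2 = np.stack([((X - V[i]) ** 2 * W[i]).sum(1) for i in range(c)], axis=1)
        lab = d2.argmin(1)              # hard assignment for this sketch
        for i in range(c):
            pts = X[lab == i]
            if len(pts) == 0:
                continue
            V[i] = pts.mean(0)          # prototype update
            dev = ((pts - V[i]) ** 2).sum(0) + 1e-12
            wi = (1.0 / dev) ** (1.0 / (beta - 1))
            W[i] = wi / wi.sum()        # small deviation -> large feature weight
    d2 = np.stack([((X - V[i]) ** 2 * W[i]).sum(1) for i in range(c)], axis=1)
    lab = d2.argmin(1)
    eta = np.array([d2[lab == i, i].mean() for i in range(c)])  # cluster scales
    U = np.exp(-d2 / eta)               # memberships need not sum to one
    od = U.sum(axis=1)                  # small OD => likely outlier
    return od, U, V, W
```

The sketch preserves the structure of the loop (weights, distances, scale parameter, prototypes, memberships, then OD ranking) while replacing each update with a simple stand-in.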

Computational complexity analysis: Step (2) needs

In this section, we discuss the convergence of the proposed SSOD-AFW algorithm. To prove the convergence of objective function

Objective function

Due to the fact that

Since

Objective function

Similar to Lemma

Since

Objective function

The proof of Lemma

Objective function

The objective function

Lemmas

Comprehensive experiments and analyses on a synthetic dataset and several real-world datasets are conducted to show the effectiveness and superiority of the proposed SSOD-AFW. We compare the proposed algorithm with two state-of-the-art unsupervised outlier detection algorithms, LOF [

For numerical performance evaluation of outlier detection algorithms, three metrics, namely, accuracy [

Let

The receiver operating characteristic (ROC) curve represents the trade-off relationship between the detection rate and the false alarm rate. In general, the area under the ROC curve (AUC) is used to measure the performance of outlier detection method, and the value of AUC for ideal detection performance is close to one.
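The AUC can be computed directly from the outlier scores as a rank statistic: it equals the probability that a randomly chosen true outlier receives a higher outlier score than a randomly chosen inlier. A minimal sketch, where `scores` are outlier scores (higher = more outlying, e.g. the negative OD in our setting) and `labels` mark the true outliers:

```python
import numpy as np

def auc(scores, labels):
    """AUC via the Mann-Whitney U statistic (ties count half)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    # pairwise comparisons; quadratic but fine for datasets of this size
    wins = (pos[:, None] > neg[None, :]).sum() \
         + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

print(auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))   # -> 1.0 (perfect ranking)
```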

For a given outlier detection algorithm, true outliers occupy top positions with respect to the nonoutliers among

RP reaches the maximum value 1 when all
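Assuming the commonly used rank-power definition, RP = m(m+1) / (2 · Σ ranks of the true outliers in the returned list), which indeed equals 1 exactly when all m true outliers occupy the top m positions, the metric can be sketched as:

```python
def rank_power(ranked_is_outlier):
    """Rank power of a ranked top-n list (True = true outlier at that rank).
    Assumed definition: RP = m(m+1) / (2 * sum of the outliers' ranks),
    ranks starting at 1; RP = 1 iff all true outliers are ranked first."""
    ranks = [i + 1 for i, flag in enumerate(ranked_is_outlier) if flag]
    m = len(ranks)
    return m * (m + 1) / (2 * sum(ranks)) if m else 0.0

print(rank_power([True, True, False]))   # -> 1.0 (both outliers on top)
```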

A two-dimensional synthetic dataset with two cluster patterns is generated from Gaussian distributions to intuitively compare the outlier detection results of the proposed method with those of the other four algorithms mentioned above. The mean vectors of the two clusters are
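A comparable dataset can be generated as follows; the means, covariances, and counts here are illustrative placeholders, not the paper's actual parameters:

```python
import numpy as np

rng = np.random.default_rng(42)
# illustrative parameters only -- not the paper's actual means/covariances
n_per, n_out = 100, 10
c1 = rng.multivariate_normal([0, 0], 0.3 * np.eye(2), n_per)
c2 = rng.multivariate_normal([6, 6], 0.3 * np.eye(2), n_per)
outliers = rng.uniform(-4, 10, (n_out, 2))       # scattered background noise
X = np.vstack([c1, c2, outliers])
y = np.r_[np.zeros(2 * n_per), np.ones(n_out)]   # 1 = outlier ground truth
```

Note that uniformly scattered noise points can occasionally fall inside a cluster; a careful benchmark would reject such draws before labeling them as outliers.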

Outlier detection results of different algorithms on the two-dimensional synthetic dataset.

Original synthetic dataset

LOF

SVDD

EODSP

SSOD-AFW

In Figure

Figure

Performance comparison of different algorithms on the synthetic dataset.

Accuracy

AUC

RP

Furthermore, during the experimental process shown in Figure

Outlier detection result of the nonweighted SSOD-AFW on the synthetic dataset.

For further verification of the effectiveness of the proposed algorithm, five real-world datasets from UCI Machine Learning Repository [

Description of real-world datasets.

Dataset | Instances | Features | Outlying classes | Outliers (percent) | Clusters | Prior information
---|---|---|---|---|---|---
Iris | 126 | 4 | “Virginica” | 26 (20.63%) | 2 | 10 labeled normal samples, 4 labeled outliers
Abalone | 4177 | 8 | “1”–“4”, “16”–“27”, “29” | 335 (8.02%) | 11 | 11 labeled normal samples, 18 labeled outliers
Wine | 130 | 13 | “3” | 11 (8.46%) | 2 | 9 labeled normal samples, 4 labeled outliers
Ecoli | 336 | 9 | “omL”, “imL”, “imS” | 9 (2.68%) | 5 | 11 labeled normal samples, 3 labeled outliers
WDBC | 387 | 30 | “Malignant” | 30 (7.75%) | 1 | 10 labeled normal samples, 8 labeled outliers

We compare the outlier detection performance of the proposed algorithm with LOF,

Figure

Performance comparison of various algorithms on the real-world datasets.

Accuracy

AUC

RP

It is worth mentioning that the experiment of the proposed algorithm on WDBC involves a one-class clustering problem. Although one-class clustering is generally meaningless as a clustering task in itself, one-class clustering-based outlier detection is both meaningful and feasible in our proposal, because our approach does not require that the membership degrees sum to 1. This is one of the powerful and important characteristics of the proposed algorithm.
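The point can be made concrete with a tiny single-prototype example (illustrative only; the exponential membership and the median-based scale are assumptions, not the paper's exact formulas):

```python
import numpy as np

rng = np.random.default_rng(3)
# one Gaussian cluster plus one distant outlier (index 50)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), [[6.0, 6.0]]])
v = X.mean(axis=0)                      # the single cluster prototype
d2 = ((X - v) ** 2).sum(axis=1)
eta = np.median(d2)                     # one simple choice of scale parameter
u = np.exp(-d2 / eta)                   # memberships are NOT forced to sum to 1
# even with only one cluster, the outlier receives a near-zero membership,
# so the OD ranking remains well defined
```

Under a probabilistic (sum-to-one) constraint, every object would trivially receive membership 1 to the single cluster and no object could be flagged; the possibilistic formulation avoids this degeneracy.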

In this subsection, we investigate the influence of the proportion of labeled samples on the accuracy of our method. Two typical situations are considered. In the first, the proportion of labeled outliers increases while the number of labeled normal objects is fixed; in the second, the percentage of labeled normal samples varies while the number of labeled outliers is fixed. Accordingly, two groups of experiments are designed to compare the accuracy of the proposed algorithm against EODSP under different percentages of labeled outliers and labeled normal samples, respectively, on the Iris, Abalone, Wine, Ecoli, and WDBC datasets. In both experiments, the percentage of labeled outliers or labeled normal samples ranges from 0% to 40% while the number of the other kind of labeled objects is fixed. We randomly select a certain number of labeled outliers or normal samples from each dataset, repeat each experiment 10 times, and compute the average accuracies of SSOD-AFW and EODSP.

Figure

Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled outliers on the real-world datasets.

Iris

Abalone

Wine

Ecoli

WDBC

Figure

Accuracy analysis of algorithms EODSP and SSOD-AFW with different percent of labeled normal samples on the real-world datasets.

Iris

Abalone

Wine

Ecoli

WDBC

The parameters

The parameter

Outlier detection accuracy of the proposed algorithm under various parameters on the real-world datasets.

Parameter

Parameter

Parameter

Figure

Execution time comparison of different algorithms on the real-world datasets.

In order to detect outliers more precisely, a semisupervised outlier detection algorithm based on adaptive feature weighted clustering, called SSOD-AFW, is proposed in this paper. Distinct weights for each feature with respect to different clusters are obtained by adaptive iteration, so that the negative effects of irrelevant features on outlier detection are weakened. Moreover, the proposed method makes full use of the prior knowledge contained in datasets and detects outliers by virtue of the cluster structure. A series of experiments verifies that the proposed SSOD-AFW algorithm is superior to other typical unsupervised, semisupervised, and supervised algorithms in both outlier detection precision and running speed.

In this paper, we present a new semisupervised outlier detection method that utilizes the labels of a small number of objects. However, our method assumes that the labels are reliable and does not penalize mislabeling in the objective function. A robust version of the proposed method that can deal with noisy or imperfect labels therefore deserves further study. Moreover, since only one typical dissimilarity measure, the Euclidean distance, is considered, the SSOD-AFW algorithm is limited to outlier detection on numerical data. Future research aims at extending our method to mixed-attribute data in more real-life applications, such as fault diagnosis in industrial processes or network anomaly detection.

The authors declare that there is no conflict of interest regarding the publication of this paper.

This work was supported by the National Natural Science Foundation of China (Grant no. 11471001).