
For many applications, finding rare instances or outliers can be more interesting than finding common patterns. Existing work on outlier detection, however, has not considered the deep web setting. In this paper, we argue that, for many scenarios, it is more meaningful to detect outliers over the deep web. In the deep web, users must submit queries through a query interface to retrieve data, so traditional data mining methods cannot be applied directly. The primary contribution of this paper is a new data mining method for outlier detection over the deep web. In our approach, the query space of a deep web data source is stratified based on a pilot sample. On top of this stratification, we develop neighborhood sampling to improve recall and uncertainty sampling to improve precision. Finally, a careful performance evaluation confirms that our approach can effectively detect outliers in the deep web.

As a result of the rapid development of e-commerce, the deep web has drawn increasing attention from data mining researchers in recent years. The deep web, so named in contrast to the surface web, refers to data sources with back-end databases that are accessible only through a query interface [

This paper focuses on the problem of outlier detection over the deep web. To the best of our knowledge, this problem has not yet been addressed in existing work. An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism. Detecting outliers over the deep web has great practical significance. For example, outliers may be commodities whose prices are abnormal because of data-entry mistakes. Third-party collaborators of a website want to detect such outliers promptly and notify the person responsible so that the data can be corrected and losses reduced, while website users have a great interest in finding these commodities.

Outlier detection has long been an active research topic in data mining, and many notable results have been published in recent years. As the survey [

A naive solution for outlier detection over the deep web is to download all records from the back-end database and then mine outliers using one of the traditional approaches discussed in the survey [

Thus, a practical solution is to randomly sample the back-end database of a deep web data source in order to detect outliers. The back-end database is a kind of hidden database; sampling for outlier detection in a hidden database has been studied in [

Our approach to outlier detection over the deep web builds primarily on distance-based outlier detection. We formally define that an instance

In summary, this paper presents a novel problem, outlier detection over the deep web, and proposes and empirically evaluates a stratification-based detection method. Our contributions are as follows. First, we formulate the problem and develop a stratification scheme for a deep web data source: stratification is performed through a hierarchical tree that models the relationship between the input and output attributes based on a pilot sample. Second, instead of random sampling across the strata, we develop a neighborhood sampling scheme that collects more outliers by exploring query subspaces with a high probability of containing them. Finally, we develop an uncertainty sampling algorithm that verifies uncertain instances in order to improve detection precision.

The rest of this paper is organized as follows. In Section

Before presenting our solution for outlier detection over the deep web, this section introduces the basic process of sampling and outlier detection in the deep web environment.

Let us consider an example where Table

An example of a back-end database for electronic products.

ID | Brand | Type | Screen (inch) | Price | StandBy (h)
---|---|---|---|---|---
1 | Samsung | Phone | 3.5 | 666.51 | 36
2 | Samsung | Phone | 3.5 | 356.06 | 30
3 | Samsung | Phone | 4.3 | 378.8 | 30
4 | Samsung | Laptop | 13.3 | 666.51 | 6
5 | Samsung | Laptop | 11.6 | 1107.53 | 8
6 | Apple | Phone | 4.0 | 801.21 | 36
7 | Apple | Phone | 4.0 | 696.81 | 36
8 | Apple | Phone | 3.5 | 498.18 | 24
9 | Apple | Laptop | 15.4 | 2831.51 | 10
10 | Apple | Laptop | 13.3 | 1180 | 10

Among the existing methods for detecting outliers, distance-based outlier (DB-outlier) detection is one of the simplest and most commonly used. An object
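The distance-based notion above can be sketched as follows. This is a minimal brute-force illustration, not the paper's exact formulation: a point is flagged as a DB(p, d)-outlier when at least a fraction `p` of the remaining points lie farther than distance `d` from it; the thresholds here are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def db_outliers(data, p=0.95, d=3.0):
    """Return points o such that at least a fraction p of the other
    points lie farther than distance d from o (DB(p, d)-outliers)."""
    outliers = []
    n = len(data)
    for i, o in enumerate(data):
        far = sum(1 for j, q in enumerate(data)
                  if j != i and euclidean(o, q) > d)
        if far >= p * (n - 1):
            outliers.append(o)
    return outliers

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.15, 0.05), (9.0, 9.0)]
print(db_outliers(points, p=0.9, d=3.0))  # only the distant point qualifies
```

The quadratic pairwise scan is fine for a sample but would not scale to a full back-end database, which is one motivation for sampling in the first place.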

To the best of our knowledge, outlier detection over the deep web is a completely novel problem that couples outlier detection with the deep web; no existing solution is available. In the following sections, we introduce our solution step by step.

Having introduced the basic process of sampling and outlier detection in the deep web environment, we now elaborate the problem and our proposed method. This paper primarily considers categorical input attributes; continuous input attributes can be discretized into categorical ones.

Typically, given a query composed of values of one or more input attributes, a deep web data source returns the number of data records satisfying the query. Using this information, the distribution over the input attributes can be obtained. Since the distribution of the output attributes is unknown, discovering outliers on the output attributes is a great challenge. However, if the relationship between the input and output attributes is known, we can identify outliers on the output attributes using the distribution of the input attributes. In our method, this relationship is built by stratification, the process of dividing an entire population into subpopulations based on a pilot sample.

There are two other important steps in our approach after stratification: neighborhood sampling and uncertainty sampling. The goal of these two steps is to collect more outliers while keeping a suitable precision under a limited query cost. For each record obtained, we assign a probability of being an outlier and classify the record into one of three classes (outlier, normal, or uncertain) based on that probability.

In general, our approach proceeds as follows. We first obtain a pilot sample by randomly sampling the deep web data source. The population is stratified based on the pilot sample, and neighborhood sampling across the subpopulations is then performed to collect more outliers. Finally, uncertainty sampling is performed on the uncertain records to avoid misjudgment. Each step is explained in detail below.

When a population consists of several significantly different parts, it is commonly divided into subpopulations, called strata, so that samples adequately reflect the distribution of the population. In our algorithm, stratification is performed so that data records in the same stratum are as similar as possible; outliers are thus isolated in a few strata. The whole data in a deep web data source can be viewed as the entire query space, and the subpopulations correspond to query subspaces. After stratification, the distribution of the output attributes can be predicted effectively from the values of the input attributes in each stratum, and submitting queries from the corresponding subspace lets us obtain that subpopulation's data records. Since the primary purpose of stratification is to group similar data records into the same stratum, how to perform the stratification is an important issue.

We adopted the strategy of building a hierarchical tree to stratify the deep web data source. Formally,

For a potential splitting input attribute

Without such a safeguard, the input query space would in most cases be overstratified, to the point where each stratum contains only one sample. Thus, we utilize a statistical hypothesis test to check whether the decrease in radius is significant. The idea behind the test is that if there is no significant relationship between the splitting input attribute
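One plausible instantiation of such a check is sketched below; the paper's exact statistic is not reproduced here, so this is a generic two-sample z-test on the within-stratum radii (distances to the stratum centroid in the pilot sample), using the 1.96 critical value mentioned in the experimental setup. The function name and inputs are hypothetical.

```python
import math

def radius_decrease_significant(parent_radii, child_radii, z_crit=1.96):
    """Hypothetical z-test: is the mean within-stratum radius of the
    candidate child subspaces significantly smaller than the parent's?
    Both arguments are per-sample distances to the stratum centroid."""
    n1, n2 = len(parent_radii), len(child_radii)
    m1 = sum(parent_radii) / n1
    m2 = sum(child_radii) / n2
    v1 = sum((r - m1) ** 2 for r in parent_radii) / (n1 - 1)
    v2 = sum((r - m2) ** 2 for r in child_radii) / (n2 - 1)
    z = (m1 - m2) / math.sqrt(v1 / n1 + v2 / n2)
    return z > z_crit  # split only when the decrease is significant
```

A split that merely shuffles records without tightening the strata yields a z value near zero and is rejected, which stops the tree from growing until every stratum holds a single sample.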

The statistics

In Algorithm

[Algorithm 1: stratification of the query space. Surviving steps: (11) initialize the radius of child-spaces; (13) compute …; (17) compute the statistics ….]

For each potential splitting attribute, the radius decrement is computed according to (
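The greedy split selection described above can be sketched as follows, under simplifying assumptions: a one-dimensional output attribute, the stratum radius taken as the mean absolute deviation from the centroid, and hypothetical record/attribute names. The paper's actual radius-decrement formula is in the (unreproduced) equation it cites.

```python
def radius(values):
    """Radius of a stratum: mean distance of its output values to their
    centroid -- a simplified stand-in for the paper's measure."""
    if not values:
        return 0.0
    c = sum(values) / len(values)
    return sum(abs(v - c) for v in values) / len(values)

def best_split(sample, input_attrs):
    """Pick the categorical input attribute whose split yields the
    largest decrease in weighted child radius (greedy, one level)."""
    parent_r = radius([rec["out"] for rec in sample])
    best, best_dec = None, 0.0
    for a in input_attrs:
        groups = {}
        for rec in sample:
            groups.setdefault(rec[a], []).append(rec["out"])
        child_r = sum(len(g) * radius(g) for g in groups.values()) / len(sample)
        dec = parent_r - child_r
        if dec > best_dec:
            best, best_dec = a, dec
    return best, best_dec

sample = [
    {"brand": "A", "type": "x", "out": 1.0},
    {"brand": "A", "type": "y", "out": 1.2},
    {"brand": "B", "type": "x", "out": 10.0},
    {"brand": "B", "type": "y", "out": 10.2},
]
print(best_split(sample, ["brand", "type"]))  # "brand" separates the outputs
```

Here "brand" perfectly separates the two output clusters while "type" does not, so it is chosen as the splitting attribute; in the full algorithm this choice would additionally have to pass the significance test before the split is committed.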

After stratification, similar data objects tend to come from the same query subspace or neighboring query subspaces. This suggests that we can obtain more outliers from a query subspace, or its neighbors, in which outliers have already been identified. A key consideration, then, is how to quantify a data object's abnormality in the deep web. For a data object

Suppose that we have obtained

The variance for

According to (

Using the definition of
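Since the estimator equations themselves were lost in extraction, the standard stratified form they most plausibly follow is reproduced here as an assumption, with $N_h$, $n_h$, and $\hat{p}_h$ denoting the size, sample size, and estimated outlier fraction of stratum $h$:

```latex
\hat{p} = \sum_{h=1}^{H} \frac{N_h}{N}\,\hat{p}_h,
\qquad
\operatorname{Var}(\hat{p}) = \sum_{h=1}^{H}
  \left(\frac{N_h}{N}\right)^{2}
  \frac{\hat{p}_h\,(1-\hat{p}_h)}{n_h}\,
  \frac{N_h - n_h}{N_h - 1}.
```

The second factor in the variance is the finite-population correction, which vanishes when a stratum is sampled exhaustively.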

Since query submission is expensive, we need to collect more outliers at a low query cost. In other words, we want to retrieve as many outliers as possible within a given query budget. If the probability of each stratum containing outliers is known, we can achieve this goal by assigning more of the query budget to strata with high probability.

For each data record

This shows that the sample size of the
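The budget-allocation idea behind neighborhood sampling can be sketched as follows; the proportional rule and the uniform fallback are illustrative assumptions, since the paper's exact sample-size formula is in an equation not reproduced here.

```python
def allocate_budget(strata_outlier_prob, total_queries):
    """Distribute a query budget across strata in proportion to each
    stratum's estimated probability of containing outliers, so that
    promising neighborhoods receive more queries."""
    total_p = sum(strata_outlier_prob.values())
    if total_p == 0:  # no evidence yet: fall back to uniform allocation
        k = len(strata_outlier_prob)
        return {s: total_queries // k for s in strata_outlier_prob}
    return {s: int(total_queries * p / total_p)
            for s, p in strata_outlier_prob.items()}

alloc = allocate_budget({"s1": 0.6, "s2": 0.3, "s3": 0.1}, 100)
print(alloc)
```

Strata where outliers have already surfaced, and their neighbors in the hierarchical tree, thus dominate the remaining query budget.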

In this subsection we introduce our uncertainty sampling method. The uncertainty here refers to the possibility that a data record belongs to the outliers. According to the degree of uncertainty, data records can be divided into three classes: (a) the outlier class; (b) the normal class; and (c) the uncertain class. For records in class (a) or (b), we can identify their abnormality with confidence, but for records in class (c) there is a great possibility of misjudgment. Misjudgment, taking a true outlier for normal or a true normal record for an outlier, leads to low precision. Its fundamental cause is the divergence between the distribution of the samples and that of the underlying population, which leads to an incorrect estimate of the fraction. To address this, we developed uncertainty sampling with the goal of reducing the variance of the estimated fraction

Now, we formulate our description above. Let

Thus, a set of uncertain data objects will be picked out. To improve the precision, our task is to obtain a sample for identifying uncertain data records. For uncertain data records

The solution for the objective to be minimized above is

Using the definition of the summation variance, we have

Under the limitation
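A sample-size allocation that minimizes the variance of a stratified fraction estimate under a fixed budget takes the classical Neyman form, n_h ∝ N_h σ_h. The sketch below illustrates that standard solution; whether it matches the paper's exact minimizer is an assumption, and the inputs are hypothetical.

```python
import math

def neyman_allocation(strata, budget):
    """Neyman allocation: given per-stratum sizes N_h and estimated
    outlier fractions p_h, assign sample sizes proportional to
    N_h * sigma_h with sigma_h = sqrt(p_h * (1 - p_h)), which minimizes
    the variance of the stratified estimate for a fixed total budget."""
    weights = {h: N * math.sqrt(p * (1 - p)) for h, (N, p) in strata.items()}
    total = sum(weights.values())
    if total == 0:
        return {h: 0 for h in strata}
    return {h: round(budget * w / total) for h, w in weights.items()}

strata = {"A": (1000, 0.5), "B": (1000, 0.1), "C": (2000, 0.0)}
print(neyman_allocation(strata, 120))
```

Note that stratum C, whose estimated fraction is 0 (zero variance), receives no budget: homogeneous strata contribute nothing to the estimator's variance, so all queries go where the uncertainty lives.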

Our uncertainty sampling method can be viewed as a generalization of the

A distance between each pair of sampled data records is computed and then we compute the probability

A sufficient condition for identifying a data record is as follows.

For the outliers,

For the normal data records,

For a sampled data record

If

Otherwise,
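The three-way decision described above can be sketched with a normal-approximation confidence interval around each record's estimated outlier probability; the interval form, the 0.5 decision threshold, and the function signature are illustrative assumptions rather than the paper's exact sufficient condition.

```python
import math

def classify(p_hat, n, z=1.96, threshold=0.5):
    """Classify a record as 'outlier', 'normal', or 'uncertain' by
    comparing a confidence interval around its estimated outlier
    probability p_hat (from n supporting samples) with a decision
    threshold; intervals straddling the threshold remain uncertain."""
    margin = z * math.sqrt(max(p_hat * (1 - p_hat), 1e-12) / n)
    if p_hat - margin > threshold:
        return "outlier"
    if p_hat + margin < threshold:
        return "normal"
    return "uncertain"

print(classify(0.9, 100))   # interval entirely above the threshold
print(classify(0.1, 100))   # interval entirely below the threshold
print(classify(0.55, 20))   # interval straddles the threshold
```

Only the third kind of record triggers further uncertainty sampling: more samples shrink the margin until the interval clears the threshold on one side.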

Now, we summarize our overall method for detecting outliers from a deep web data source. The overall process is shown in Algorithm

[Algorithm 2: the overall outlier detection procedure over the deep web; step (17) invokes US (uncertainty sampling).]

At the beginning, a pilot sample is drawn from the entire population of the deep web data source by random sampling [

In this section, we evaluate the benefits of the stratification strategy over the query space, of neighborhood sampling, and of uncertainty sampling, respectively, and compare our proposed method with the baseline on three different datasets. (1) A synthetic dataset generated by MATLAB. (2) HTTP and SMTP, subsets of KDD CUP 1999 that can be downloaded from the UCI repository and are benchmark datasets for outlier detection. (3) A live experiment conducted on

Our evaluation has been performed over a combination of real and synthetic datasets.

Our synthetic dataset is generated by MATLAB. It contains 4100 data records: 4000 normal records and 100 outlier records. There are seven attributes (five categorical input attributes and two continuous output attributes). Four clusters exist on the two output attributes, generated by Gaussian distributions, and the output attributes are created to depend on the input attributes.
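A dataset of this shape can be reproduced in Python (the paper used MATLAB); this is a sketch under assumptions, with illustrative cluster centers, a single categorical input standing in for the five, and outliers scattered uniformly.

```python
import random

def make_synthetic(n_normal=4000, n_outliers=100, seed=7):
    """Generate records whose two continuous outputs depend on a
    categorical input (the cluster id), plus scattered outliers."""
    rng = random.Random(seed)
    centers = {0: (0, 0), 1: (10, 0), 2: (0, 10), 3: (10, 10)}
    data = []
    for _ in range(n_normal):
        c = rng.randrange(4)
        x = rng.gauss(centers[c][0], 1.0)
        y = rng.gauss(centers[c][1], 1.0)
        data.append({"cluster": c, "out": (x, y), "label": "normal"})
    for _ in range(n_outliers):
        data.append({"cluster": rng.randrange(4),
                     "out": (rng.uniform(-20, 30), rng.uniform(-20, 30)),
                     "label": "outlier"})
    return data

data = make_synthetic()
print(len(data))  # 4000 normal + 100 outlier records
```

Tying the output clusters to the categorical input is what lets the stratification step recover the structure from count queries alone.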

The two real datasets, referred to as HTTP and SMTP, are subsets of KDD CUP 1999 that can be downloaded from the UCI repository. The HTTP dataset contains 623091 data records, while the SMTP dataset contains 96554. After preprocessing, we randomly sample 8000 data records from each as our final experimental datasets. The original dataset contains 41 attributes, of which we retain only two basic attributes (i.e., “

We conduct live experiments over a subset of a real-world hidden database (

In our experiments, we set the significance level so that the corresponding critical value of the statistic is 1.96 for the overstratification check within the stratification step. We adopt three common criteria to evaluate all methods: precision, recall, and
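These criteria can be computed as follows; a standard sketch, with set-valued inputs as an illustrative interface.

```python
def prf(detected, true_outliers):
    """Precision, recall, and F-measure of a detected outlier set
    against the known ground-truth outliers."""
    tp = len(set(detected) & set(true_outliers))
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(true_outliers) if true_outliers else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

p, r, f = prf({1, 2, 3, 4}, {2, 3, 5})
print(p, r, f)
```

Neighborhood sampling mainly targets recall (finding more of the true outliers), while uncertainty sampling targets precision (rejecting false positives), so the F-measure summarizes their combined effect.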

The outlier records of the synthetic, HTTP, and SMTP datasets are known. To evaluate our methods on the live deep web database

For each dataset, we repeat the independent experiments 50 times and report the average results as our final results. The

Our proposed method, referred to as SNU, will be compared with the following methods:

SRS: we refer to the simple random sampling method as SRS. In contrast to our proposed algorithm, this method uses random sampling at every sampling step; it serves as the baseline.

SN: this method is the same as our proposed algorithm except that it replaces uncertainty sampling with random sampling.

SU: this method is the same as our proposed algorithm except that it replaces neighborhood sampling with random sampling.

In this subsection, we focus on evaluating the benefits from stratification and neighborhood sampling. For this purpose, we compare our method SNU with two other methods, SRS and SU.

Figures

Evaluation of stratification and neighborhood sampling (recall).

Synthetic

HTTP

SMTP

Live Yahoo

Figures

The remaining Figure

We now evaluate the benefits of uncertainty sampling in further identifying the uncertain data records. For this purpose, we compare our method with two others, SRS and SN.

Figure

Evaluation of uncertainty sampling (precision).

Synthetic

HTTP

SMTP

Live Yahoo

After evaluating the benefits of each component of our method SNU, we evaluate its overall performance in detecting outliers over the deep web. For this purpose, we compare our method with the baseline SRS on the four datasets. Figure

Evaluation of detection performance (

Synthetic

HTTP

SMTP

Live Yahoo

A large body of research exists on the deep web. However, most of it focuses on building interactive query systems or vertical search systems using data integration technologies [

Outlier detection using a sampling strategy has been described by various researchers [

Dasgupta et al. [

This paper presented a novel problem: outlier detection over the deep web. We proposed a solution that divides the detection procedure into three components: stratification over the deep web, neighborhood sampling, and uncertainty sampling. We developed a stratification scheme based on a hierarchical tree that models the relationship between the input and output attributes. Instead of random sampling across the strata, we developed a neighborhood sampling scheme for collecting more outliers, and we further developed an uncertainty sampling algorithm that verifies uncertain instances to improve detection precision. We evaluated our solution empirically on synthetic and real datasets; the experimental results show that it significantly enhances recall and

The authors declare that they have no competing interests.

This research was partially supported by the Natural Science Foundation of China (nos. 61440053, 61472268, and 41201338).