
In a data mining process, outlier detection aims to exploit the high marginality of these elements to identify them by measuring their degree of deviation from representative patterns, thereby yielding relevant knowledge. Whereas rough sets (RS) theory has been applied to the field of knowledge discovery in databases (KDD) since its formulation in the 1980s, in recent years outlier detection has been increasingly regarded as a KDD process with its own usefulness. The application of RS theory as a basis for characterising and detecting outliers is a novel approach with great theoretical relevance and practical applicability. However, algorithms whose spatial and temporal complexity allows their application to realistic scenarios, involving vast amounts of data and requiring very fast responses, are difficult to develop. This study presents a theoretical framework based on a generalisation of RS theory, termed the variable precision rough sets model (VPRS), which allows a stochastic approach to be established for assessing whether a given element is an outlier within a specific universe of data. An algorithm derived from quasi-linearisation is developed based on this theoretical framework, thus enabling its application to large volumes of data. The experiments conducted demonstrate the feasibility of the proposed algorithm, whose usefulness is contextualised by comparison with different algorithms analysed in the literature.

Outlier detection is an area of increasing relevance within the more general data mining process. Outliers may highlight extremely important findings in a wide range of applications: fraud detection, detection of illegal access to corporate networks, and detection of errors in input data, among others.

The rough sets basic model created by Pawlak [

An outlier detection method is proposed in [

The variable precision rough sets model (VPRS) [
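The VPRS generalisation can be illustrated with a short sketch of Ziarko's standard formulation (assumed here as an illustration rather than taken from this paper's algorithms): an equivalence class E enters the beta-lower approximation of a concept X when its relative misclassification error c(E, X) = 1 - |E ∩ X| / |E| does not exceed the precision threshold beta, with 0 <= beta < 0.5.

```python
# Sketch of Ziarko's VPRS approximations (standard formulation, for
# illustration only; not this paper's implementation).

def misclassification(E, X):
    """Relative misclassification error c(E, X) = 1 - |E ∩ X| / |E|."""
    E, X = set(E), set(X)
    return 1.0 - len(E & X) / len(E)

def beta_lower(partition, X, beta):
    """Union of classes admitted into the beta-lower approximation of X."""
    return set().union(*(set(E) for E in partition
                         if misclassification(E, X) <= beta))

def beta_upper(partition, X, beta):
    """Union of classes whose misclassification error stays below 1 - beta."""
    return set().union(*(set(E) for E in partition
                         if misclassification(E, X) < 1.0 - beta))
```

With beta = 0, both functions reduce to the Pawlak approximations, which is why VPRS is a strict generalisation of the basic model.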

Global view of the theoretical framework.

The Pawlak rough sets and VPRS algorithms solve the following problem: “to determine the set of outliers of a given universe of data from a preset exceptionality threshold (

In this paper, a new approach to the problem of outlier detection is proposed that addresses the limitations of the aforementioned results: the need to preset the thresholds and the difficulty of developing scalable algorithms independent of the context and nature of the problem. Therefore, the aim of this research may be summarised as follows: “to create a computationally viable method that calculates the outlier probability of each element from a given universe of data without the need to establish preconditions, that is, the determination of the thresholds (

The starting hypothesis is summarised as follows: “a new theory may be developed by extending the basic concepts and the formal tools provided by RS theory [

To develop the method proposed in the research objective as a solution (see Figure

General outline of the proposed solution.

Based on the above, the text below is divided into four sections. In Section

In essence, the entire proposal in this article is summarised in the following two phases:

In the first, it is determined for each element

In the second phase, taking into account the determined

To solve the problem, first, we expanded the theoretical framework defined in [

The

To determine the outlier region in relation to the set of values of

After concluding the analysis of the three proposed subproblems, a general criterion can be established from the sequence of sets that defines when an internal border is a subset of another.

^{c}: set of

_{j}: set of

Considering that for all _{j}_{j}

The next step is to perform a similar analysis to determine the set of outlier threshold values

To define the set of values of _{i}

To establish a new definition of outlier degree

To determine

Following this sequence, first, the set of _{i}

Let _{i}_{i}

As established by


With

The first two parts of Definition _{Zk(e)}(

Graphic view of the

As a function of Definitions

With

This definition does not contradict the proposition presented in [

The definitions above enable us to establish the following general method for determining the values of

To determine

To determine _{i}_{i}

To determine _{i}_{i}

For values of _{i}


Figure

Range of

In this section, the FIND_OUTLIER_REGION algorithm is developed. This algorithm enables the unsupervised calculation of the range of values of the thresholds

Calculation of the dependences between internal borders, or calculation of the inclusion relationship between them: BUILD_

Calculation of the outlier region in relation to the threshold

Integration of both regions to obtain, for each element of the universe, the regions of

1

2

3 S1[r][q] = {[0, 0.5)}

4 S3[r][q] = {[0, 0.5)}

5 S2[r] = {[0, 0.5)}

6

7 P_r = CLASSIFY-ELEMENTS(U, r)

8 class-max = 0

9 for each class in P_r

10 case1[r][class] = {[min(c(class, X), 1 - c(class, X)), 0.5)}

11 class-max = max(class-max, c(class, X), 1 - c(class, X))

12

13 q-min = min(c(class, X), 1 - c(class, X))

14

15 q-class = CLASSIFY-ELEMENT(U, q, e)

16 q-min = min(q-min, c(q-class, X), 1 - c(q-class, X))

17 case2[r][q][class] = {[0, q-min)}

18 S1[r][q] = S1[r][q] ∩ (case1[r][class] ∪ case2[r][q][class])

19 S2[r] = S2[r] ∩ {[class-max, 0.5)}

20

21

22

23

24 A = {}

25

26 A = A ∪ (

27 S[r] = {[0, 0.5)} - A - S2[r]

28
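The range sets manipulated above (S1[r][q], S2[r], S[r]) are unions of half-open intervals [a, b) combined with intersection and difference. A minimal sketch of these two operations, assuming intervals are represented as (lo, hi) pairs (the representation and names are illustrative, not taken from the paper), could look as follows:

```python
# Illustrative helpers: range sets as lists of half-open intervals [lo, hi),
# encoded as (lo, hi) tuples with lo < hi.

def intersect(A, B):
    """Intersection of two interval lists, dropping empty overlaps."""
    out = []
    for a1, a2 in A:
        for b1, b2 in B:
            lo, hi = max(a1, b1), min(a2, b2)
            if lo < hi:
                out.append((lo, hi))
    return out

def subtract(A, B):
    """Difference A - B: remove each interval of B from every interval of A."""
    for b1, b2 in B:
        remaining = []
        for a1, a2 in A:
            if a1 < b1:                      # piece to the left of [b1, b2)
                remaining.append((a1, min(a2, b1)))
            if b2 < a2:                      # piece to the right of [b1, b2)
                remaining.append((max(a1, b2), a2))
        A = remaining
    return A
```

With these helpers, a line such as S[r] = {[0, 0.5)} - A - S2[r] becomes subtract(subtract([(0.0, 0.5)], A), S2_r).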

All these algorithms contain the inputs

The output of the

1

2

3 class = CLASSIFY-ELEMENT(U, r, e)

4

5 M[e][r] = {[0, λ[e][r])}

6 h = 1.0

7 prev = 0.0

8

9 base = {}

10 ExcepDegree[e] = ExcepDegree[e] ∪ {[prev, inf) × [0, h]}

11 prev = inf

12

13

Finally, the output of the

1 S =

2 <M, ExcepDegree> =

3

4 D[e] = {}

5

6 D[e] = D[e] ∪ M[e][r] ∩ S[r]

7 OUTLIERS[e] = ExcepDegree[e] ∩ {D[e] × [0, 1]}

8
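The final step above clips the rectangles of ExcepDegree[e] against the product region D[e] × [0, 1]. A hedged sketch of that clipping, assuming each rectangle is stored as ((x1, x2), (y1, y2)) and D[e] as a list of (lo, hi) intervals (a representation assumed for illustration, not prescribed by the paper):

```python
# Illustrative sketch: intersect a union of axis-aligned rectangles with the
# region D x [0, 1], where D is a union of half-open intervals on the
# threshold axis. Only the x-extent needs clipping because the second factor
# of the product region is the full [0, 1] range.

def intersect_region(rects, intervals):
    """Clip ((x1, x2), (y1, y2)) rectangles against intervals x [0, 1]."""
    out = []
    for (x1, x2), (y1, y2) in rects:
        for lo, hi in intervals:
            nx1, nx2 = max(x1, lo), min(x2, hi)
            if nx1 < nx2:                 # keep only non-empty clippings
                out.append(((nx1, nx2), (y1, y2)))
    return out
```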

The temporal complexity of the algorithms depends on the number of ranges in the sets of specific ranges. Table

Calculation of the spatial and temporal complexity of the FIND_OUTLIER_REGION algorithm by calculating the complexities of each structure of each component algorithm.

Algorithm | Data structure | Spatial complexity ( | Temporal complexity (
---|---|---|---
BUILD_ | Case1[i][ec] | |
 | Case2[i][j][ec] | |
 | S1[i][j] | |
 | S2[i] | |
 | S3[i][j] | |
 | S[i] | |
BUILD_ | λ[e][i] | |
 | M[e][i] | |
 | ExcepDegree[e] | |
FIND_OUTLIER_REGION | D[e] | |
 | OUTLIERS[e] | |

The most original aspect of the

When executing the algorithm once for a given data universe, the specific outputs of the previous algorithms can be obtained for any value of (

In summary, the result from the execution of the algorithm contains any particular result that could be obtained by executing the Pawlak rough sets and VPRS algorithms. This is the main advantage of the algorithm, and it offsets the increase in temporal and spatial complexity incurred relative to an execution that calculates the regions of only a single element of the universe.

Nevertheless, despite the high order of temporal complexity identified in the

The

In the previous section, a theoretical framework was defined by expanding [

As mentioned above, the results from the previous section enable us to assess, for each

Considering

Then, the probability that we are interested in calculating,

Considering that

Because

We only have to replace the probability density functions of the parameters

Replacing these values in (

And because

This result may be interpreted as

This is precisely the quotient between the area of the favourable region (the region of values (
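Under the uniform-distribution assumption, this quotient can be computed by summing the areas of the disjoint rectangles that make up the favourable region and dividing by the area of the total parameter region. In the sketch below, the total region is taken to be [0, 0.5) × [0, 1], matching the threshold ranges used in the pseudocode; that bound, like the rectangle representation, is an assumption of this example:

```python
# Sketch of the outlier probability as an area quotient. The favourable
# region is a union of disjoint rectangles ((x1, x2), (y1, y2)); the total
# parameter region is assumed to be [0, 0.5) x [0, 1], i.e. area 0.5.

def outlier_probability(rects, total_area=0.5):
    """Sum of the rectangle areas divided by the total parameter area."""
    favourable = sum((x2 - x1) * (y2 - y1) for (x1, x2), (y1, y2) in rects)
    return favourable / total_area
```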

The

1 OUTLIERS =

X, R)

2

P[e] = 0

3

4 P[e] = P[e] + PDF(rect)

5

The temporal complexity of the

Cost of determining the outlier region: temporal complexity

Cost of determining the probability: (dataset) × (total number of rectangles region ^{2}) ➔ ^{2} × ^{2})

Therefore, the temporal complexity of the algorithm

The

The algorithm validation tests have primarily focused on two aspects: comparing its run-times to those of the VPRS algorithm to obtain a realistic reference and assessing the detection quality of the

The

Figure

Comparison of run-times between the

The curves show that both algorithms behave similarly in terms of run-time and that they are computationally efficient when analysing a large dataset with high dimensionality. Furthermore, the run-times are linear, with the added advantage that no preset thresholds are required.

This finding shows that although the order of temporal complexity for the BM_PROB algorithm is quadratic in the worst case, it may reach an almost linear order of temporal complexity when analysing datasets that are normally distributed.

Again, all experiments conducted yielded similar results; therefore, in this study, one of them is shown as a representative example. In this case, the dataset used was the Arrhythmia Data Set (data of patients with cardiovascular problems) from the UCI Machine Learning Data Repository [

The concept

R_1: was established from the attribute heart rate: mean number of heart beats per minute of each person. The equivalence relation partitions the dataset into two equivalence classes: [44, 61] and [62, 163]

R_2: was established from the attribute number of intrinsic deflections: number of arterial bypasses of each person. The equivalence relation partitions the dataset into two equivalence classes: [0, 59] and [60, 100]

R_3: was established from the attribute height: height of a person expressed in centimetres. The equivalence relation partitions the dataset into two equivalence classes: [60, 175] and [176, 190]
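Each of the three relations discretises one continuous attribute into two equivalence classes at a cut point. A small illustrative sketch of this partitioning (the records and field names below are toy examples, not rows from the Arrhythmia dataset):

```python
# Illustrative only: an equivalence relation defined by binning one
# continuous attribute at a cut point, yielding two equivalence classes.

def partition_by(records, attr, cut):
    """Partition records into two equivalence classes at the given cut."""
    low = [r for r in records if r[attr] <= cut]
    high = [r for r in records if r[attr] > cut]
    return [low, high]

patients = [
    {"id": 1, "heart_rate": 55},
    {"id": 2, "heart_rate": 80},
    {"id": 3, "heart_rate": 63},
]

# Mirrors R_1: heart rate split at 61 into the classes [44, 61] and [62, 163].
classes = partition_by(patients, "heart_rate", 61)
```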

Here, 12 outliers with contradictory values for low-weight people were intentionally injected into the dataset. The normal values of the attributes considered in the equivalence relations for low-weight people are as follows: heart rate >65, intrinsic deflections <50, and height <170 cm. Table

Outliers injected into the test dataset.

ID | Weight (kg) | Heart rate | # intrinsic deflections | Height (cm)
---|---|---|---|---
1 | 15 | 17 | |
2 | 31 | 93 | |
3 | 39 | 130 | |
4 | 10 | 16 | |
5 | 19 | | |
6 | 20 | | |
7 | 25 | | |
8 | 29 | | |
9 | 33 | 90 | |
10 | 40 | 20 | |
11 | 26 | | |
12 | 38 | 92 | |

In the test, the following

Number of injected outliers found between the

The number of most likely elements (

Table

Outlier probability of the 12 elements injected into the dataset.

ID | Weight (kg) | Heart rate | # of intrinsic deflections | Height (cm) | Outlier probability
---|---|---|---|---|---
1 | 15 | 17 | | | 0.61884
2 | 31 | 93 | | | 0.7557252
3 | 39 | 130 | | | 0.6151009
4 | 10 | 16 | | | 0.61884
5 | 19 | | | |
6 | 20 | | | |
7 | 25 | | | |
8 | 29 | | | |
9 | 33 | 90 | | | 0.7557252
10 | 40 | 20 | | | 0.61884
11 | 26 | | | |
12 | 38 | 92 | | | 0.7557252

Most of the outlier detection techniques and algorithms analysed are designed, to a greater or lesser extent, to solve a specific type of problem, or even a specific case. Valid comparisons between these algorithms are difficult to perform because they depend considerably on the search target. However, it is interesting to perform a comparative study of the different existing methods, highlighting the advantages of the current proposal in its field: the unsupervised provision of general results regarding all elements of the data universe by establishing only specific initial conditions, namely the concept and the equivalence relations. Considering the above, Table

Characteristics of the RS-based methods compared to the limitations of conventional methods.

(i) Applicability to datasets with a mixture of continuous and discrete attributes. Equivalence relationships are a natural way to discretise continuous data. |

(ii) Neither knowing the data distribution nor establishing data |

(iii) Specifically, for |

(iv) The dimensionality and dataset size do not limit the execution of the algorithms. |

(i) There is no need to establish data density criteria in the dataset. |

(ii) The dimensionality of the dataset does not limit the execution of the algorithm. |

(iii) No time-consuming calculations are necessary, including calculating the |

(iv) |

(v) |

(i) No time-consuming processes, such as the network training required by some neural network models to ensure their learning, must be established beforehand. |

(ii) The dimensionality of the dataset does not limit the execution of the algorithms. |

(iii) The functionality of the algorithms does not depend on data |

(iv) There is no need to model the data |

(v) Some approaches based on supervised networks establish the use of thresholds for various purposes in the |

(i) In contrast to most detection methods, which require successive executions of the algorithm until obtaining the set of outliers that actually meets the analysis criteria, |

The main advantage of RS-based proposals and, particularly, of the

After comparing algorithms based on conventional techniques and algorithms based on the RS model, a summary of the comparative study conducted between different RS algorithms and the proposed

Comparative table of RS-based algorithms.

Advantages | Disadvantages |
---|---|

(i) Shows the computational viability of the |
(i) DETERMINISTIC classification. |

(i) Shows the computational viability of the |
(i) The user must define the |

(i) Shows the computational viability of the |
(i) Temporal complexity: |

(i) Shows the computational viability of the method defined. |
(i) Temporal complexity: |

Whereas VPRS has been applied to problems in multiple fields [

The algorithms presented demonstrate the computational feasibility of the proposed methods. Furthermore, they provide efficient computational solutions—in terms of temporal and spatial complexity—to the problems for which they were conceived.

The method proposed solved, in addition, other limitations of several detection methods: it may be applied to datasets with a mixture of types of attributes (continuous and discrete); its application requires no prior knowledge about the data distribution; within the scope of its application, the size and dimensionality of the dataset do not limit its correct operation; and no distance or density criteria must be established for the dataset to apply this algorithm.

The results reported in the present study are the beginning of an in-depth study of the general problem of outlier detection based on the RS model. Therefore, several problems that have not yet been solved may be identified as the next objectives of this ongoing study. Accordingly, the following objectives have been identified: (a) to further improve the run-time of the algorithms by creating a distributed execution mechanism that uses the computational power of several machines in one domain, since in the current version of the algorithms the user has to execute them on a single personal computer (PC), and (b) in the current version of the

The main dataset used to support the findings of this study is public and can be accessed in the UCI Machine Learning Repository: Arrhythmia Data Set at URL

The authors declare that they have no conflicts of interest.

The authors received funding under Grant no. TIN2016-78103-C2-2-R.

This work has been supported by University of Alicante projects GRE14-02 and Smart University.