Journal of Applied Mathematics (JAM), Hindawi

ISSN 1687-0042, 1110-757X

DOI: 10.1155/2017/4323590 (Article ID 4323590)

Research Article

A Greedy Clustering Algorithm Based on Interval Pattern Concepts and the Problem of Optimal Box Positioning

Stepan A. Nersisyan (http://orcid.org/0000-0002-8830-4679), Vera V. Pankratieva, Vladimir M. Staroverov, and Vladimir E. Podolskii

Faculty of Mechanics and Mathematics, Lomonosov Moscow State University, Leninskie Gory 1, Moscow 119991, Russia (msu.ru)

Academic Editor: Dimitris Fotakis

Received 19 June 2017; Accepted 15 August 2017; Published 25 September 2017

This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

We consider a clustering approach based on interval pattern concepts. Exact algorithms developed within the framework of this approach are unable to produce a solution for high-dimensional data in a reasonable time, so we propose a fast greedy algorithm which solves the problem in a geometrical reformulation and shows a good rate of convergence and adequate accuracy for experimental high-dimensional data. In particular, the algorithm provided high-quality clustering of tactile frames registered by the Medical Tactile Endosurgical Complex.

Funding: Russian Science Foundation, Project 16-11-00058

1. Introduction

We consider the problem of clustering, that is, splitting a finite set $X \subset \mathbb{R}^d$ into disjoint subsets (called clusters) in such a way that points from the same cluster are similar (with respect to some criterion) and points from different clusters are dissimilar (see, e.g., [1]). It is convenient to present the input data in the form of a numerical context (table) whose rows correspond to objects and whose columns correspond to attributes of the objects.

Formal concept analysis (FCA) is a data analysis method based on applied lattice theory and order theory. The object-attribute binary relation is visualized with the use of the line diagram of the concept lattice. Within the framework of this theory a formal concept is defined as a pair (extent, intent) obeying a Galois connection (for exact definitions see the monograph [2] by Ganter and Wille).

There exist several generalizations of FCA to fuzzy and numerical contexts. One of them is the theory of pattern structures introduced by Ganter and Kuznetsov in [3]. An important particular case of pattern concepts, which are the key object in the theory of pattern structures, is that of interval pattern concepts with the operation of interval intersection. Interval pattern concepts allow one to apply cluster analysis to rows of formal numerical contexts. In this case two objects are considered similar if, for every attribute, the difference between their values lies in a given interval.

It can be easily seen that the problem of finding an interval pattern concept of maximum extent size (i.e., cardinality) can be reformulated as the problem of the optimal positioning of a d-dimensional box with given edge lengths for the given set X, that is, finding a position of the box that maximizes the number of points of the set X enclosed by the box (the details are given below in Section 2.2).

The existing algorithms that solve the problem of finding the optimal position of a box do not allow one to obtain an exact or at least approximate solution for high-dimensional data within a reasonable time (see a detailed survey in Section 2.2). The main goal of this paper is to propose a greedy algorithm which gives an approximate solution to this problem and a clustering algorithm based on the optimal positioning problem. We propose a clustering algorithm with

$$O\left(\left(dn\log n + \frac{d^{3}n^{1-1/d}}{s_{\min}}\,f(n,d)\right)\frac{n}{c_{\min}}\right) \tag{1}$$

worst-case time and $O(dn)$ space complexity, where $f(n,d)$ denotes the number of iterations of the main stage of the algorithm, and the parameters $s_{\min}$ and $c_{\min}$ regulate the duration of each iteration. A greater number of iterations and a longer duration of each iteration provide a better approximation.

The rest of the paper is organized as follows. In Section 2 we introduce the main definitions and formalize the statement of the problem. In Sections 3 and 4 we formulate our algorithms. In Sections 5 and 6 we describe the validation results and make some concluding remarks.

2. Main Definitions and Statement of the Problem

In this section we start with the main definitions from the theory of formal concepts and then present a geometrical reformulation of the problem of finding the interval pattern concept of maximum extent size (we call it simply the maximum interval pattern concept).

2.1. Formal Concepts

Let us recall the main definitions which we need to formalize our clustering method based on interval pattern concepts. Additional details can be found in [2, 3].

Definition 1.

An upper (lower) semilattice is a partially ordered set $(M, \le)$ such that for any elements $x, y \in M$ there exists a unique least upper bound (greatest lower bound, resp.).

Definition 2.

A semilattice operation on the set $M$ is a binary operation $\sqcap\colon M \times M \to M$ that has the following properties for a certain $e \in M$ and any elements $x, y, z \in M$:

(i) $x \sqcap x = x$ (idempotency).

(ii) $x \sqcap y = y \sqcap x$ (commutativity).

(iii) $(x \sqcap y) \sqcap z = x \sqcap (y \sqcap z)$ (associativity).

(iv) $e \sqcap x = e$.

Definition 3.

A lattice is an ordered set $(L, \le)$ which is at the same time an upper and a lower semilattice.

Definition 4.

Let $(P, \le_P)$ and $(Q, \le_Q)$ be partially ordered sets. A Galois connection between these sets is a pair of maps $\varphi\colon P \to Q$ and $\psi\colon Q \to P$ (each of them is referred to as a Galois operator) such that the following relations hold for any $p_1, p_2 \in P$ and $q_1, q_2 \in Q$:

(i) $p_1 \le_P p_2 \Rightarrow \varphi(p_1) \ge_Q \varphi(p_2)$ (anti-isotone property).

(ii) $q_1 \le_Q q_2 \Rightarrow \psi(q_1) \ge_P \psi(q_2)$ (anti-isotone property).

(iii) $p_1 \le_P \psi(\varphi(p_1))$ and $q_1 \le_Q \varphi(\psi(q_1))$ (isotone property).

Applying the Galois operators twice, namely, $\psi(\varphi(p))$ and $\varphi(\psi(q))$, defines a closure operator.

Definition 5.

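A closure operator $\overline{\,\cdot\,}$ on $M$ is a map that assigns a closure $\overline{X} \subseteq M$ to each subset $X \subseteq M$ under the following conditions:

(i) $X \subseteq Y \Rightarrow \overline{X} \subseteq \overline{Y}$ (monotony).

(ii) $X \subseteq \overline{X}$ (extensity).

(iii) $\overline{\overline{X}} = \overline{X}$ (idempotency).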

Definition 6.

A pattern structure is a triple $(G, (D, \sqcap), \delta)$, where $G$ is a set of objects, $(D, \sqcap)$ is a meet-semilattice of potential object descriptions, and $\delta\colon G \to D$ is a function that associates descriptions with objects.

The Galois connection between the subsets of the set of objects and the set of descriptions for the pattern structure $(G, (D, \sqcap), \delta)$ is defined as follows:

$$A^{\square} := \bigsqcap_{g \in A} \delta(g), \ \text{where } A \subseteq G; \qquad d^{\square} := \{g \in G \mid d \sqsubseteq \delta(g)\}, \ \text{where } d \in D. \tag{2}$$

Definition 7.

A pattern concept of the pattern structure $(G, (D, \sqcap), \delta)$ is a pair $(A, d)$, where $A \subseteq G$ is a subset of the set of objects and $d \in D$ is one of the descriptions in the semilattice, such that $A^{\square} = d$ and $d^{\square} = A$; $A$ is called the pattern extent of the concept and $d$ is the pattern intent.

A particular case of a pattern concept is the interval pattern concept. The set $D$ consists of the rows of a numerical context, which are treated as tuples of intervals of zero length. An interval pattern concept is a pair $(A, d)$, where $A$ is a subset of the set of objects and $d$ is a tuple of intervals with ends determined by the smallest and the largest values of the corresponding component in the descriptions of all objects in $A$.
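The two operators of the Galois connection can be illustrated for interval patterns with a short sketch. It uses the grades of Table 1 as the numerical context; the function names `description`, `extent`, and `closure` are illustrative and do not come from the paper.

```python
from typing import Dict, List, Sequence, Tuple

Interval = Tuple[float, float]

# Numerical context of Table 1: (Arts, Mathematics, Computer science, Sports).
CONTEXT: Dict[str, Tuple[int, ...]] = {
    "A": (9, 9, 10, 9),
    "B": (8, 2, 6, 5),
    "C": (6, 5, 10, 7),
    "D": (8, 9, 9, 6),
    "E": (8, 4, 6, 9),
    "F": (6, 5, 2, 10),
}


def description(objects: Sequence[str]) -> List[Interval]:
    """Map a set of objects A to its description: componentwise [min, max]."""
    rows = [CONTEXT[g] for g in objects]
    return [(min(col), max(col)) for col in zip(*rows)]


def extent(intervals: Sequence[Interval]) -> List[str]:
    """Map a tuple of intervals d to all objects whose row lies inside it."""
    return [
        g
        for g, row in CONTEXT.items()
        if all(lo <= v <= hi for v, (lo, hi) in zip(row, intervals))
    ]


def closure(objects: Sequence[str]) -> List[str]:
    """Apply both operators in turn; a set is an extent iff it equals its closure."""
    return extent(description(objects))


print(description(["A", "D"]))  # intervals spanned by pupils A and D
print(closure(["B", "F"]))      # pupil E also fits the intervals spanned by B and F
```

In this context the set {B, F} is not a pattern extent, because its closure also contains E, whereas {A, D} is closed.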

Interval pattern concepts are convenient to use in the analysis of numerical contexts, when there is a need to divide all data into clusters that comprise objects in which the numerical data is similarly “distributed” in the rows.

For each component of an interval pattern concept we introduce the width σ: the difference between the largest and the smallest values of the component. Then a clustering procedure can be defined using a standard greedy approach. Specifically, at each step the maximum interval pattern concept is identified, that is, an interval pattern concept with the maximum number of objects, whose width with respect to each component does not exceed a predefined σ. The objects of the identified interval pattern concept are combined into a cluster and excluded from the set of objects analyzed at subsequent steps.

In Example 1 presented in Table 1 the objects are pupils and the numerical data of the context consist of the grades they got at exams in various disciplines.

Table 1

A fuzzy formal context, where the objects are pupils and the attributes are disciplines.

	Arts	Mathematics	Computer science	Sports
A	9	9	10	9
B	8	2	6	5
C	6	5	10	7
D	8	9	9	6
E	8	4	6	9
F	6	5	2	10

We need to divide the set of pupils into clusters in such a way that the grades of pupils in the same cluster differ by at most 1 for each of the disciplines. Such a setting corresponds to σ=1; in this case we obtain 6 clusters (interval pattern concepts whose width is not greater than 1), each containing one pupil. In the case σ=2 we arrive at the same 6 clusters.

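When $\sigma = 3$ we have five clusters: $\{A, D\}$ with $\{A, D\}^{\square} = \langle [8,9], [9,9], [9,10], [6,9] \rangle$ and the singletons $\{B\}$, $\{C\}$, $\{E\}$, $\{F\}$. In the case $\sigma = 4$ we obtain three clusters: $\{A, C, D\}^{\square} = \langle [6,9], [5,9], [9,10], [6,9] \rangle$, $\{B, E\}^{\square} = \langle [8,8], [2,4], [6,6], [5,9] \rangle$, and $\{F\}$.

Example 2. In the conditions of the previous example let us set $\sigma_1 = 1$, $\sigma_2 = 1$, $\sigma_3 = 10$, $\sigma_4 = 3$. Then the set of pupils can be divided into four clusters $\{A, D\}$, $\{C, F\}$, $\{B\}$, $\{E\}$:

$$\begin{aligned} \{A, D\}^{\square} &= \langle [8,9], [9,9], [9,10], [6,9] \rangle, \\ \{C, F\}^{\square} &= \langle [6,6], [5,5], [2,10], [7,10] \rangle, \\ \{B\}^{\square} &= \langle [8,8], [2,2], [6,6], [5,5] \rangle, \\ \{E\}^{\square} &= \langle [8,8], [4,4], [6,6], [9,9] \rangle. \end{aligned} \tag{3}$$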
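The clusterings of Examples 1 and 2 can be reproduced with a minimal sketch of the greedy procedure. It finds the maximum interval pattern concept by exhaustive search over subsets, which is feasible only for a context as small as Table 1 and is not the algorithm proposed in this paper. Breaking ties by lexicographic order is an assumption of the sketch, and the function names are illustrative.

```python
from itertools import combinations

# Numerical context of Table 1: (Arts, Mathematics, Computer science, Sports).
GRADES = {
    "A": (9, 9, 10, 9),
    "B": (8, 2, 6, 5),
    "C": (6, 5, 10, 7),
    "D": (8, 9, 9, 6),
    "E": (8, 4, 6, 9),
    "F": (6, 5, 2, 10),
}


def admissible(names, sigma):
    """True if the width of every component does not exceed its sigma."""
    rows = [GRADES[g] for g in names]
    return all(max(col) - min(col) <= s for col, s in zip(zip(*rows), sigma))


def largest_concept(names, sigma):
    """Largest admissible subset; the first one in lexicographic order wins ties."""
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            if admissible(subset, sigma):
                return list(subset)
    return []


def greedy_clusters(sigma):
    """Repeatedly extract the largest admissible subset from the remaining objects."""
    remaining = sorted(GRADES)
    clusters = []
    while remaining:
        cluster = largest_concept(remaining, sigma)
        clusters.append(cluster)
        remaining = [g for g in remaining if g not in cluster]
    return clusters


print(greedy_clusters((3, 3, 3, 3)))   # Example 1 with sigma = 3
print(greedy_clusters((4, 4, 4, 4)))   # Example 1 with sigma = 4
print(greedy_clusters((1, 1, 10, 3)))  # Example 2
```

For $\sigma = 4$ the subset $\{B, C, E\}$ is also admissible and has three elements, so the result depends on how ties are broken; the lexicographic rule used here yields the clusters listed in Example 1.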

Clustering methods based on interval pattern concepts find applications in the analysis of experimental data. For instance, applications of such methods to gene expression analysis were discussed in [4, 5].

2.2. Geometry

Let $P$ be a set of $n$ points in $\mathbb{R}^d$ ($d \in \mathbb{N}$) and let $\delta_1, \delta_2, \dots, \delta_d$ be positive real numbers.

Definition 8.

A $d$-orthotope (also called a box) with center $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and edge lengths $\delta_1, \delta_2, \dots, \delta_d$ is the Cartesian product of the intervals

$$\left[x_1 - \frac{\delta_1}{2},\, x_1 + \frac{\delta_1}{2}\right] \times \dots \times \left[x_d - \frac{\delta_d}{2},\, x_d + \frac{\delta_d}{2}\right]. \tag{4}$$

It can be easily seen that the problem of identifying the maximum interval pattern concept can be reformulated in terms of finding the optimal position of the box with the edge lengths $\delta_1, \delta_2, \dots, \delta_d$, that is, the position maximizing the number of points of the set $P$ enclosed by the box. This formulation can be generalized to the problem of finding the optimal position of a ball in an arbitrary metric space, since any box can be treated as a ball (of radius $1/2$) in the stretched $L_\infty$ metric in which the distance $\rho(x, y)$ between the points $x = (x_1, \dots, x_d)$ and $y = (y_1, \dots, y_d)$ is defined as

$$\rho(x, y) = \max_{1 \le i \le d} \delta_i^{-1} |x_i - y_i|. \tag{5}$$
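The equivalence between the box of Definition 8 and a ball of radius 1/2 in the stretched metric can be checked with a small sketch. The function names and the sample points are illustrative assumptions.

```python
def rho(x, y, delta):
    """Stretched L-infinity distance: max over i of |x_i - y_i| / delta_i."""
    return max(abs(xi - yi) / di for xi, yi, di in zip(x, y, delta))


def in_box(p, center, delta):
    """Membership in the closed box with the given center and edge lengths."""
    return all(c - d / 2 <= v <= c + d / 2 for v, c, d in zip(p, center, delta))


def count_enclosed(points, center, delta):
    """Number of points within distance 1/2 of the center, i.e. inside the box."""
    return sum(rho(p, center, delta) <= 0.5 for p in points)


points = [(0.0, 0.0), (1.0, 2.0), (3.0, 0.5), (1.9, -0.9)]
center, delta = (1.0, 0.0), (2.0, 2.0)
print(count_enclosed(points, center, delta))
```

Away from the box boundary the two tests agree exactly; for points lying on the boundary, floating-point rounding may make them differ.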

The problem of optimal positioning has been well studied for $d = 2$: some lower and sharp upper bounds on complexity are known (see, e.g., [6, 7]). However, to the best of our knowledge, for the case of an arbitrary dimension $d$ no lower bounds and no efficient exact algorithms are available so far. de Figueiredo and da Fonseca noted [8] that the problem can be solved exactly in roughly $O(n^{d+1/d})$ time by projecting the points onto a $(d+1)$-dimensional paraboloid and using half-space range searching data structures [9]. In the same paper, for the case of weighted points under certain additional restrictions, they also obtained a lower bound $\Omega(n^d)$ for exact algorithms and indicated that existing algorithms for the unweighted version of the problem do not beat this lower bound in the worst case. Eckstein et al. showed that a generalization of the problem of optimal positioning whose input also includes a set of prohibited points is NP-hard [10].

Known approximate algorithms for optimal positioning also have time complexity which depends on $d$ exponentially. For example, de Figueiredo and da Fonseca suggested an approximate algorithm [8] which solves the problem in worst-case time $O(3^d n / \varepsilon^{d-1})$, where $0 < \varepsilon < 1$ is a given approximation parameter. Due to the exponential dependence on $d$ these approximate algorithms are also practically inapplicable in the case of high dimension, and there is a need to develop an algorithm which can produce an approximate solution in reasonable time.

3. A Greedy Algorithm for Finding an Approximately Optimal Position of a Box

In this section we present a greedy algorithm for finding an approximately optimal position of a box with edge lengths $\delta_1, \delta_2, \dots, \delta_d$ for a set $P = \{p_i\}_{i=1}^{n} \subset \mathbb{R}^d$ (the order in which points are listed in $P$ is insignificant). This algorithm is auxiliary for the clustering method described in Section 4.

The algorithm has several input parameters: positive real numbers $s$, $s_{\min}$, and $\lambda < 1$, and a function $f\colon \mathbb{N} \times \mathbb{N} \to \mathbb{N}$. The parameters $s$, $s_{\min}$, and $\lambda$ regulate the duration of one iteration. The function $f$ takes the values $n$ and $d$ as inputs and returns the number of iterations at the main stage of the algorithm. A greater number of iterations and a longer duration of each iteration provide a better approximation.

The algorithm includes two basic stages: the preprocessing stage and the main stage.

3.1. Preprocessing

(1) At the first step of the preprocessing stage the box with the edge lengths $\delta_1, \delta_2, \dots, \delta_d$ is transformed into the unit cube (we call it simply the cube) by dividing the $i$th coordinate of each point by $\delta_i$, $i = 1, \dots, d$. This step can be performed in $O(dn)$ operations.

(2) We consider the integer lattice with edges of length 1, compute the number of points of $P$ in each cell, and denote the cell that contains the maximum number of points by $C_0$. The cell $C_0$ is called the base cube. Let $y_0 \in \mathbb{R}^d$ denote the center of $C_0$. This step requires $O(dn)$ operations as well.

(3) At the final step of the preprocessing stage we build a k-d tree data structure (which is used at the main stage to organize the fast range search) in $O(dn \log n)$ operations with the space complexity of $O(dn)$ (see [11, 12]).
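Steps (1) and (2) of the preprocessing stage can be sketched as follows. The function name `preprocess` and its return values are illustrative assumptions; the k-d tree of step (3) is omitted, and ties between equally populated cells are resolved in favor of the cell encountered first.

```python
from collections import Counter
from math import floor


def preprocess(points, delta):
    """Scale the box to the unit cube and locate the densest integer lattice cell."""
    # Step (1): divide the i-th coordinate of every point by delta_i.
    scaled = [tuple(v / d for v, d in zip(p, delta)) for p in points]
    # Step (2): count points per lattice cell; a cell is indexed by its lower corner.
    cells = Counter(tuple(floor(v) for v in p) for p in scaled)
    base_cell, count = max(cells.items(), key=lambda item: item[1])
    base_center = tuple(c + 0.5 for c in base_cell)
    return scaled, base_cell, base_center, count


points = [(0.2, 0.2), (0.8, 1.5), (1.0, 1.0), (4.4, 4.4), (4.6, 5.0)]
scaled, base_cell, base_center, count = preprocess(points, delta=(2.0, 2.0))
print(base_cell, base_center, count)
```

Both steps make a single pass over the $n$ points and touch $d$ coordinates per point, in line with the $O(dn)$ estimates above.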

3.2. The Main Stage

Let $q\colon 2^{\mathbb{R}^d} \to \mathbb{Z}_+$ denote the function which counts the number of points of the set $P$ in an arbitrary subset of $\mathbb{R}^d$. The main idea of our algorithm consists in constructing a finite sequence of cubes that starts from a random point $y$ in the base cube and satisfies the condition that the next cube contains more points than the previous one. Let $D_1^y, \dots, D_{k_y}^y$ denote these cubes with centers $z_1^y, \dots, z_{k_y}^y$, respectively. In our notation we have $z_1^y = y$ and $q(D_i^y) < q(D_{i+1}^y)$ for all $i \in \{1, \dots, k_y - 1\}$. After $f(n,d)$ iterations the algorithm returns a locally optimal cube $C$.

Definition 9.

The $t$-neighborhood of a cube $D$ with center $x = (x_1, \dots, x_d)$ is the set consisting of all cubes with centers at points of the form $(x_1, \dots, x_{i-1}, x_i \pm t, x_{i+1}, \dots, x_d)$ for all $i \in \{1, \dots, d\}$, that is, all cubes obtained through translation of $D$ along one of the axes by the distance $\pm t$.

Now we describe the procedure of constructing the sequence of cubes. Let $y$ be an arbitrary point in the base cube $C_0$, let $z_1^y = y$, let $D_1^y$ be the cube with center at $z_1^y$, and let $s_1 = s$. In order to get a definite estimate on the precision of the algorithm (see Theorem 11) we initialize the first iteration deterministically by taking the center of $C_0$ as $y$. Other iterations are initialized randomly.

Suppose that the cubes $D_1^y, \dots, D_m^y$ with centers $z_1^y, \dots, z_m^y$, respectively, and the numbers $s_1, \dots, s_m$ have already been constructed. There are two possible cases.

(1) If there exists a cube $D$ in the $s_m$-neighborhood of $D_m^y$ such that $q(D) > q(D_m^y)$, then we set $D_{m+1}^y = D$, take the center of $D$ as $z_{m+1}^y$, and take $s_{m+1} = s_m$. In other words, if there exists a cube in the $s_m$-neighborhood of the current cube which contains more points of $P$, then we move the current cube to this position.

(2) If there are no such cubes (i.e., all cubes in the $s_m$-neighborhood of the current cube contain at most the same number of points), then we set $D_{m+1}^y = D_m^y$, $z_{m+1}^y = z_m^y$, and $s_{m+1} = \lambda s_m$ (i.e., decrease the current step size). If $s_{m+1} < s_{\min}$ (the step size threshold is reached), then the procedure ends and $D_m^y$ is returned as its result.

In order to obtain acceptable time complexity we impose additional restrictions on the selection of the next cube. These assumptions are necessary to avoid the situation where the length of the sequence grows exponentially with d. Validation on experimental data confirmed that these restrictions do not essentially affect the clustering results.

Restriction 1. All cubes in the sequence must have common points with the base cube C0.

In Figure 1 we present an example of a set P for which this requirement causes a significant difference between the exact solution and the solution produced by the algorithm. However, this difference is essentially reduced at further steps of the clustering algorithm as generally it affects only the order in which clusters are constructed.

Figure 1

The base cube is colored red; the global optimum is blue. There is no way to move from the red cube to the blue one without losing touch with the base cube.

Restriction 2. For each individual coordinate it is not allowed to translate the cube in the opposite directions at different steps of the procedure described above.
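The main stage with Restrictions 1 and 2 can be sketched as follows, for points already scaled to the unit cube. Several details are assumptions of the sketch rather than statements of the paper: the function names, the brute-force count used in place of the k-d tree range search, moving to the best improving neighbor when several exist, and checking Restriction 1 through the distance between cube centers.

```python
import random


def count_in_cube(points, center):
    """q(D): number of points in the closed unit cube with the given center."""
    return sum(all(abs(v - c) <= 0.5 for v, c in zip(p, center)) for p in points)


def climb(points, start, base_center, s, s_min, lam):
    """Build one sequence of cubes; return the final center and its point count."""
    d = len(start)
    z = list(start)
    best = count_in_cube(points, z)
    direction = [0] * d  # Restriction 2: sign of the moves made along each axis
    step = s
    while step >= s_min:
        candidate = None
        for i in range(d):
            for sign in (1, -1):
                if direction[i] not in (0, sign):
                    continue  # Restriction 2: no moves back along this axis
                moved = z[i] + sign * step
                if abs(moved - base_center[i]) > 1.0:
                    continue  # Restriction 1: the cube must touch the base cube
                trial = z[:i] + [moved] + z[i + 1:]
                value = count_in_cube(points, trial)
                if value > best and (candidate is None or value > candidate[0]):
                    candidate = (value, i, sign, trial)
        if candidate is None:
            step *= lam  # case (2): no improvement, decrease the step
        else:
            best, axis, sign, z = candidate  # case (1): move the cube
            direction[axis] = sign
    return tuple(z), best


def best_cube(points, base_center, iterations, s, s_min, lam, seed=0):
    """Run `iterations` sequences: the first from the center, the rest random."""
    rng = random.Random(seed)
    best = climb(points, base_center, base_center, s, s_min, lam)
    for _ in range(iterations - 1):
        start = [c + rng.uniform(-0.5, 0.5) for c in base_center]
        result = climb(points, start, base_center, s, s_min, lam)
        if result[1] > best[1]:
            best = result
    return best


points = [(0.2, 0.5), (0.8, 0.5), (1.1, 0.5), (1.3, 0.5)]
print(climb(points, (0.5, 0.5), (0.5, 0.5), s=0.5, s_min=0.1, lam=0.5))
```

In this toy set the base cube holds two points, and a single move of length 0.5 along the first axis reaches a cube with three points, which is the maximum possible because the four points span more than one unit.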

The above restrictions lead to the following lemma.

Lemma 10.

The main stage of the algorithm has

$$O\left(\frac{d^{3} n^{1-1/d}}{s_{\min}}\, f(n,d)\right) \tag{6}$$

worst-case time complexity.

Proof.

First we get an upper estimate for the length $k_y$ of the sequence of cubes (for an arbitrary $y \in C_0$). Due to Restrictions 1 and 2 we have

$$k_y \le O\left(\frac{d}{s_{\min}}\right). \tag{7}$$

Thus, Restrictions 1 and 2 avoid the situation where the length of the sequence grows exponentially with $d$. Each step of the procedure of constructing the sequence of cubes requires $2d$ evaluations of the function $q$ for the cubes (i.e., $2d$ range searches). With the use of a k-d tree the range search can be performed with $O(d n^{1-1/d})$ worst-case time complexity (see [13]). The procedure of constructing the sequence of cubes is repeated in each of the $f(n,d)$ iterations, so the above complexity bound holds.
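The bound of Lemma 10 can be retraced by multiplying the four factors named in the proof; the labels under the braces are explanatory annotations.

```latex
\[
\underbrace{f(n,d)}_{\text{iterations}}
\cdot
\underbrace{O\!\left(\frac{d}{s_{\min}}\right)}_{\text{cubes per sequence}}
\cdot
\underbrace{2d}_{\text{range searches per cube}}
\cdot
\underbrace{O\!\left(d\,n^{1-1/d}\right)}_{\text{cost of one range search}}
= O\!\left(\frac{d^{3}\,n^{1-1/d}}{s_{\min}}\,f(n,d)\right).
\]
```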

Note that we also have a trivial estimate $k_y \le n$, as $q(D_m^y)$ grows, and hence $k_y \le q(D_{k_y}^y) \le n$. Thus, without the imposed restrictions the worst-case complexity estimate

$$O\left(d^{2} n^{2-1/d} f(n,d)\right) \tag{8}$$

holds, and hence the restrictions can be omitted without loss of practical feasibility if the number of objects $n$ is of the same order as the dimensionality $d$.

3.3. Precision and Complexity of the Algorithm

Theorem 11.

Let $y_0$ be the center of $C_0$, let $D_{\mathrm{alg}} = D_{k_{y_0}}^{y_0}$ be the cube produced by the algorithm iteration which was initialized with $y_0$ (so that for this iteration $D_1^{y_0} = C_0$), and let $D_{\mathrm{opt}}$ be an optimal cube (i.e., $D_{\mathrm{opt}} \in \arg\max q(D)$, where the maximum is taken over all unit cubes in $\mathbb{R}^d$). Then

$$\frac{1}{2^{d}} \le \frac{q(D_{\mathrm{alg}})}{q(D_{\mathrm{opt}})} \le 1, \tag{9}$$

and this estimate is sharp.

Proof.

The upper estimate is trivial. The lower estimate follows from the fact that $D_{\mathrm{opt}}$ is covered by at most $2^d$ cells of the integer lattice with edges of length 1, each of which contains at most $q(C_0)$ points, and hence

$$q(D_{\mathrm{opt}}) \le 2^{d} q(C_0) \le 2^{d} q(D_{\mathrm{alg}}). \tag{10}$$

An example showing that the estimate is sharp is similar to the example from Figure 1. For instance, we can locate the center of $D_{\mathrm{opt}}$ at a node of the integer lattice and put $2^d$ points in $D_{\mathrm{opt}}$ in such a way that each cell of the integer lattice contains at most one of these points. Then we select an arbitrary cell of the integer lattice that is distant from $D_{\mathrm{opt}}$ and put one point into this cell, which becomes $C_0$.
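The sharpness example can be checked numerically with a small sketch. The coordinates of the points (offsets of 0.25 around the origin and a distant point at 10.5) and the function names are illustrative assumptions.

```python
from collections import Counter
from itertools import product
from math import floor


def sharpness_example(d, eps=0.25):
    """2^d points around the lattice node at the origin plus one distant point."""
    near = [tuple(eps * sign for sign in signs) for signs in product((-1, 1), repeat=d)]
    far = (10.5,) * d
    return near + [far]


def cell_counts(points):
    """Number of points in each cell of the integer lattice."""
    return Counter(tuple(floor(v) for v in p) for p in points)


def count_in_cube(points, center):
    """Number of points in the closed unit cube with the given center."""
    return sum(all(abs(v - c) <= 0.5 for v, c in zip(p, center)) for p in points)


d = 3
points = sharpness_example(d)
densest = max(cell_counts(points).values())        # every lattice cell holds one point
optimal = count_in_cube(points, (0.0,) * d)        # cube centered at the lattice node
print(densest, optimal, densest / optimal == 1 / 2 ** d)
```

If the distant cell is taken as the base cube, no cube that touches it contains more than one point, so the ratio in (9) equals $1/2^d$ for this configuration.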

Theorem 12.

The algorithm for finding an approximately optimal position of the box has

$$O\left(dn\log n + \frac{d^{3} n^{1-1/d}}{s_{\min}}\, f(n,d)\right) \tag{11}$$

worst-case time complexity and $O(dn)$ space complexity.

Proof.

Combining the estimates for the time and space complexity of the preprocessing stage and the main stage of the algorithm gives the bounds mentioned above.

Note that omitting Restrictions 1 and 2 results in the worst-case time complexity estimate

$$O\left(d^{2} n^{2-1/d} f(n,d)\right). \tag{12}$$

4. Clustering Algorithm

Now let us consider the clustering problem, that is, the problem of splitting the given set $P = \{p_i\}_{i=1}^{n} \subset \mathbb{R}^d$ into mutually disjoint subsets $C_1, \dots, C_k$. Following the interval pattern concept approach, we construct clusters with controlled interval pattern concept width. We propose a clustering algorithm based on the greedy approach and the procedure for finding an approximately optimal position of a box described in Section 3. The algorithm is not sensitive to the order in which the points of $P$ are given. The parameters of the algorithm include positive real numbers $\delta_1, \delta_2, \dots, \delta_d$ and all parameters of the positioning algorithm, namely, $s$, $s_{\min}$, $\lambda$, and $f(n,d)$.

First, we put $P_1 = P$ and find an approximately optimal position $D_1$ of the box with the edge lengths $\delta_1, \dots, \delta_d$ for the set $P_1$. Now suppose that the sets $D_1, \dots, D_i$ and $P_1, \dots, P_i$ have already been constructed and let $P_{i+1} = P_i \setminus D_i$. If $P_{i+1} = \varnothing$, then the procedure ends. Otherwise we find an approximately optimal position $D_{i+1}$ of the box for the set $P_{i+1}$. The output of this procedure is the set of clusters $C_i = P_i \cap D_i$.

In order to avoid producing a lot of small clusters consisting of outliers we impose one more restriction.

Restriction 3. The resulting clusters must include at least cmin objects.

With this restriction, if the size of $P_{i+1} \cap D_{i+1}$ is less than $c_{\min}$, then the procedure ends (and the points belonging to $P \setminus (C_1 \cup \dots \cup C_i)$ are considered unclustered and referred to as outliers).
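The clustering loop with Restriction 3 can be sketched as follows. To keep the example self-contained, the positioning step is replaced by a stand-in that returns the densest lattice cell (the base cube of Section 3.1) without the greedy refinement of Section 3.2; this substitution and all function names are assumptions of the sketch.

```python
from collections import Counter
from math import floor


def densest_cell_box(points, delta):
    """Stand-in positioning step: the densest cell of the lattice with cell sizes delta."""
    cells = Counter(tuple(floor(v / d) for v, d in zip(p, delta)) for p in points)
    cell, _ = max(cells.items(), key=lambda item: item[1])
    lower = tuple(c * d for c, d in zip(cell, delta))
    upper = tuple((c + 1) * d for c, d in zip(cell, delta))
    return lower, upper


def inside(p, box):
    """Membership of a point in the closed box given by its lower and upper corners."""
    lower, upper = box
    return all(lo <= v <= hi for v, lo, hi in zip(p, lower, upper))


def cluster(points, delta, c_min, find_box=densest_cell_box):
    """Peel off one box at a time; stop when a box holds fewer than c_min points."""
    remaining = list(points)
    clusters = []
    while remaining:
        box = find_box(remaining, delta)
        members = [p for p in remaining if inside(p, box)]
        if len(members) < max(c_min, 1):
            break  # Restriction 3: the rest are treated as outliers
        clusters.append(members)
        remaining = [p for p in remaining if not inside(p, box)]
    return clusters, remaining


points = [(0.1, 0.2), (0.3, 0.4), (0.5, 0.9), (5.1, 5.2), (5.3, 5.4), (9.5, 9.5)]
clusters, outliers = cluster(points, delta=(1.0, 1.0), c_min=2)
print([len(c) for c in clusters], outliers)
```

Any other positioning routine with the same interface can be passed as `find_box`, for instance one that refines the base cube by the greedy search of Section 3.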

Restriction 3 together with Theorem 12 immediately leads to the following theorem.

Theorem 13.

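The clustering algorithm has

$$O\left(\left(dn\log n + \frac{d^{3} n^{1-1/d}}{s_{\min}}\, f(n,d)\right) \cdot \frac{n}{c_{\min}}\right) \tag{13}$$

worst-case time complexity and $O(dn)$ space complexity.

If Restrictions 1–3 are omitted, the worst-case time complexity estimate is

$$O\left(d^{2} n^{3-1/d} f(n,d)\right). \tag{14}$$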

5. Validation

Validation of the clustering algorithm developed in this study was performed on a dataset of tactile images registered by the Medical Tactile Endosurgical Complex (MTEC) during examination of artificial samples. MTEC allows intraoperative mechanoreceptor tactile examination of tissues and is already used in endoscopic surgery [14–16]. As methods for automated analysis of medical tactile images are still insufficient, validation results in particular and the developed clustering algorithm in general provide new opportunities for the medical domain applications.

The key component of MTEC is a tactile mechanoreceptor [17, Fig. 1]. Its operating head is equipped with 19 pressure sensors that perform synchronous measurements 100 times per second. Each measurement result (called “a tactile frame” and consisting of 19 values) is wirelessly transmitted to a computer that performs preprocessing and visualization. The sensors are located at the operating head surface which is a circle with diameter 20 mm.

In order to create a dataset of tactile images we utilized MTEC for tactile examinations of three types of artificial samples. The samples were similar to the L-samples utilized in the study [17]—they were made using a soft silicone (Ecoflex 00-10, Shore hardness 00-10A) according to manufacturer’s instructions and had a shape of a rectangular block with length, width, and height of 40 mm, 35 mm, and 11 mm, respectively. The difference was in sizes and shapes of hard inclusions enclosed in the samples. For the first sample type (ST1) the inclusion had a form of a spherical cap with base diameter 8 mm and height 2.4 mm oriented for palpation from the convex side. For the second sample type (ST2) the inclusion had a form of a spherical cap with a base diameter 4.7 mm and height 1.7 mm also oriented for palpation from the convex side. For the third sample type (ST3) the inclusions were the same as for ST2, but they were oriented for palpation from the flat side. For all sample types the inclusions were located in the center at height of approximately 3 mm. Thus, sample types were similar with a difference in either size or convexity of the inclusion. These samples simulated tissue with malignant neoplasms.

In total, 55 tactile examinations of the described samples were performed using MTEC. The contact angle was kept approximately equal to 90°, and inclusions were located close to the center of the operating head surface. We performed twenty-two, seventeen, and sixteen examinations for samples of ST1, ST2, and ST3 types, respectively. For each examination one tactile frame was selected, namely, the one with the largest standard deviation (SD) of values, and other tactile frames were disregarded. Visualization of tactile frames for each sample type is presented in Figures 2(a)–2(c).

Figure 2

(a–c) Examples of tactile frames for examinations of ST1 (a), ST2 (b), and ST3 (c) type samples. Pressure values are scaled to the $[0, 255]$ segment and color-coded. (d) Correspondence between sensors and attributes from the new attribute space. Each hexagon represents one sensor. Middle sensors are colored in light gray; outer sensors are colored in dark gray. The main diagonals are shown by orange lines; the secondary diagonals are shown by blue lines. Centers of the hexagons that represent sensors belonging to both main and secondary diagonals are colored in red; those belonging only to main diagonals, in orange; those belonging only to secondary diagonals, in blue.

Thus, each examination was associated with a point in $\mathbb{R}^{19}$, and the total number of points was 55. This set of points was clustered using the developed clustering algorithm, and the results were compared with the results of k-means clustering ($k = 3$, Euclidean distance; see, e.g., [1]), which was used as a reference. The Scikit-learn implementation [18] of the k-means algorithm was utilized. Adjusted and raw Rand indices (clustering result versus original classes; see, e.g., [1]) were used as measures of the clustering quality. Note that both clustering algorithms use random initialization, so multiple runs were performed for clustering quality estimation (specifically, 100 runs were performed to estimate the Rand index for each algorithm with given parameters).

The results produced by both the proposed algorithm and the k-means algorithm were unsatisfactory. However, the poor quality of the resulting clustering was predictable, as examination of a single sample can result in tactile frames that are essentially different with respect to representation by a point in $\mathbb{R}^{19}$ due to rotation and slight shifts of the tactile mechanoreceptor.

To get better results we mapped the data to a new 9-dimensional space of attributes. The new attributes included

(i) SD of all values in a tactile frame;

(ii) mean and SD of the values corresponding to 7 middle sensors;

(iii) mean and SD of the values corresponding to 12 outer sensors;

(iv) mean and SD of the values corresponding to sensors that belong to the main diagonals (3 diagonals each consisting of 5 sensors, 13 sensors in total; see Figure 2(d) for details);

(v) mean and SD of the values corresponding to sensors that belong to the secondary diagonals (6 diagonals each consisting of 4 sensors, 12 sensors in total; see Figure 2(d) for details).

These attributes are robust to rotations by multiples of 60°. The values of mean and SD were computed after scaling the values to the $[0, 1]$ segment.
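The construction of the nine attributes can be sketched under an assumed sensor layout. The sketch models the 19 sensors as a hexagonal grid in cube coordinates, which reproduces the group sizes stated above (7, 12, 13, and 12 sensors); the actual sensor numbering of MTEC, the use of the population SD, and all names in the code are assumptions.

```python
from statistics import mean, pstdev

# Hypothetical layout: cube coordinates (x, y, z) with x + y + z = 0 and
# max(|x|, |y|, |z|) <= 2, which gives a hexagon of 19 cells.
CELLS = [
    (x, y, -x - y)
    for x in range(-2, 3)
    for y in range(-2, 3)
    if max(abs(x), abs(y), abs(x + y)) <= 2
]

MIDDLE = [c for c in CELLS if max(map(abs, c)) <= 1]  # center and first ring
OUTER = [c for c in CELLS if max(map(abs, c)) == 2]   # second ring
MAIN = [c for c in CELLS if 0 in c]                   # three lines through the center
SECONDARY = [c for c in CELLS if 1 in c or -1 in c]   # six lines next to the main ones


def features(frame):
    """frame: dict mapping a cell to a pressure value scaled to [0, 1]."""
    attributes = [pstdev(list(frame.values()))]
    for group in (MIDDLE, OUTER, MAIN, SECONDARY):
        values = [frame[c] for c in group]
        attributes += [mean(values), pstdev(values)]
    return attributes


def rotate60(frame):
    """Rotate the frame by 60 degrees around the central sensor."""
    return {(-z, -x, -y): v for (x, y, z), v in frame.items()}


# Arbitrary synthetic frame used only to exercise the code.
frame = {c: (i * 7 % 19) / 18 for i, c in enumerate(CELLS)}
print(features(frame))
```

Since each of the four sensor groups is mapped onto itself by a rotation of 60°, the attributes of `rotate60(frame)` coincide with those of `frame` up to rounding.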

Transition to the new attribute space essentially improved the clustering quality, but our algorithm left 10–14 points as outliers ($c_{\min}$ was set equal to 8; the values of $s$, $s_{\min}$, and $\lambda$ were set equal to 0.9, 0.3, and 0.8, respectively, and $\sigma$ was set equal to 0.27 for all attributes). A representative result of one run is presented in Table 2. Then we assigned the outlier points to the obtained clusters using the k-nearest neighbors algorithm ($k = 8$, unweighted; see, e.g., [19]). A representative result for one run is presented in Table 3.
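The assignment of outliers by an unweighted k-nearest neighbors vote can be sketched as follows. The use of the Euclidean distance, the choice to vote only among the originally clustered points, and the function name are assumptions of the sketch; the toy data are synthetic.

```python
from collections import Counter
from math import dist


def assign_outliers(clusters, outliers, k=8):
    """Attach every outlier to the majority cluster among its k nearest clustered points."""
    labeled = [(p, label) for label, members in enumerate(clusters) for p in members]
    assigned = [list(members) for members in clusters]
    for point in outliers:
        nearest = sorted(labeled, key=lambda item: dist(item[0], point))[:k]
        votes = Counter(label for _, label in nearest)
        assigned[votes.most_common(1)[0][0]].append(point)
    return assigned


clusters = [[(0, 0), (0, 1), (1, 0)], [(5, 5), (5, 6), (6, 5)]]
outliers = [(0.5, 0.5), (5.5, 5.5)]
print(assign_outliers(clusters, outliers, k=3))
```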

Table 2

Correspondence between the original classes and the clusters constructed by the proposed algorithm (with outliers).

	1st cluster (9 points)	2nd cluster (13 points)	3rd cluster (22 points)	Unclustered (11 points)
ST1 (22 points)	9	1	5	7
ST2 (17 points)	0	12	3	2
ST3 (16 points)	0	0	14	2

Table 3

Correspondence between the original classes and the clusters constructed by the proposed algorithm (no outliers).

	1st cluster (11 points)	2nd cluster (17 points)	3rd cluster (27 points)
ST1 (22 points)	11	3	8
ST2 (17 points)	0	14	3
ST3 (16 points)	0	0	16
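The raw and adjusted Rand indices of a single run can be computed directly from such a correspondence table. The sketch below applies the standard pair-counting formulas to the counts of Table 3; the function name is illustrative.

```python
from math import comb


def rand_indices(table):
    """Return (adjusted, raw) Rand index for a class-by-cluster contingency table."""
    n = sum(map(sum, table))
    same_both = sum(comb(v, 2) for row in table for v in row)
    same_class = sum(comb(sum(row), 2) for row in table)
    same_cluster = sum(comb(sum(col), 2) for col in zip(*table))
    total = comb(n, 2)
    # Raw index: share of point pairs on which the two partitions agree.
    raw = (total + 2 * same_both - same_class - same_cluster) / total
    # Adjusted index: correction for the agreement expected by chance.
    expected = same_class * same_cluster / total
    adjusted = (same_both - expected) / ((same_class + same_cluster) / 2 - expected)
    return adjusted, raw


TABLE_3 = [[11, 3, 8], [0, 14, 3], [0, 0, 16]]
adjusted, raw = rand_indices(TABLE_3)
print(round(adjusted, 2), round(raw, 2))
```

For the representative run of Table 3 this gives an adjusted index of about 0.36 and a raw index of about 0.71.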

Table 4 contains medians and interquartile ranges of the Rand indices together with timing information.

Table 4

Dependency of Rand index values and the running time for our and k-means clustering methods on number of iterations performed (100 program runs for each value). Values of Rand index are presented in terms of medians and interquartile ranges (IQR).

Number of iterations	Clustering method	Rand index median (adjusted/raw)	Rand index IQR (adjusted/raw)	Average running time (in seconds)
20	Our method (with outliers)	0.43/0.73	0.12/0.05	0.8
20	Our method (no outliers)	0.39/0.73	0.10/0.04	0.8
20	k-means	0.32/0.70	0.21/0.09	0.02
50	Our method (with outliers)	0.43/0.73	0.08/0.05	2.4
50	Our method (no outliers)	0.39/0.73	0.06/0.03	2.5
50	k-means	0.27/0.68	0.20/0.09	0.05
100	Our method (with outliers)	0.42/0.74	0.08/0.04	4.2
100	Our method (no outliers)	0.39/0.73	0.06/0.03	4.3
100	k-means	0.31/0.70	0.20/0.09	0.09

As one can see, the proposed algorithm has an acceptable running time, and both our algorithm and the k-means algorithm already reach a quality plateau at 20 iterations.

The advantage of the proposed algorithm over the k-means algorithm with respect to the clustering quality was statistically significant. For example, for 20 iterations and the adjusted Rand index, the comparison of our algorithm with outliers and k-means on 100 runs resulted in a Mann–Whitney U-test two-tailed p value equal to $1.0 \cdot 10^{-10}$. As outliers are the points that are the most difficult for clustering, the advantage of our algorithm complemented by kNN-attributing of outliers to clusters over k-means was lower but still firmly significant, with a Mann–Whitney U-test two-tailed p value equal to $9.5 \cdot 10^{-4}$.

Interestingly, the transition to the new attribute space improved the quality of our algorithm more than the quality of the k-means clustering. For example, for 20 iterations, the adjusted Rand index, and 100 runs, the comparison of the clustering quality for the initial attribute space and the new attribute space resulted in Mann–Whitney U-test two-tailed p values not exceeding $10^{-12}$ for both the “with outliers” and “no outliers” versions of our algorithm, while for k-means the p value was 0.43.

6. Conclusions

In this paper we proposed a greedy clustering algorithm based on interval pattern concepts. The obtained theoretical estimate on algorithm complexity proved computational feasibility for high-dimensional spaces, and the validation on experimental data demonstrated high quality of the resulting clustering in comparison with conventional clustering algorithms such as k-means.

Particular results obtained during validation, such as a new attribute space for tactile frames registered by the Medical Tactile Endosurgical Complex, have individual significance as they provide new opportunities for the medical domain applications aimed at automated analysis of tactile images.

Data Access

Dataset of tactile frames used for the validation and the Python script that implements the developed clustering algorithm are available upon request from the authors.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors thank Dr. Alexey V. Galatenko and Dr. Vladimir V. Galatenko for valuable comments and discussions. The research was supported by the Russian Science Foundation (Project 16-11-00058 “The Development of Methods and Algorithms for Automated Analysis of Medical Tactile Information and Classification of Tactile Images”).

References

[1] Everitt, B. S., Landau, Leese, Stahl. Cluster Analysis, 5th ed. Wiley Series in Probability and Statistics. John Wiley & Sons, Chichester, UK, 2011. DOI: 10.1002/9780470977811. MR3155074.

[2] Ganter, Wille. Formal Concept Analysis: Mathematical Foundations. Springer, Berlin, Germany, 1999. DOI: 10.1007/978-3-642-59830-2. MR1707295. Zbl0909.06001.

[3] Ganter, Kuznetsov, S. O. Pattern Structures and Their Projections. In: Conceptual Structures: Broadening the Base, Lecture Notes in Computer Science, vol. 2120, pp. 129–142. Springer Berlin Heidelberg, Berlin, Heidelberg, 2001. DOI: 10.1007/3-540-44583-8_10. Zbl0994.68147.

[4] Kaytoue, Duplessis, Kuznetsov, S. O., Napoli. Two FCA-based methods for mining gene expression data. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5548, pp. 251–266, 2009. DOI: 10.1007/978-3-642-01815-2_19. Scopus: 2-s2.0-67650519046.

[5] Kaytoue, Kuznetsov, S. O., Napoli, Duplessis. Mining gene expression data with pattern structures in formal concept analysis. Information Sciences. An International Journal, vol. 181, no. 10, pp. 1989–2001, 2011. DOI: 10.1016/j.ins.2010.07.007. MR2774500. Scopus: 2-s2.0-79952191194.

[6] Aronov, Har-Peled. On approximating the depth and related problems. SIAM Journal on Computing, vol. 38, no. 3, pp. 899–921, 2008. DOI: 10.1137/060669474. MR2421071. Scopus: 2-s2.0-55249118861.

[7] Chazelle, B. M., Lee, D. T. On a circle placement problem. Computing. Archives for Scientific Computing, vol. 36, no. 1-2, pp. 1–16, 1986. DOI: 10.1007/BF02238188. MR832926. Zbl0572.65051. Scopus: 2-s2.0-0022499805.

[8] de Figueiredo, C. M., da Fonseca, G. D. Enclosing weighted points with an almost-unit ball. Information Processing Letters, vol. 109, no. 21-22, pp. 1216–1221, 2009. DOI: 10.1016/j.ipl.2009.09.001. MR2573324. Zbl1206.68324. Scopus: 2-s2.0-70349738090.

[9] Matoušek. Range searching with efficient hierarchical cuttings. Discrete & Computational Geometry, vol. 10, no. 1, pp. 157–182, 1993. DOI: 10.1007/BF02573972. Scopus: 2-s2.0-51249161633.

[10] Eckstein, Hammer, P. L., Liu, Nediak, Simeone. The maximum box problem and its application to data analysis. Computational Optimization and Applications. An International Journal, vol. 23, no. 3, pp. 285–298, 2002. DOI: 10.1023/A:1020546910706. MR1936500. Scopus: 2-s2.0-0036891189.

[11] Bentley, J. L. Multidimensional binary search trees used for associative searching. Communications of the ACM, vol. 18, no. 9, pp. 509–517, 1975. DOI: 10.1145/361002.361007. Zbl0306.68061. Scopus: 2-s2.0-0016557674.

[12] Friedman, J. H., Bentley, J. L., Finkel, R. A. An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, vol. 3, no. 3, pp. 209–226, 1977. DOI: 10.1145/355744.355745.

[13] Lee, D. T., Wong, C. K. Worst-case analysis for region and partial region searches in multidimensional binary search trees and balanced quad trees. Acta Informatica, vol. 9, no. 1, pp. 23–29, 1977. DOI: 10.1007/BF00263763. Scopus: 2-s2.0-0017631930.

[14] Sadovnichy, Gabidullina, Sokolov, Galatenko, Budanov, Nakashidze. Haptic device in endoscopy. In: Proceedings of the 21st Medicine Meets Virtual Reality Conference, NextMed/MMVR 2014, pp. 365–368, USA, February 2014. DOI: 10.3233/978-1-61499-375-9-365. Scopus: 2-s2.0-84897787235.

[15] Barmin, Sadovnichy, Sokolov, Pikin, Amiraliev. An original device for intraoperative detection of small indeterminate nodules. European Journal of Cardio-thoracic Surgery, vol. 46, no. 6, pp. 1027–1031, 2014. DOI: 10.1093/ejcts/ezu161. Scopus: 2-s2.0-84928201989.

[16] Solodova, R. F., Galatenko, V. V., Nakashidze, E. R., Andreytsev, I. L., Galatenko, A. V., Senchik, D. K., Staroverov, V. M., Podolskii, V. E., Sokolov, M. E., Sadovnichy, V. A. Instrumental tactile diagnostics in robot-assisted surgery. Medical Devices: Evidence and Research, vol. 9, pp. 377–382, 2016. DOI: 10.2147/MDER.S116525. Scopus: 2-s2.0-84995694832.

[17] Staroverov, V. M., Galatenko, V. V., Zykova, T. V., Rakhmatulin, Y. I., Rukhovich, D. V., Podol'skii, V. E. Automated real-time correction of intraoperative medical tactile images: sensitivity adjustment and suppression of contact angle artifact. Applied Mathematical Sciences, vol. 10, pp. 2831–2842, 2016. DOI: 10.12988/ams.2016.67222.

[18] Pedregosa, Varoquaux, Gramfort. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011. MR2854348. Zbl1280.68189.

[19] Mitchell. Machine Learning. McGraw Hill, New York, NY, USA, 1997. Zbl0913.68167.