A Novel Research on Rough Clustering Algorithm



Introduction
With the fast development and widespread application of computer and network technology, more and more service data are available, and these data contain a huge mass of valuable information that is hard to detect. Therefore, many researchers have focused on this issue and carried out related work. Clustering was proposed with the goal of grouping similar objects into one cluster and dissimilar objects into different clusters [1][2][3][4][5][6][7]. At present, many clustering algorithms have been presented by scholars; perhaps the most widely employed is the classic k-means, which is applied everywhere. However, all these algorithms are very sensitive to the distribution of the data space, or, for the sake of efficiency, they compress the data with a possible loss of quality, so the clustering result is sometimes poor. Many control theories [8][9][10][11] have also been discussed in connection with this issue. Rough set theory was presented by Professor Pawlak at Warsaw University of Technology in the 1980s [12][13][14][15][16]. It is a theory for simplifying data, especially uncertain and incomplete data. Its main characteristic is that it uses only the information provided by the data themselves, without any additional information or prior knowledge, to pack data, discretize data, reduce data attributes [17][18][19], and so forth. Therefore, a new clustering algorithm based on rough sets is presented in this paper.

Related Definitions
Definition 1 (communication system). Let a communication system be S = (U, A, V, f), where U is a nonempty finite set of objects, U = {x₁, x₂, . . ., xₙ}, in which xᵢ is one object; A is the set of object attributes, divided into two disjoint sets, the conditional attribute set C and the decision attribute set D, with A = C ∪ D; V is the set of attribute values, V = ∪Vₐ, a ∈ A, where Vₐ is the domain of attribute a; f is a mapping function f : U × A → V, which gives an attribute value to each attribute of every object; that is, ∀a ∈ A, xᵢ ∈ U, f(xᵢ, a) ∈ Vₐ.
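To make Definition 1 concrete, the following is a minimal Python sketch of a communication (decision) table S = (U, A, V, f) stored as plain dictionaries; the attribute names and values here are illustrative assumptions, not the paper's data set.

    # Decision table S = (U, A, V, f): objects x1..x4, conditional
    # attributes c1, c2, and one decision attribute d (A = C ∪ D).
    table = {
        "x1": {"c1": 1, "c2": 0, "d": 1},
        "x2": {"c1": 1, "c2": 1, "d": 1},
        "x3": {"c1": 0, "c2": 1, "d": 0},
        "x4": {"c1": 0, "c2": 0, "d": 0},
    }

    def f(x, a):
        """The mapping f : U × A → V, returning the value of attribute a on object x."""
        return table[x][a]

    def domain(a):
        """The domain V_a of attribute a, collected from the table."""
        return {row[a] for row in table.values()}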
Definition 2 (interval partition). Let a communication system be S = (U, A, V, f), and let k(D) be the number of decision classes; a breakpoint of the domain Vₐ formed by an attribute a ∈ C is marked (a, c).

Definition 6 (addition rule of set characteristic vectors). If the number of objects is n, the number of attributes describing each object is m, and X and Y are two disjoint object subsets with set characteristic vectors SFV(X) = (|X|, S(X), SFD(X)) and SFV(Y) = (|Y|, S(Y), SFD(Y)), then the addition rule of set characteristic vectors is defined as

SFV(X) + SFV(Y) = (|X| + |Y|, S(X) ∩ S(Y), SFD(X ∪ Y)),

where SFD(X ∪ Y) = |S(X) ∩ S(Y)| / m.
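Since the defining text for S(X) is garbled in this copy, the following Python sketch assumes S(X) is the set of (position, value) pairs shared by every object of X, with SFD(X) = |S(X)|/m; under that assumption the addition rule of Definition 6 reduces to a direct set intersection.

    def sfv(objects, m):
        """Set characteristic vector SFV(X) = (|X|, S(X), SFD(X)) of a list
        of m-dimensional object rows; S(X) collects the (position, value)
        pairs on which all rows agree (an assumption, see above)."""
        common = {(j, objects[0][j]) for j in range(m)
                  if all(row[j] == objects[0][j] for row in objects)}
        return (len(objects), common, len(common) / m)

    def sfv_add(v1, v2, m):
        """Addition rule of Definition 6 for disjoint X and Y:
        |X ∪ Y| = |X| + |Y|, S(X ∪ Y) = S(X) ∩ S(Y), SFD = |S|/m."""
        n1, s1, _ = v1
        n2, s2, _ = v2
        s = s1 & s2
        return (n1 + n2, s, len(s) / m)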

Related Theorems
Because the algorithm requires them, and based on the relation between the conditional attributes and the decision attributes, as well as the related concepts of rough set theory, the following theorems are introduced and proved in this paper.

Theorem 8. Let ∀B ⊆ C, ∀a ∈ C with a ∉ B, and let NEG_B(D) = U − POS_B(D) be the rough negative domain; then there exists the following:

Proof. For ∀a ∈ C with a ∉ B, there are two conditions: (1) if a is a redundant attribute, there exists the following:

Therefore, the theorem holds.

Theorem (additive property of set characteristic vectors). For two disjoint object subsets X and Y, SFV(X ∪ Y) = SFV(X) + SFV(Y).

Proof. (1) Because there is no intersection between the sets X and Y, and the numbers of their elements are |X| and |Y|, the number of elements of the set X ∪ Y is |X| + |Y|.

(2) First, let us prove that S(X ∪ Y) ⊆ S(X) ∩ S(Y). For any j ∈ S(X ∪ Y), all objects in the set X ∪ Y have the same attribute property in the place whose ordinal number is j; because X ⊆ X ∪ Y, all objects in the set X have the same attribute property in that place too, so j ∈ S(X); by the same reasoning, j ∈ S(Y) as well; hence, S(X ∪ Y) ⊆ S(X) ∩ S(Y).

On the other hand, it can be proved that S(X) ∩ S(Y) ⊆ S(X ∪ Y). Indeed, for any j ∈ S(X) ∩ S(Y), all objects in the set X have the same attribute property in the place whose ordinal number is j, and all objects in the set Y have the same attribute property in that place too; then, in the set X ∪ Y, all objects must have the same attribute property in that place; that is, j ∈ S(X ∪ Y), so S(X) ∩ S(Y) ⊆ S(X ∪ Y). Together, S(X ∪ Y) = S(X) ∩ S(Y) = S.

(3) Based on the definition of set similarity and S(X ∪ Y) = S, we can come to the conclusion that SFD(X ∪ Y) = |S(X ∪ Y)| / m = |S| / m, and, based on the definition of the characteristic vector, it is clear that SFV(X ∪ Y) = SFV(X) + SFV(Y). To sum up, the theorem has been proved.
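As a quick numerical check of the additive property, using the sfv and sfv_add sketches above (the values are illustrative only):

    X = [(1, 0, 1), (1, 1, 1)]   # objects of X agree at positions 0 and 2
    Y = [(1, 0, 0), (1, 0, 1)]   # objects of Y agree at positions 0 and 1
    m = 3

    direct = sfv(X + Y, m)                       # SFV computed on X ∪ Y directly
    combined = sfv_add(sfv(X, m), sfv(Y, m), m)  # SFV via the addition rule
    assert direct == combined                    # (4, {(0, 1)}, 1/3) both ways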

Algorithm Description
Before realizing the rough clustering, the discretization breakpoints should be initialized first. ... Set t = 0, and calculate the relative information entropy H₀ of the source information table.
Step 1. Apply the attribute significance formula to calculate the key attributes of the information table and eliminate the redundant attributes.
Step 2. Based on the concept of the hypercube, generalize every attribute as follows: (1) according to the decision attribute, cluster the instances of the information table S; (2) generalize the instances which belong to the same class.
Step 3. Calculate the breakpoint set P.
Step 5. According to the breakpoint set P, discretize the information table and calculate the relative information entropy H; if H ≥ H₀, turn to Step 3; else turn to Step 6.
Step 6. According to the breakpoint set P, map the attribute values of the information table onto appropriate integers.
Step 7. In the new information table which has been discretized, each object sets up a new set, marked Xᵢ⁽⁰⁾, i ∈ {1, 2, . . ., n}. Based on the additive property theorem, calculate the set characteristic vector of each combined set. After combination, if the combined set's internal similarity is greater than the lower similarity limit λ of the class objects, then X₁⁽⁰⁾ and X₂⁽⁰⁾ combine to form one set, marked as the initial class X₁⁽¹⁾; if the combined set's internal similarity is less than the lower similarity limit λ, then X₁⁽⁰⁾ and X₂⁽⁰⁾ each become a separate new initial class, marked X₁⁽¹⁾ and X₂⁽¹⁾; furthermore, the number of classes is marked k. A sketch of this combination procedure is given at the end of this section.
Step 9. Among the finally established classes Xᵢ⁽¹⁾, i ∈ {1, 2, . . ., k}, the ones which contain fewer objects are isolated-object classes, and they can be removed according to the actual demands; the classes left are the final result of the clustering.
For convenience of illustrating the rough clustering algorithm, the flow chart is shown in Figure 1.
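The combination procedure of Steps 7 to 9 can be summarized by the following simplified single-pass Python sketch, built on the sfv and sfv_add functions above; the exact combination order and tie handling of the paper may differ.

    def rough_cluster(rows, lam):
        """rows: discretized objects as equal-length tuples;
        lam: the lower similarity limit λ of a class."""
        m = len(rows[0])
        classes = []                          # list of (members, SFV) pairs
        for row in rows:                      # one scan of the information table
            v = sfv([row], m)
            for i, (members, cv) in enumerate(classes):
                merged = sfv_add(cv, v, m)
                if merged[2] >= lam:          # internal similarity above λ: combine
                    classes[i] = (members + [row], merged)
                    break
            else:                             # below λ for every class: new class
                classes.append(([row], v))
        return classes                        # small classes are isolated objects (Step 9)

With the paper's threshold λ = 0.5, a class absorbs a new object only when the combined set still shares at least half of the attribute values.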

Simulation
The source information table is a decision table of concrete freezing resistance. In this table, the conditional attributes a₁, a₂, a₃, a₄, a₅ are all continuous attributes, and their values are five checking results which describe the condensability of the concrete. The information table has one decision attribute, class: a value of 1 means that the freezing resistance of the concrete is good, and a value of 0 means that it is bad. The similarity threshold λ of one class is set to 0.5.
Based on the attribute significance formula, we can calculate that attribute a₃ is a redundant attribute, so we delete it from the information table and obtain the attribute reduction {a₁, a₂, a₄, a₅} of the source information table.
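This excerpt does not spell out the attribute significance formula; a standard rough set choice is the dependency-based significance sig(a) = γ_C(D) − γ_{C−{a}}(D), where an attribute a is redundant when sig(a) = 0. The following Python sketch implements that measure under this assumption.

    from collections import defaultdict

    def gamma(rows, cond_idx, dec_idx):
        """Dependency degree γ_B(D) = |POS_B(D)| / |U| for the conditional
        attribute indices in cond_idx and the decision index dec_idx."""
        blocks = defaultdict(set)
        for row in rows:
            blocks[tuple(row[i] for i in cond_idx)].add(row[dec_idx])
        # an equivalence block lies in the positive region iff it is decision-pure
        pure = sum(1 for row in rows
                   if len(blocks[tuple(row[i] for i in cond_idx)]) == 1)
        return pure / len(rows)

    def redundant(rows, cond_idx, dec_idx, a):
        """True when removing attribute a does not lower γ, i.e. sig(a) = 0."""
        reduced = [i for i in cond_idx if i != a]
        return gamma(rows, reduced, dec_idx) == gamma(rows, cond_idx, dec_idx)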
By using the continuous attribute discretization method presented in this algorithm, after discretizing the information table, we obtain the discrete decision Table 1.

Conclusions
From the clustering result, we can see that only data object x₅ was wrongly clustered into a different class; the clustering results of the other data objects completely accord with the class labels known beforehand. From the operation of the numerical example, we can find that the clustering algorithm presented in this paper has the following advantages: (1) because the data are pretreated by applying rough set theory, the data structure is simplified, the clustering algorithm is simple to implement, and the cluster quality is also improved; (2) the clustering algorithm is not affected by the distributional characteristics of the data space, and, based on the set eigenvalues, isolated objects can be eliminated; in the example presented, data object x₁₄ is an isolated object; (3) because set eigenvectors are the operands of this clustering algorithm, and the clustering and dividing of data objects can be finished by scanning the information table only once, the algorithm is efficient.