This study addresses the problem that traditional clustering algorithms are sensitive to the distribution of the data space. A novel clustering algorithm that combines rough set theory with conventional clustering is proposed. The proposed rough clustering algorithm takes the consistency between the condition attributes and the decision attributes of the information table as its guiding principle, and it uses data hypercubes and information entropy to reduce and discretize the data attributes. On this basis, by applying the addition rule for set characteristic vectors, the data objects can be clustered with only one scan of the information table. Experiments show that the proposed algorithm is feasible and efficient.
1. Introduction
With the rapid development and widespread application of computer and network technology, more and more service data are available, and these data contain a huge amount of valuable information that is hard to detect. Therefore, many researchers have focused on this issue and carried out related work. Clustering was proposed with the goal of grouping similar objects into the same cluster and dissimilar objects into different clusters [1–7]. At present, many clustering algorithms have been presented; perhaps the most widely used is the classic k-means algorithm, with applications everywhere. However, because all these algorithms are very sensitive to the distribution of the data space, or because the data are compressed with a possible loss of quality in order to improve efficiency, the clustering result is sometimes poor. Many control-theoretic approaches [8–11] have also been discussed in connection with this issue. Rough set theory was introduced by Professor Pawlak at the Warsaw University of Technology in the 1980s [12–16]. It is a theory for simplifying data, especially uncertain and incomplete data. Its main characteristic is that it uses only the information provided by the data themselves and needs no additional information or prior knowledge to group data, discretize data, or reduce data attributes [17–19]. Therefore, a new clustering algorithm based on rough sets is presented in this paper.
2. Related Definitions
Definition 1 (information system).
Let an information system be S=(U,A,V,f), where U is a nonempty finite set of objects, U={x1,x2,…,xn}, and xi is one object; A is the set of object attributes, divided into two disjoint sets, the condition attribute set C and the decision attribute set D, with A=C∪D; V is the set of attribute values, V=∪a∈A Va, where Va is the domain of attribute a; f is a mapping function U×A→V that assigns a value to each attribute of every object, that is, ∀a∈A, xi∈U, f(xi,a)∈Va.
Definition 2 (interval partition).
Let an information system be S=(U,A,V,f), and let r(d) be the number of decision classes. A breakpoint in the domain Va of attribute a∈A is written (a,c). If l_a=c_0^a<c_1^a<c_2^a<⋯<c_{k_a}^a<c_{k_a+1}^a=r_a, then Va=[c_0^a,c_1^a)∪[c_1^a,c_2^a)∪⋯∪[c_{k_a}^a,c_{k_a+1}^a]. Any breakpoint set {(a,c_1^a),(a,c_2^a),…,(a,c_{k_a}^a)} on the domain Va=[l_a,r_a] defines an interval partition Pa of attribute a: Pa={[c_0^a,c_1^a),[c_1^a,c_2^a),…,[c_{k_a}^a,c_{k_a+1}^a]}.
Definition 3 (information entropy).
Let a decision table be S=(U,A,V,f), X⊆U, U/IND(D)={Y1,Y2,…,Yn}, X/IND(C)={X1,X2,…,Xm}. For each subset Xj, let x_ij be the number of samples of class Yi in Xj and p_ij=x_ij/|Xj|; the information entropy of Xj is I(x_1j,x_2j,…,x_nj)=−∑_{i=1}^{n} p_ij log(p_ij).
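As an illustration, the information entropy of a subset can be computed directly from its per-class sample counts. A minimal sketch, assuming a base-2 logarithm and taking 0·log(0)=0 by convention (the function name is our own):

```python
import math

def comentropy(class_counts):
    """Information entropy of a subset X_j (Definition 3).
    class_counts[i] is x_ij, the number of samples of decision
    class Y_i that fall into X_j."""
    total = sum(class_counts)            # |X_j|
    entropy = 0.0
    for x_ij in class_counts:
        if x_ij > 0:                     # 0 * log(0) is taken as 0
            p_ij = x_ij / total          # p_ij = x_ij / |X_j|
            entropy -= p_ij * math.log2(p_ij)
    return entropy
```

A subset split evenly between two classes gives entropy 1.0, and a pure subset gives 0.0, so lower entropy indicates a more consistent discretization.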
Definition 4 (similarity of set).
If the number of objects is n, each object is described by m attributes, and the attribute values are discrete, let X be one subset of objects, with the number of its objects written |X|. If the number of attributes on which all objects in this subset take the same discrete value is a, the similarity of the set X, SFD(X), is defined as SFD(X)=a/m.
Definition 5 (characteristic vector of set).
If the number of objects is n, each object is described by m attributes, and X is one subset of objects with |X| elements, let a be the number of attributes on which all objects in the subset take the same discrete value, and let the indices of these attributes be js1,js2,…,jsa, so that S(X)={js1,js2,…,jsa}. The characteristic vector of the object set X is SFV(X)=(|X|,S(X),SFD(X)), where a=|S(X)| and SFD(X)=|S(X)|/m.
Definition 6 (addition rule of set characteristic vector).
If the number of objects is n, each object is described by m attributes, and X and Y are two disjoint subsets of objects whose set characteristic vectors are SFV(X)=(|X|,S(X),SFD(X)) and SFV(Y)=(|Y|,S(Y),SFD(Y)), then the addition rule of set characteristic vectors is defined as
(1)SFV(X)+SFV(Y)=(N,S,SFD),
where N=|X|+|Y|, S=S(X)∩S(Y), and SFD=|S|/m.
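Definitions 4–6 can be sketched in Python. The names `sfv` and `sfv_add` are our own; as a sketch-level safeguard not spelled out in the definitions, S(X) is stored as (attribute index, shared value) pairs so that the intersection S(X)∩S(Y) only keeps attributes on which both sets agree with the same value:

```python
def sfv(objects, m):
    """Characteristic vector SFV(X) = (|X|, S(X), SFD(X)) (Definition 5).
    objects: list of m-tuples of discretized attribute values.
    S(X) is the set of (index, value) pairs on which all objects agree,
    and SFD(X) = |S(X)| / m (Definition 4)."""
    s = set()
    for j in range(m):
        values = {obj[j] for obj in objects}
        if len(values) == 1:             # all objects agree on attribute j
            s.add((j, values.pop()))
    return (len(objects), s, len(s) / m)

def sfv_add(vx, vy, m):
    """Addition rule SFV(X) + SFV(Y) = (N, S, SFD) (Definition 6):
    N = |X| + |Y|, S = S(X) ∩ S(Y), SFD = |S| / m."""
    s = vx[1] & vy[1]
    return (vx[0] + vy[0], s, len(s) / m)
```

For two singleton sets {(0,0,0,0)} and {(0,0,1,1)} with m=4, the addition rule yields N=2, S covering the first two attributes, and SFD=0.5.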
3. Related Theorems
To support the algorithm, and based on the relation between the condition attributes and the decision attributes as well as the related concepts of rough set theory, the following theorems are introduced and proved in this paper.
Theorem 7.
Let a decision table be S=(U,A,V,f), where U is a nonempty finite set of objects and A is the attribute set with A=C∪D and C∩D=∅; for any B⊆C, let U′=U−POS_B^U(D) be the rough negative domain. If U′/IND(B∪{c})={m1,m2,…,mp} and U′/IND(D)={n1,n2,…,nq}, then on this rough negative domain POS_{B∪{c}}^{U′}(D)=⋃_{i=1}^{q}(B∪{c})_(ni).
Proof.
Consider the following:
(2)POS_{B∪{c}}^{U′}(D)=⋃_{i=1}^{q}(B∪{c})_(ni)=⋃_{i=1}^{q}⋃{mj∈U′/IND(B∪{c}):mj⊆ni}.
Theorem 8.
For any B⊆C and any c∈C with c∉B, let U′=U−POS_B^U(D) be the rough negative domain; then |POS_{B∪{c}}^{U′}(D)|=|POS_{B∪{c}}^{U}(D)|−|POS_B^{U}(D)|.
Proof.
For any c∈C with c∉B, there are two cases:
if c is a redundant attribute, then POS_{B∪{c}}^{U}(D)=POS_B^{U}(D) and |POS_{B∪{c}}^{U′}(D)|=|POS_B^{U′}(D)|=0; so |POS_{B∪{c}}^{U}(D)|−|POS_B^{U}(D)|=0=|POS_{B∪{c}}^{U′}(D)|;
if c is an important attribute, then U′=U−POS_B^U(D), so U=U′∪POS_B^U(D); that is, POS_{B∪{c}}^{U}(D)=POS_B^{U}(D)∪POS_{B∪{c}}^{U′}(D); so |POS_{B∪{c}}^{U}(D)|=|POS_{B∪{c}}^{U′}(D)|+|POS_B^{U}(D)|.
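Theorem 8 can be checked numerically on a small decision table. The helper below is our own sketch (names are illustrative); it computes the positive region POS_B^U(D) by grouping the objects into B-indiscernibility classes and keeping the classes that are consistent on the decision attribute:

```python
from collections import defaultdict

def positive_region(universe, table, B, d):
    """POS_B^U(D): the union of the B-indiscernibility classes that
    fall entirely inside a single decision class.
    table[x] maps attribute name -> value for object x."""
    blocks = defaultdict(list)
    for x in universe:
        blocks[tuple(table[x][a] for a in B)].append(x)
    pos = set()
    for block in blocks.values():
        if len({table[x][d] for x in block}) == 1:  # consistent block
            pos.update(block)
    return pos

# Toy decision table: condition attributes b, c; decision attribute d
table = {
    1: {'b': 0, 'c': 0, 'd': 0},
    2: {'b': 0, 'c': 1, 'd': 1},
    3: {'b': 1, 'c': 0, 'd': 0},
    4: {'b': 1, 'c': 0, 'd': 0},
}
U = set(table)
pos_B = positive_region(U, table, ['b'], 'd')          # {3, 4}
U_neg = U - pos_B                                      # rough negative domain U'
pos_Bc = positive_region(U, table, ['b', 'c'], 'd')    # whole universe here
pos_Bc_neg = positive_region(U_neg, table, ['b', 'c'], 'd')
# Theorem 8: |POS_{B∪{c}}^{U'}(D)| = |POS_{B∪{c}}^{U}(D)| - |POS_B^{U}(D)|
print(len(pos_Bc_neg) == len(pos_Bc) - len(pos_B))     # True
```

On this table, adding attribute c resolves the inconsistency between objects 1 and 2, and the identity of Theorem 8 holds with 2 = 4 − 2.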
Theorem 9.
If the number of objects is n, each object is described by m attributes, X and Y are two disjoint subsets of objects, and X combines with Y to form the set X∪Y, then their set characteristic vectors are
(3)SFV(X)=(|X|,S(X),SFD(X)), SFV(Y)=(|Y|,S(Y),SFD(Y)), SFV(X∪Y)=(|X∪Y|,S(X∪Y),SFD(X∪Y)), SFV(X)+SFV(Y)=(N,S,SFD).
Therefore,
(4)SFV(X∪Y)=SFV(X)+SFV(Y).
Proof.
(1) Since there is no intersection between the sets X and Y, and the numbers of their elements are |X| and |Y|, the number of elements of X∪Y is |X|+|Y|; that is, |X∪Y|=|X|+|Y|=N.
(2) First, we prove that S(X∪Y)⊆S(X)∩S(Y). For any j∈S(X∪Y), all objects in X∪Y have the same value on the attribute with index j; since X⊆X∪Y, all objects in X also have the same value on attribute j, so j∈S(X); by the same reasoning, j∈S(Y). Hence S(X∪Y)⊆S(X)∩S(Y).
On the other hand, it can be proved that S(X)∩S(Y)⊆S(X∪Y): for any j∈S(X)∩S(Y), all objects in X have the same value on attribute j, and all objects in Y have the same value on attribute j; then all objects in X∪Y have the same value on attribute j, that is, j∈S(X∪Y), so S(X)∩S(Y)⊆S(X∪Y). Therefore S(X∪Y)=S(X)∩S(Y)=S.
(3) Based on the definition of set similarity and S(X∪Y)=S, we can come to the conclusion that
(5)SFD(X∪Y)=|S(X∪Y)|/m=|S|/m=SFD.
And based on the definition of characteristic vector, it is clear that
(6)SFV(X∪Y)=(|X∪Y|,S(X∪Y),SFD(X∪Y))=(N,S,SFD)=SFV(X)+SFV(Y).
To sum up, the theorem has been proved.
4. Algorithm Description
Before the rough clustering is performed, the discrete breakpoints are initialized: set W=∅ and calculate the relative information entropy E0 of the source information table.
Step 1.
Apply the attribute significance formula to identify the key attributes of the information table, and eliminate the redundant attributes.
Step 2.
Based on the concept of the hypercube, generalize every attribute as follows:
according to the decision attribute, cluster the instances of the information table;
generalize the instances which belong to the same class.
Then calculate the breakpoint set W.
Step 3.
According to the breakpoint set, calculate the relative information entropy E of the information table. If E>E0, go to Step 6; otherwise go to Step 4.
Step 4.
After the overall discretization of the information table, partially discretize the newly divided regions as follows.
Let the two discrete intervals be [lij,uij] and [lik,uik]; if the new class clustered from the instance subsets of these two intervals does not contain any inconsistent instance, the two intervals are merged into one class, which forms a new breakpoint set W.
Step 5.
According to the breakpoint set W, discretize the information table and calculate the relative information entropy E; if E≥E0, go to Step 3; otherwise go to Step 6.
Step 6.
According to the breakpoint set W, map the attribute values of the information table onto appropriate integers.
Step 7.
In the new, discretized information table, each object initially forms its own set, marked Xi(0), i∈{1,2,…,n}. Based on the additive property theorem (Theorem 9), calculate
(7)SFV(X1(0)∪X2(0))=SFV(X1(0))+SFV(X2(0)).
If the similarity of the combined set is not less than the lower similarity limit b of a class, then X1(0) and X2(0) are combined into one set, marked as the initial class X1(1); if the similarity of the combined set is less than b, then X1(0) and X2(0) each become a new initial class, marked X1(1) and X2(1). The number of classes is marked c.
Step 8.
For the next set X3(0), calculate SFV(X3(0)∪Xk(1)) for k∈{1,2,…,c}, and seek i0 such that SFD(X3(0)∪Xi0(1))=max_{k∈{1,2,…,c}} SFD(X3(0)∪Xk(1)).
If SFD(X3(0)∪Xi0(1)) is not less than the lower similarity limit b, then X3(0) and Xi0(1) are combined into one set, still marked Xi0(1) after updating; if SFD(X3(0)∪Xi0(1)) is less than b, then X3(0) becomes a new initial class, marked Xc+1(1), and c=c+1. The same operation is carried out in turn for the remaining Xi(0).
Step 9.
Among the finally established classes Xk(1), k∈{1,2,…,c}, those containing only a few objects are isolated-object classes; they can be removed according to the actual demands, and the remaining classes are the final clustering result.
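Steps 7–9 above can be sketched as a single-scan loop. This is an illustrative reconstruction, not the authors' code: each object is merged into the most similar existing class when the trial-merge similarity reaches the lower limit b, and otherwise opens a new class.

```python
def rough_cluster(objects, m, b):
    """Cluster discretized objects by set similarity (Steps 7-9).
    objects: list of m-tuples of discretized attribute values;
    b: lower similarity limit of a class.
    Returns clusters as lists of object indices; the pruning of
    small (isolated-object) classes in Step 9 is left to the caller."""
    def sfd(objs):
        # fraction of the m attributes on which all objects agree
        return sum(len({o[j] for o in objs}) == 1 for j in range(m)) / m

    clusters = [[0]]                       # Step 7: first object seeds a class
    for i in range(1, len(objects)):
        # Step 8: similarity of each class after a trial merge with object i
        scores = [sfd([objects[j] for j in cl] + [objects[i]])
                  for cl in clusters]
        best = max(range(len(clusters)), key=lambda k: scores[k])
        if scores[best] >= b:
            clusters[best].append(i)       # merge into the most similar class
        else:
            clusters.append([i])           # open a new class
    return clusters
```

For example, `rough_cluster([(0, 0), (0, 0), (1, 1), (1, 1)], 2, 0.5)` groups the first two and the last two objects into separate classes, since each trial merge across the groups shares no attribute values.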
For convenience of illustrating the rough clustering algorithm, its flow chart is shown in Figure 1.
Flow chart of the rough clustering algorithm.
5. Simulation
The source information table is a decision table of concrete freezing resistance. In this table, the condition attributes a1, a2, a3, a4, a5 are all continuous, and their values are five test results that describe the condensability of the concrete; the single decision attribute is Class: a value of 1 means that the freezing resistance of the concrete is good, and a value of 0 means that it is bad. The similarity threshold b of a class is set to 0.5.
Based on the attribute significance formula, attribute a3 is found to be redundant; after deleting it from the information table, we obtain the attribute reduction {a1,a2,a4,a5} of the source information table.
By applying the continuous-attribute discretization presented in this algorithm to the information table, we obtain the discretized decision table shown in Table 1.
Table 1: Decision table after discretization.

U    a1   a2   a4   a5   Class
1    0    0    0    0    1
2    0    0    1    1    1
3    0    0    0    0    1
4    0    0    0    0    1
5    0    1    0    0    1
6    0    0    0    0    1
7    0    0    0    0    1
8    0    0    0    0    1
9    0    1    0    1    0
10   1    1    0    0    0
11   1    1    1    1    0
12   1    1    1    1    0
13   1    1    1    0    0
14   0    0    1    0    0
15   1    1    0    0    0
16   0    0    0    1    0
Set up one set for each object, marked Xi(0), i∈{1,2,…,16}, respectively.
Combine X1(0) and X2(0); in the new set X1(0)∪X2(0), the set of attributes on which the data objects x1 and x2 take the same value is S={a1,a2}; from this, the similarity SFD(X1(0)∪X2(0)) of the set X1(0)∪X2(0) is
(8)SFD(X1(0)∪X2(0))=|S|/m=2/4=0.5.
Since the internal similarity of the combined set is not less than the lower similarity limit 0.5 of a class, X1(0) and X2(0) are combined into one set, marked as the initial class X1(1); the number c of initial classes is then 1.
Next, the sets X3(0), X4(0), and X1(1) are combined into one set; in this new set, the set of attributes on which the data objects x1, x2, x3, and x4 agree is S={a1,a2}, so the set similarity is SFD(X3(0)∪X4(0)∪X1(1))=|S|/m=2/4=0.5.
Since the internal similarity of the combined set is not less than the lower similarity limit 0.5, X1(0), X2(0), X3(0), and X4(0) are combined into one set, still marked as the initial class X1(1); the number c of initial classes remains 1. Next, X5(0) and X1(1) are combined into a new set; in this new set, the set of attributes on which x1, x2, x3, x4, and x5 agree is S={a1}, so the set similarity is SFD(X5(0)∪X1(1))=|S|/m=1/4=0.25. Since this is less than the lower similarity limit 0.5, X5(0) becomes a new initial class, marked X2(1), and the number of classes c becomes 2. We then calculate SFD(X6(0)∪Xk(1)) for k∈{1,2,…,c} and seek i0 such that SFD(X6(0)∪Xi0(1))=max_{k∈{1,2,…,c}} SFD(X6(0)∪Xk(1)).
If SFD(X6(0)∪Xi0(1)) is not less than the lower similarity limit 0.5, then X6(0) and Xi0(1) are combined into one set, still marked Xi0(1) after updating; if SFD(X6(0)∪Xi0(1)) is less than 0.5, then X6(0) becomes a new initial class, marked Xc+1(1), and c=c+1. The same operations are carried out in turn for the remaining Xi(0), until the final initial classes {x14}, {x1,x2,x3,x4,x6,x7,x8,x16}, {x5,x9,x15}, and {x10,x11,x12,x13} are obtained.
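The similarity values 0.5 and 0.25 computed above can be reproduced directly from the rows of Table 1. A small check script (the function name is our own):

```python
# Rows x1..x5 of Table 1, attributes (a1, a2, a4, a5)
rows = {
    1: (0, 0, 0, 0), 2: (0, 0, 1, 1), 3: (0, 0, 0, 0),
    4: (0, 0, 0, 0), 5: (0, 1, 0, 0),
}
m = 4

def sfd(objs):
    """Set similarity (Definition 4): fraction of the m attributes
    on which all objects in the set take the same value."""
    return sum(len({o[j] for o in objs}) == 1 for j in range(m)) / m

print(sfd([rows[1], rows[2]]))                   # 0.5, as in (8)
print(sfd([rows[i] for i in (1, 2, 3, 4)]))      # 0.5
print(sfd([rows[i] for i in (1, 2, 3, 4, 5)]))   # 0.25
```

The last value drops to 0.25 because x5 differs on a2, leaving a1 as the only attribute shared by all five objects, which is why X5(0) opens a new class.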
6. Conclusions
From the clustering result, we can see that only the data object x5 was wrongly assigned to a different class; the clustering results of the other data objects completely accord with the known class labels. From this numerical example, we can find that the clustering algorithm presented in this paper has the following advantages:
because the data are pretreated with rough set theory in this clustering algorithm, the data structure is simplified, the algorithm is simple to implement, and the cluster quality is also improved;
this clustering algorithm is not affected by the distributional characteristics of the data space, and isolated objects can be eliminated based on the set eigenvalues; in the example presented, the data object x14 is an isolated object;
because set characteristic vectors are the operands in this clustering algorithm, and the clustering and dividing of the data objects can be finished by scanning the information table only once, this algorithm is efficient.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
This work was supported in part by the Polish-Norwegian Research Programme (Project no. Pol-Nor/200957/47/2013). The authors highly appreciate this financial support.
References
[1] C. Bean and C. Kambhampati, "Autonomous clustering using rough set theory."
[2] T. B. Ho and N. B. Nguyen, "Nonhierarchical document clustering based on a tolerance rough set model."
[3] P. Lingras, A. Elagamy, A. Ammar, and Z. Elouedi, "Iterative meta-clustering through granular hierarchy of supermarket customers and products."
[4] C. L. Ngo and H. S. Nguyen, "A tolerance rough set approach to clustering web search results."
[5] A. Singh, "Grid fuzzy clustering: tongue diagnosis."
[6] W. Song and S. C. Park, "Latent semantic analysis for vector space expansion and fuzzy logic-based genetic clustering."
[7] D.-R. Yu, Q.-H. Hu, and W. Bao, "Combining rough set methodology and fuzzy clustering for knowledge discovery from quantitative data."
[8] S. Yin, S. X. Ding, A. Haghani, H. Hao, and P. Zhang, "A comparison study of basic data-driven fault diagnosis and process monitoring methods on the benchmark Tennessee Eastman process."
[9] S. Yin, H. Luo, and S. Ding, "Real-time implementation of fault-tolerant control systems with performance optimization."
[10] X. Zhao, L. Zhang, P. Shi, and H. Karimi, "Novel stability criteria for T-S fuzzy systems."
[11] X. Zhao, L. Zhang, P. Shi, and H. Karimi, "Robust control of continuous-time systems with state-dependent uncertainties and its application to electronic circuits."
[12] F. Li, M. Ye, and X. Chen, "An extension to rough c-means clustering based on decision-theoretic rough sets model."
[13] P. Lingras, "Rough set clustering for web mining," in Proceedings of the IEEE International Conference on Fuzzy Systems (FUZZ '02), May 2002, pp. 1039–1044.
[14] P. Lingras, "Applications of rough set based k-means, Kohonen SOM, GA clustering."
[15] S. K. Pal and P. Mitra, "Multispectral image segmentation using the rough-set-initialized EM algorithm."
[16] F. Questier, I. Arnaut-Rollier, B. Walczak, and D. L. Massart, "Application of rough set theory to feature selection for unsupervised clustering."
[17] S. Yin, G. Wang, and H. R. Karimi, "Data-driven design of robust fault detection system for wind turbines."
[18] S. Yin, X. Yang, and H. R. Karimi, "Data-driven adaptive observer for fault diagnosis."
[19] S. Yin, S. X. Ding, A. H. A. Sari, and H. Hao, "Data-driven monitoring for stochastic systems and its application on batch process."