Mining frequent item sets (FIs) is an important problem in data mining. Considering the limitations of exact algorithms and sampling methods, a novel FI mining algorithm based on granular computing and fuzzy set theory (FI-GF) is proposed, which mines datasets with a large number of transactions more efficiently. First, the granularity is applied, which compresses the transactions into granules to reduce the scanning cost. During the granularity, each granule is represented by a fuzzy set, and the transaction scale represented by a granule is optimized. Then, fuzzy set theory is used to compute the supports of item sets based on those granules, which handles the uncertainty brought by the granularity and ensures the accuracy of the final results. Finally, Apriori is applied to obtain the FIs based on those granules and the new way of computing supports. On five datasets, FI-GF is compared with the original Apriori to demonstrate its reliability and efficiency and with a representative progressive sampling method, RC-SS, to demonstrate the advantage of the granularity over sampling. Results show that FI-GF not only saves the time spent scanning transactions but also is highly reliable. Meanwhile, the granularity has advantages over progressive sampling methods.
1. Introduction
Frequent item sets (FIs) contain items that appear together in a dataset with a frequency above a specified minimum support [1, 2]. For instance, given a dataset containing the sales records of a store, if products A and B are often bought together by customers, and the frequency of this pattern exceeds the minimum support, {A,B} can be seen as an FI. FIs can be classified into quantitative FIs (QFIs) and binary FIs (BFIs). A QFI is mined from a quantitative dataset, where every item takes a range of values. A BFI is mined from a binary dataset, where every item has only two states, presence or absence. For example, given a dataset with four items, A, B, C, and D, a QFI may look like A∈(0~1), B∈(−1~0.5), C∈(−2~1.5), D∈(−1~3.5), and a BFI may look like {A,C} or {B,C,D}. In most cases, however, quantitative datasets are first transformed into binary datasets by discretization before FIs are mined, so this paper mainly focuses on FI mining from binary datasets.
FI mining is the cornerstone of association rule mining and many other fields. Many algorithms have been presented in this area, of which Apriori and FP-growth are the two most famous. Apriori is the most widely applied algorithm [3]; it mines all the FIs from a dataset in several loops. In the kth loop, Apriori mines FIs of length k: the FIs obtained in the (k−1)th loop are first combined to form the candidate FIs of the kth loop, and Apriori then scans the dataset to check every candidate and remove those that are not frequent. The most serious problem of Apriori is the repeated scanning of the dataset. If the transaction scale of a dataset is large, mining becomes slow. Therefore, another algorithm, FP-growth, was proposed to solve this problem [4]. In FP-growth, the whole dataset is first transformed into a complex data structure, and the algorithm scans only this data structure rather than the original dataset, which saves much time. However, problems remain. FP-growth consumes much memory to store this large data structure, and the transformation from dataset to data structure is also very complex.
Furthermore, many other algorithms have been proposed in this field [1, 5]; they improve Apriori and FP-growth by removing the minimum support, using dynamically allocated memory, improving the data structure, and so on, but none of them solves the problem brought by a large transaction scale well.
Examining the above algorithms, we find that all of them are exact algorithms, whose aim is to mine the exact results. However, in some applications of FI mining, accuracy is much less important than speed. Therefore, a feasible way to solve the problem brought by scanning is to sacrifice some accuracy of the results in exchange for a faster algorithm [6].
Following this idea, some sampling methods have been presented [7–11], in which the original dataset is first sampled randomly, and the algorithm then mines the results from those samples rather than from the whole dataset, which can greatly cut the cost. However, this solution brings another problem. Although the core idea of sampling is to gain speed by sacrificing some accuracy, the loss of accuracy should still be kept within a reasonable range. Generally, there are three ways to achieve this goal: setting a bound on the sample size, estimating the accuracy of the results, and progressive sampling.
The bound on the sample size is the smallest or the largest sample size, which states the safe range of the sample size. Such bounds are usually obtained from mathematical derivations and hypotheses, such as a certain distribution of the values of items. For example, Zaki and Chakaravarthy both studied bounds on the sample size with their teams [7, 8]. However, because datasets differ, the mathematical derivations and hypotheses used to obtain the bound do not always apply, so the risk of setting a wrong bound is high.
Unlike setting a bound on the sample size, another method estimates the probability that every FI mined from the sample is also an FI of the original dataset. For example, Toivonen studied a method which builds a candidate set of FIs with these probabilities based on the sample size [9]. Nevertheless, the estimated accuracy only offers a reference to the user and does not give a way to reduce the error probability.
Different from the above two ways, progressive sampling, in which the sample size keeps changing until a stop condition is satisfied, not only provides a reliable way to control the loss of accuracy but also avoids many mathematical hypotheses. For example, Parthasarathy presents a progressive method which keeps increasing or decreasing the sample size until the stopping conditions are satisfied [10]. Chen and his colleagues proposed a two-phase method which progressively builds an appropriate sample [11]. However, the difficulty of these progressive sampling methods is also apparent: it is hard to build an appropriate stop condition. A simple stop condition cannot ensure the accuracy of the final results well, and a complex stop condition may consume too much time and too many resources.
Considering the limits of sampling, it is necessary to come up with a new method which not only reduces the cost of scanning the dataset but also has an efficient mechanism to ensure the accuracy of the results, where efficient means that the mechanism is simple to compute, is of high quality, and is suitable for different datasets.
Therefore, a novel FI mining algorithm called FI-GF, in which the transaction scale is cut by the granularity and the support is computed by fuzzy set theory, is proposed in this paper.
The definitions of granular computing vary [12]. Generally, it is a way to solve a problem at different levels through different kinds of information granules. In most cases, a granule is a collection of data, which can be built by clustering, partition, and so on and can be denoted by ordinary sets, fuzzy sets, rough sets, and so on [13, 14]. In FI-GF, a granule is built by partitioning the set of transactions and is represented by a fuzzy set.
A fuzzy set describes uncertainty: every element has a membership degree that represents the degree to which it is contained in the set [15]. Fuzzy set theory is widely applied to FI mining from datasets with uncertainty [16–19]. Because the granule is denoted by a fuzzy set and the granularity may bring uncertainty, fuzzy set theory is incorporated in FI-GF.
In short, the newly proposed algorithm, FI-GF, has the following innovations and advantages:
(1) The granularity is used to cut the transaction scale by compressing transactions into granules, and it is more efficient than progressive sampling.
(2) Fuzzy sets are used to denote granules, and a method to calculate the supports of item sets based on those granules is designed, which helps the algorithm deal well with the uncertainty brought by the transaction reduction.
2. Basic Concepts
2.1. Frequent Item Sets Mining
A formal specification of the problem is as follows. Let $I=\{i_1,i_2,\ldots,i_m\}$ be a set of $m$ distinct items. A dataset $D=\{t_1,t_2,\ldots,t_n\}$ is a set of $n$ transactions, in which $t_i\subseteq I$. A set $X\subseteq I$ is called an item set. $|X|$, called the length of $X$, is the number of items in $X$. Item sets whose length is $l$ are called $l$-item-sets. The support of $X$, denoted by $sup(X)$, is the fraction of transactions in $D$ that contain $X$.
X is an FI if and only if sup(X) is larger than a specified minimum support, denoted by min_sup, which indicates that the presence of X is significant. Given min_sup, the goal of FI mining is to enumerate all the item sets whose supports are higher than min_sup [20–22].
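The definitions above can be illustrated with a minimal Python sketch. The function names and the toy dataset are hypothetical and only serve the illustration; transactions and item sets are modeled as frozensets.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], dataset: List[FrozenSet[str]]) -> float:
    """Fraction of transactions in `dataset` that contain every item of `itemset`."""
    if not dataset:
        return 0.0
    hits = sum(1 for t in dataset if itemset <= t)  # `<=` is the subset test
    return hits / len(dataset)


def is_frequent(itemset: FrozenSet[str], dataset: List[FrozenSet[str]],
                min_sup: float) -> bool:
    """X is an FI iff sup(X) is larger than min_sup (strict, as in the text)."""
    return support(itemset, dataset) > min_sup


# Hypothetical toy dataset with four transactions over the items A, B, C, D.
D = [frozenset("ABC"), frozenset("AB"), frozenset("BC"), frozenset("ABD")]
sup_ab = support(frozenset("AB"), D)  # {A,B} is contained in 3 of 4 transactions
```

Every call to `support` scans the whole dataset, which is the cost that grows with the number of transactions.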
2.2. Granular Computing
Given a problem or a system $Sys=\{e_1,e_2,\ldots,e_n\}$, where $e$ is the basic element of $Sys$, granular computing can be described by
$$\{G_1,G_2,\ldots,G_m\}=\mathrm{Granularity}(Sys),\qquad R=\mathrm{Process}(G_1,G_2,\ldots,G_m),\tag{1}$$
where the granularity can be a partition, a clustering, or another process applied to $Sys$, whose core is to abstract or define the original problem or system at a different level and to study the problem at this new level. $G$ is the granule, which is the basic element of the problem or system at the new level. In every computation, $G$ is the smallest unit, which means that the details inside it are ignored after the granularity. In most cases, a granule represents a collection of $e$, and a granule can be defined as an ordinary set, a fuzzy set, a rough set, and so on. The process based on $G$ varies according to the definition of $G$. $R$ is the set of results or answers of $Sys$ [23–25].
2.3. Fuzzy Set
The fuzzy set is a widely applied tool and theory, proposed to represent uncertain concepts. A fuzzy set $F$ of a reference set $U$ is identified by a membership function, denoted by $F(u)$ [15]. For all $u\in U$, $F(u)$ shows the degree to which $u$ is contained in $F$, and $U$ is the universal set. Generally, $F(u)\in[0,1]$. An ordinary set $S$ can be regarded as a fuzzy set with the membership function $S(u)\in\{0,1\}$. The length of a fuzzy set $F$ is
$$|F|=\sum_{u\in U}F(u).\tag{2}$$
To operate on fuzzy sets in a formal way, fuzzy set theory offers several operators, of which two of the most popular are as follows [15, 16]:
$$(F\cap G)(u)=F(u)\wedge G(u),\qquad (F\cup G)(u)=F(u)\vee G(u),\tag{3}$$
where $a\wedge b=\min(a,b)$ and $a\vee b=\max(a,b)$.
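As a small illustration of (2) and (3), the sketch below models a fuzzy set as a Python dictionary from elements to membership degrees, with missing elements treated as degree 0. The representation and the function names are assumptions made for this example only.

```python
from typing import Dict

FuzzySet = Dict[str, float]  # element -> membership degree in [0, 1]


def fuzzy_and(F: FuzzySet, G: FuzzySet) -> FuzzySet:
    """Intersection: pointwise minimum of the membership degrees."""
    return {u: min(F.get(u, 0.0), G.get(u, 0.0)) for u in set(F) | set(G)}


def fuzzy_or(F: FuzzySet, G: FuzzySet) -> FuzzySet:
    """Union: pointwise maximum of the membership degrees."""
    return {u: max(F.get(u, 0.0), G.get(u, 0.0)) for u in set(F) | set(G)}


def length(F: FuzzySet) -> float:
    """Length |F| of a fuzzy set: the sum of its membership degrees."""
    return sum(F.values())


# Hypothetical fuzzy sets over the reference set {a, b, c}.
F = {"a": 0.2, "b": 0.9}
G = {"a": 0.5, "b": 0.4, "c": 1.0}
F_and_G = fuzzy_and(F, G)
F_or_G = fuzzy_or(F, G)
```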
3. The Proposed Algorithm FI-GF
Given a dataset $D=\{t_1,t_2,\ldots,t_n\}$, Figure 1 is the simple working diagram of FI-GF. The whole process can be divided into two main parts, the granularity part and the mining part. In the granularity part, the dataset is scanned and partitioned from beginning to end, and meanwhile, transactions which are neighboring and similar are collected and represented by granules. Every granule in FI-GF is represented by a fuzzy set, and the reference set of these fuzzy sets, denoted by $I$, is the universal set of items of the dataset $D$. In the mining part, the Apriori algorithm is used to mine FIs from those granules, and the support of every item set is computed through fuzzy set theory.
Figure 1: The simple working diagram of FI-GF.
3.1. The Collection of Transactions
As Figure 2 shows, the core idea of the granularity is to compress the transaction scale. During the granularity, the dataset is scanned, and transactions which are similar and neighboring are collected and represented by a fuzzy set.
Figure 2: The granularity of transactions.
Therefore, the first problem which should be solved is how to partition the dataset and collect transactions.
Given a collection of transactions $g$, which is used to build a granule $G_g$, the evaluation of $g$ can be based on two factors of $G_g$ [14].
(1) The coverage of granule $G_g$, denoted by $cov(G_g)$, is the capability of $G_g$ to contain information. Generally, it is an increasing function of the data scale represented by $G_g$. In FI-GF, it is defined as
$$cov(G_g)=-e^{-|g|}+1.\tag{4}$$
This factor is easily understood. The purpose of the granularity is to save the time spent scanning transactions, so the more transactions a granule represents, the more time and resources are saved.
(2) The specificity of granule $G_g$, denoted by $sp(G_g)$, is the capability of $G_g$ to represent information precisely. The higher $sp(G_g)$ is, the more precisely $G_g$ represents the information in $g$. In this paper, it is defined as
$$sp(G_g)=e^{-dif(g)},\tag{5}$$
where $dif(g)$ is the difference among the transactions in $g$, calculated as
$$dif(g)=\sum_{t_j\in g}HD(t_j,t_{j-1}),\tag{6}$$
where $HD(t_j,t_{j-1})$ is the Hamming distance between $t_j$ and $t_{j-1}$. Obviously, $sp(G_g)$ is a decreasing function of the data scale in $g$.
The second factor is also important. Since the granularity is used to compress the transactions, it will certainly destroy and ignore some original information of transactions. The more transactions a granule has, the more differences among transactions are ignored. This factor is to evaluate the degree to which the granule can preserve the original information.
However, in most cases, these two factors restrict each other, and an appropriate granule is supposed not only to cover more data but also to represent information more precisely. Therefore, the collection of transactions can be formulated as the following optimization problem:
$$\max\; cov(G_g)\cdot sp(G_g).\tag{7}$$
Furthermore, the weights of coverage and specificity are not always the same, so two parameters are introduced to change the optimization problem into (8), where α and β control the importance of coverage and specificity. If α is higher, coverage carries more weight and the capability to represent information precisely is relatively disregarded, and vice versa. Consider
$$\max\; Col(G_g),\qquad Col(G_g)=\left(-e^{-|g|\cdot\alpha}+1\right)\cdot e^{-\beta\cdot\sum_{t_j\in g}HD(t_j,t_{j-1})}.\tag{8}$$
On the other hand, α and β have another purpose, which is to control the average transaction scale of a granule. Although the transaction scales of different granules are determined by the balance of coverage and specificity, their average can be steered to an expected value. Suppose the average Hamming distance between two transactions in a dataset is $AHD$; then (8) can be approximated by
$$Col(G_g)=\left(-e^{-|g|\cdot\alpha}+1\right)\cdot e^{-\beta\cdot|g|\cdot AHD}.\tag{9}$$
Then, the expected $|g|$ at which $Col$ reaches its peak can be estimated by setting the derivative of (9) with respect to $|g|$ to zero, which gives
$$|g|=-\frac{1}{\alpha}\cdot\ln\frac{\beta\cdot AHD}{\alpha+\beta\cdot AHD},\tag{10}$$
and $AHD$ can be estimated from a sample of transactions. Therefore, by controlling the values of α and β, the average transaction scale contained in a granule can also be controlled. Generally, if both α and β become smaller, the average transaction scale of a granule becomes larger, and if both α and β become larger, the average scale becomes smaller, which can be deduced from the above approximation.
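The closed form (10) can be checked numerically against the approximation (9). The following Python sketch does so under stated assumptions: α and β are the Mushroom values from Table 2, while the value of AHD is hypothetical and chosen only for illustration.

```python
import math


def col_approx(n: int, alpha: float, beta: float, ahd: float) -> float:
    """Approximate Col of a granule holding n transactions, as in (9)."""
    return (1.0 - math.exp(-alpha * n)) * math.exp(-beta * n * ahd)


def expected_size(alpha: float, beta: float, ahd: float) -> float:
    """Granule size at which (9) peaks, as given by (10)."""
    return -(1.0 / alpha) * math.log(beta * ahd / (alpha + beta * ahd))


alpha, beta = 0.0347, 0.0022  # Mushroom parameters from Table 2
ahd = 10.0                    # hypothetical average Hamming distance

n_star = expected_size(alpha, beta, ahd)
# Brute-force search for the integer size that maximizes (9).
n_best = max(range(1, 1000), key=lambda n: col_approx(n, alpha, beta, ahd))
```

Under these assumptions the brute-force maximizer `n_best` lies within one transaction of `n_star`, and halving both α and β doubles `n_star`, in line with the remark that smaller parameters give larger granules.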
Generally, as $|g|$ grows, the curves of $cov(G_g)$, $sp(G_g)$, and $Col(G_g)$ look like Figure 3. At first, the granule contains little information, and the differences among its transactions are relatively few, so sp is high and cov is low. Then, as the transaction scale grows, the granule contains more information while bringing in more differences, so cov increases but sp decreases. For Col, at the beginning, the capability to contain more information and the capability to represent the information precisely are not balanced, so Col is low initially. Then, as the transaction scale increases, the balance of cov and sp improves, so Col increases too. However, after the peak of Col, the information in the granule is sufficient, but the differences among transactions make it hard for the granule to represent the information precisely, so the balance of cov and sp worsens, and Col starts to decrease. This phenomenon is verified in Experiment 1 on five datasets.
Figure 3: An example of the curves of cov, sp, and Col.
To sum up, the stop condition of a collection is whether the current value of $Col(G_g)$ is near the optimal value. Because $Col(G_g)$ starts decreasing after it passes the peak, the stop condition of a collection is designed as whether the following inequality is satisfied:
$$Stop(G_g)\le\sigma,\tag{11}$$
where
$$Stop(G_g)=Col(G_g)-Col\left(G_{g-\{t_{|g|},t_{|g|-1},\ldots,t_{|g|-q}\}}\right).\tag{12}$$
Here $g-\{t_{|g|},t_{|g|-1},\ldots,t_{|g|-q}\}$ is the collection of transactions obtained by removing $\{t_{|g|},t_{|g|-1},\ldots,t_{|g|-q}\}$ from $g$, and $t_i$ is the $i$th transaction put into $g$. Therefore, $Stop(G_g)$ evaluates the growth rate of Col. Parameter σ is the value that helps decide whether Col is near the peak. $Stop(G_g)\le\sigma$ means that the growth rate of Col is low enough and Col is probably near the peak, so this collection can be stopped. Parameter q is used to protect the calculation of the growth rate of Col from fluctuation, which is brought by the fluctuation of $HD(t_j,t_{j-1})$. The growth rate of Col is not a smooth curve. If q is high, the effect of fluctuation is reduced to a low level, but the algorithm responds more slowly to the arrival of the peak, and vice versa.
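A minimal Python sketch of (8) and (12) may make the stop condition concrete. It assumes that transactions are sets of items, that the Hamming distance between two transactions is the size of their symmetric difference, and that the sum in dif(g) runs over consecutive pairs inside g; the function names are hypothetical.

```python
import math
from typing import List, Set


def hamming(a: Set[int], b: Set[int]) -> int:
    """Hamming distance of two binary transactions, via the symmetric difference."""
    return len(a ^ b)


def col(g: List[Set[int]], alpha: float, beta: float) -> float:
    """Col of the collection g as in (8); an empty collection scores 0."""
    if not g:
        return 0.0
    dif = sum(hamming(g[j], g[j - 1]) for j in range(1, len(g)))
    return (1.0 - math.exp(-alpha * len(g))) * math.exp(-beta * dif)


def stop(g: List[Set[int]], alpha: float, beta: float, q: int) -> float:
    """Stop(G_g) as in (12): Col of g minus Col of g without its last q+1 transactions."""
    return col(g, alpha, beta) - col(g[:-(q + 1)], alpha, beta)


# Hypothetical collections: identical transactions vs. a sudden change at the end.
similar = [{1, 2}, {1, 2}, {1, 2}]
changed = [{1, 2}, {1, 2}, {3, 4, 5, 6}]
growth_similar = stop(similar, alpha=0.5, beta=0.5, q=0)
growth_changed = stop(changed, alpha=0.5, beta=0.5, q=0)
```

With σ=0, the first collection would keep growing because Col is still rising, while the second would be closed because the dissimilar transaction makes Col drop.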
3.2. Representing the Collection by a Granule
After the partition of the dataset, we need a way to represent the collections of transactions. As shown in Figure 4, FI-GF uses a fuzzy set to do this job, whose form, based on Zadeh's notation [15], is
$$G_g=\sum_{i\in I}\frac{G_g(i)}{i},\tag{13}$$
in which $I$ is the universal set of items of the dataset $D$ and $G_g(i)$ is the membership degree of item $i$ in $G_g$. In effect, $G_g$ can be seen as a simplification of the collection $g$, and it is used to reduce the cost of scanning the dataset.
Figure 4: The process of making a collection into a granule.
The problem that emerges here is how to determine $G_g(i)$ for every item in $I$. The fuzzy statistic test is used to do this job in FI-GF [26].
If the granule $G_g$ is regarded as a concept, a transaction $t\in g$ can be seen as a sample of $G_g$. Then, the membership degree of an item $i$ in $G_g$ can be determined by
$$G_g(i)=\frac{\sum_{t\in g}incl(i,t)}{|g|},\tag{14}$$
where $incl(a,b)=1$ if $a\subseteq b$ and $incl(a,b)=0$ otherwise.
For example, given a granule $G_{g'}$, where $g'$ is the collection of transactions represented by $G_{g'}$, if $g'=\{t_1=\{i_1,i_3,i_4\},\,t_2=\{i_1,i_2\},\,t_3=\{i_2,i_4\}\}$, we have
$$incl(i_1,t_1)=1,\quad incl(i_1,t_2)=1,\quad incl(i_1,t_3)=0,\tag{15}$$
so $G_{g'}(i_1)=(1+1+0)/3=2/3$, and so on. Therefore,
$$G_{g'}=\frac{G_{g'}(i_1)}{i_1}+\frac{G_{g'}(i_2)}{i_2}+\frac{G_{g'}(i_3)}{i_3}+\frac{G_{g'}(i_4)}{i_4}\approx\frac{0.67}{i_1}+\frac{0.67}{i_2}+\frac{0.33}{i_3}+\frac{0.67}{i_4}.\tag{16}$$
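The worked example can be reproduced with a few lines of Python. This is only a sketch of (14): the function name is hypothetical, and a granule is stored as a dictionary from items to membership degrees.

```python
from typing import Dict, Iterable, List, Set


def membership(g: List[Set[str]], items: Iterable[str]) -> Dict[str, float]:
    """Membership degree of every item as in (14): the share of transactions in g containing it."""
    n = len(g)
    return {i: sum(1 for t in g if i in t) / n for i in items}


# The collection g' from the example in the text.
g_prime = [{"i1", "i3", "i4"}, {"i1", "i2"}, {"i2", "i4"}]
G = membership(g_prime, ["i1", "i2", "i3", "i4"])
rounded = {i: round(d, 2) for i, d in G.items()}
```

Rounded to two decimals, the degrees agree with (16).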
To sum up, the pseudocode of the granularity is shown as Algorithm 1.
Algorithm 1: The granularity.
Input: D={t1,t2,...,tn}, α, β, σ, and q.
Output: GD, the set of granules, and ND, the set of the scale of data represented by every granule.
GD=⌀, ND=⌀, g=⌀.
For j=1→|D|
    g=g∪{tj}.
    If Stop(Gg)≤σ
        Using (14) to transform g to Gg.
        GD=GD∪{Gg}, ND=ND∪{|g|}, g=⌀.
    End If
End For
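One possible Python rendering of Algorithm 1 is sketched below. It is not the authors' Matlab code. It assumes that transactions are sets of items, that the Hamming distance is the size of the symmetric difference, and that GD and ND are kept as parallel lists; the function name `granulate` is hypothetical. To avoid recomputing Col from scratch, the sketch stores the Col value of every prefix of the current collection, so Stop is a difference of two stored values.

```python
import math
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def granulate(D: Sequence[Set[Hashable]], items: Sequence[Hashable],
              alpha: float, beta: float, sigma: float, q: int
              ) -> Tuple[List[Dict[Hashable, float]], List[int]]:
    """Compress the transactions of D into fuzzy granules (sketch of Algorithm 1)."""
    GD: List[Dict[Hashable, float]] = []  # granules: item -> membership degree
    ND: List[int] = []                    # number of transactions behind each granule
    g: List[Set[Hashable]] = []
    dif = 0                               # running sum of Hamming distances inside g
    history = [0.0]                       # history[n] = Col of the first n transactions of g
    for t in D:
        if g:
            dif += len(t ^ g[-1])
        g.append(t)
        history.append((1.0 - math.exp(-alpha * len(g))) * math.exp(-beta * dif))
        # Col of g without its last q+1 transactions; 0 if nothing is left.
        previous = history[max(0, len(g) - (q + 1))]
        if history[-1] - previous <= sigma:                 # Stop(G_g) <= sigma
            GD.append({i: sum(i in s for s in g) / len(g) for i in items})  # (14)
            ND.append(len(g))
            g, dif, history = [], 0, [0.0]
    return GD, ND


# Hypothetical data: three similar transactions followed by a different block.
D = [{1, 2}, {1, 2}, {1, 2}, {7, 8, 9}, {7, 8, 9}, {7, 8, 9}]
GD, ND = granulate(D, items=[1, 2, 7, 8, 9], alpha=0.5, beta=0.5, sigma=0.0, q=0)
```

As in the pseudocode, transactions still held in g when the loop ends are not turned into a granule; whether to emit a final partial granule is a design choice left open here.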
3.3. The Calculation of Supports Based on Granules
After the granularity, transactions are compressed to a set of granules GD, and much time, which is the cost to scan the transactions, is saved. Next, how to calculate the supports of item sets through those granules should be discussed.
Since a granule in FI-GF is a fuzzy set, the calculation of supports from the theory of fuzzy association rules could be used [19], in which, given a set of fuzzy sets $\mathit{GD}$ and an item set $X\subseteq I$, the support of $X$ is defined as
$$Fsup(X)=\frac{\sum_{G_g\in \mathit{GD}}\left(G_g(i_1)\wedge\cdots\wedge G_g(i_{|X|})\right)}{|\mathit{GD}|},\tag{17}$$
where $i_1,\ldots,i_{|X|}\in X$.
However, if (17) is applied in FI-GF, two defects will emerge. Firstly, Fsup(X) only depends on the item of X with the minimum degree of membership, and the effects of others are ignored. Secondly, in Fsup(X), the contributions of all the elements in GD are the same, but the transaction scales represented by different granules are variant. Therefore, another method should be proposed to calculate supports through GD.
Before the granularity, the contribution of a transaction $t$ to $sup(X)$ depends on whether $t$ covers $X$, so the contribution of a granule to $sup(X)$ should be the degree to which the granule covers $X$. Following this thought, we first define the degree to which a granule $G$ covers a fuzzy set $F$ as
$$Fincl(F,G)=\frac{\sum_{u\in U}\left(F(u)\wedge G(u)\right)}{\sum_{u\in U}F(u)},\tag{18}$$
where $U$ is the reference set. Then, the contribution of a granule $G_g$ to $sup(X)$ can be defined as $Fincl(X_\gamma,G_g)$, where $X_\gamma$ is a fuzzy set whose reference set is $I$, with $X_\gamma(i)=\gamma$ if $i\in X$ and $X_\gamma(i)=0$ otherwise. The parameter γ determines how strictly the degree to which $G_g$ covers $X$ is evaluated. If γ is larger, the evaluation is stricter, and vice versa.
For example, consider the granule
$$G_{g'}=\frac{0.67}{i_1}+\frac{0.67}{i_2}+\frac{0.33}{i_3}+\frac{0.67}{i_4}\tag{19}$$
and the item set $X'=\{i_1,i_3,i_4\}$.
If $\gamma=0.8$, we get
$$X'_\gamma=\frac{0.8}{i_1}+\frac{0}{i_2}+\frac{0.8}{i_3}+\frac{0.8}{i_4},\qquad Fincl(X'_\gamma,G_{g'})=\frac{0.67+0+0.33+0.67}{0.8+0+0.8+0.8}\approx0.70.\tag{20}$$
However, if $\gamma=0.4$, we get
$$Fincl(X'_\gamma,G_{g'})=\frac{0.4+0+0.33+0.4}{0.4+0+0.4+0.4}\approx0.94.\tag{21}$$
Obviously, $Fincl(X'_\gamma,G_{g'})$ rises when γ becomes smaller. The meaning of $Fincl(X_\gamma,G_g)$ is shown as the blue area in Figure 5.
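The two values in (20) and (21) can be reproduced with a short Python sketch of (18). Fuzzy sets are stored as dictionaries, and the helper `x_gamma`, which builds the fuzzy set for an item set and a given γ, is a hypothetical name introduced for this illustration.

```python
from typing import Dict, Iterable, Set


def fincl(F: Dict[str, float], G: Dict[str, float]) -> float:
    """Degree to which G covers F, as in (18); returns 0 if F is empty."""
    denominator = sum(F.values())
    if denominator == 0:
        return 0.0
    return sum(min(f_u, G.get(u, 0.0)) for u, f_u in F.items()) / denominator


def x_gamma(X: Set[str], items: Iterable[str], gamma: float) -> Dict[str, float]:
    """Fuzzy set with degree gamma for the items of X and 0 for all other items."""
    return {i: (gamma if i in X else 0.0) for i in items}


items = ["i1", "i2", "i3", "i4"]
G = {"i1": 0.67, "i2": 0.67, "i3": 0.33, "i4": 0.67}  # the granule from (19)
X = {"i1", "i3", "i4"}

strict = fincl(x_gamma(X, items, 0.8), G)   # gamma = 0.8, as in (20)
lenient = fincl(x_gamma(X, items, 0.4), G)  # gamma = 0.4, as in (21)
```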
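Figure 5: The contribution of $G_g$ to $sup(X)$.
Finally, the support of an item set $X$ based on $\mathit{GD}$ is defined as
$$Gsup(X)=\frac{\sum_{G_g\in \mathit{GD}}n_{G_g}\cdot Fincl(X_\gamma,G_g)}{|D|},\tag{22}$$
where $n_{G_g}\in \mathit{ND}$ is the number of transactions represented by $G_g$.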
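For example, suppose a dataset has only two granules, $G_{g_1}$ and $G_{g_2}$, where $|g_1|=3$ and $|g_2|=2$, and
$$G_{g_1}=\frac{0.67}{i_1}+\frac{0.67}{i_2}+\frac{0.33}{i_3}+\frac{0.67}{i_4},\qquad G_{g_2}=\frac{1}{i_1}+\frac{0.5}{i_2}+\frac{0.5}{i_3}+\frac{0.5}{i_4}.\tag{23}$$
For the item set $X'=\{i_1,i_3,i_4\}$ and $\gamma=0.8$, we get $Fincl(X'_\gamma,G_{g_1})\approx0.70$ and $Fincl(X'_\gamma,G_{g_2})\approx0.75$. So, the support of $X'$ calculated from $G_{g_1}$ and $G_{g_2}$ is
$$Gsup(X')=\frac{|g_1|\cdot Fincl(X'_\gamma,G_{g_1})+|g_2|\cdot Fincl(X'_\gamma,G_{g_2})}{|g_1|+|g_2|}=\frac{3\cdot0.7+2\cdot0.75}{5}=0.72.\tag{24}$$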
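The following Python sketch evaluates (22) for the two granules of this example. It assumes a non-empty item set and γ>0, takes |D| to be the sum of the granule sizes unless told otherwise, and uses the hypothetical function name `gsup`.

```python
from typing import Dict, Optional, Sequence, Set


def gsup(X: Set[str], granules: Sequence[Dict[str, float]], sizes: Sequence[int],
         gamma: float, n_transactions: Optional[int] = None) -> float:
    """Granule-based support as in (22); assumes X is non-empty and gamma > 0."""
    if n_transactions is None:
        n_transactions = sum(sizes)  # |D| when every transaction is in some granule
    total = 0.0
    for G, n in zip(granules, sizes):
        covered = sum(min(gamma, G.get(i, 0.0)) for i in X)  # numerator of Fincl
        total += n * covered / (gamma * len(X))              # denominator is gamma * |X|
    return total / n_transactions


G1 = {"i1": 0.67, "i2": 0.67, "i3": 0.33, "i4": 0.67}
G2 = {"i1": 1.0, "i2": 0.5, "i3": 0.5, "i4": 0.5}
value = gsup({"i1", "i3", "i4"}, [G1, G2], [3, 2], gamma=0.8)
```

Without the intermediate rounding of Fincl to 0.70, the result is 0.7175, which agrees with the 0.72 of (24) to two decimals.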
3.4. The Dynamical Minimum Support
After the granularity, Apriori is used to mine FIs based on those granules and (22). To reduce the candidate results, a dynamical minimum support, denoted by $min\_sup\cdot(k/l)^{\mu}$, is designed to replace the original one, where $l$ is the length of the item sets to be mined, $k$ is the current loop number of Apriori, and $\mu$ controls the change rate of $min\_sup\cdot(k/l)^{\mu}$.
First, if the target FIs are shorter, $min\_sup\cdot(k/l)^{\mu}$ is relatively large, because shorter item sets are generally more frequent than longer ones. Second, if the loop number of Apriori is higher, $min\_sup\cdot(k/l)^{\mu}$ also becomes larger, because a higher loop number leads to longer candidates and more complex computation, which means that more useless candidates should be removed. In these ways, the dynamical minimum support can efficiently reduce the cost of Apriori.
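A tiny Python sketch can show how such a threshold behaves. It assumes that the dynamical minimum support is the product of min_sup and (k/l) raised to the power μ; this reading is an assumption chosen because it matches the behavior described above (larger for shorter targets and for later loops, and equal to min_sup in the last loop). The function name is hypothetical.

```python
def dynamic_min_sup(min_sup: float, k: int, l: int, mu: float) -> float:
    """Threshold for loop k when mining l-item-sets, ASSUMING min_sup * (k / l) ** mu."""
    return min_sup * (k / l) ** mu


# Kosarak settings from Table 3 (min_sup = 0.1, mu = 0.5) with l = 3 as in Experiment 2.
thresholds = [dynamic_min_sup(0.1, k, 3, 0.5) for k in (1, 2, 3)]
```

Under this assumption the threshold rises from loop to loop and reaches min_sup when k=l.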
3.5. The Whole Pseudocode of FI-GF
In summary, the pseudocode of FI-GF, based on the granularity, Gsup(X), Apriori, and the dynamical minimum support, is shown as Algorithm 2.
Algorithm 2: FI-GF.
Input: D, α, β, σ, q, γ, μ, l, and min_sup.
Output: The set of frequent l-item-sets FI.
[GD,ND] = granularity(D,α,β,σ,q) //Algorithm 1
L1={i∈I ∣ Gsup({i})≥min_sup·(1/l)^μ}
For k=2→l
    Lk=⌀
    Ck=apriori_gen(Lk-1)
    For each X∈Ck
        If Gsup(X)≥min_sup·(k/l)^μ
            Lk=Lk∪{X}
        End if
    End for
End for
FI=Ll
Procedure apriori_gen(Lk-1)
Ck=⌀
For each l1,l2∈Lk-1
    If HD(l1,l2)=1
        c=l1∪l2
        If each (k-1)-subset s of c satisfies s∈Lk-1
            Ck=Ck∪{c}
        End if
    End if
End for
Return Ck
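The level-wise part of Algorithm 2 can be sketched in Python as follows. The sketch is an illustration under assumptions, not the authors' implementation: the condition HD(l1,l2)=1 is read as "the two item sets differ in exactly one item", so that their union has k items; the support function and the per-loop threshold are passed in as callables (hypothetical parameters `sup` and `threshold`), so either Gsup or the exact support can be plugged in.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Set

ItemSet = FrozenSet[str]


def apriori_gen(L_prev: Set[ItemSet]) -> Set[ItemSet]:
    """Join (k-1)-item-sets into k-item-set candidates and prune by the subset rule."""
    candidates: Set[ItemSet] = set()
    for a, b in combinations(L_prev, 2):
        c = a | b
        if len(c) != len(a) + 1:  # a and b must differ in exactly one item
            continue
        if all(frozenset(s) in L_prev for s in combinations(c, len(a))):
            candidates.add(c)
    return candidates


def mine(items: Iterable[str], sup: Callable[[ItemSet], float],
         threshold: Callable[[int], float], l: int) -> Set[ItemSet]:
    """Return the l-item-sets whose support reaches the threshold of every loop."""
    L = {frozenset([i]) for i in items if sup(frozenset([i])) >= threshold(1)}
    for k in range(2, l + 1):
        L = {X for X in apriori_gen(L) if sup(X) >= threshold(k)}
    return L


# Hypothetical toy run with the exact support and a constant threshold.
D = [frozenset("ABC"), frozenset("ABC"), frozenset("AB"), frozenset("C")]
exact_sup = lambda X: sum(X <= t for t in D) / len(D)
result = mine("ABC", exact_sup, lambda k: 0.5, l=3)
```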
4. The Experiments and Discussions
Several datasets are used to evaluate FI-GF. They are kosarak, T10I4D100K, retail, connect, and mushroom, which were downloaded from http://fimi.ua.ac.be/data/ and are shown in Table 1. All experiments were run in Matlab under Windows XP on a computer with a 2.4 GHz CPU and 2.92 GB of RAM. All algorithms were coded in the m language.
Table 1: Datasets used in experiments.

| Dataset | Number of items | Number of transactions |
| --- | --- | --- |
| Kosarak | 41270 | 990002 |
| T10I4D100K | 1000 | 100000 |
| Retail | 16470 | 88162 |
| Connect | 130 | 67557 |
| Mushroom | 120 | 8124 |
Experiment 1.
In Experiment 1, the method of the granularity in FI-GF, which is Algorithm 1, is applied to the datasets in Table 1. The purpose of this experiment is to display the working principle of the granularity and to help us understand what happens when the granules are being generated.
Table 2 shows the parameter values of Algorithm 1 in Experiment 1, and Figures 6 and 7 exhibit the details of the granularity.
Figure 6 describes how the curves of cov, sp, and Col change when the transaction scales represented by those first granules in the datasets of Table 1 are growing. Cov, sp, and Col, respectively, describe the capability of a granule to contain more information, the capability of a granule to precisely represent information, and the balance of cov and sp. For every subfigure in Figure 6, the horizontal axis is the transaction scale represented by the first granule, and the vertical axis is the value of cov, sp, and Col.
Figure 7 describes the transaction scale of all the granules in the datasets of Table 1 after the granularity. For every subfigure, the horizontal axis represents the sequence numbers of granules, and the vertical axis represents the transaction scale of every granule.
By analyzing Figures 6 and 7, several phenomena can be observed and explained.
(1) According to Figure 6, when the transaction scale represented by a granule increases, cov increases and sp decreases. This can be explained qualitatively as follows. To begin with, the more transactions a granule represents, the more information it contains, so cov increases with the transaction scale. Then, because of the differences among transactions, the more transactions a granule represents, the more difficult it is for the granule to represent the information precisely, so sp decreases as the transaction scale goes up. The quantitative explanation follows from the definitions of cov and sp in (4) and (5). The function $cov(G_g)=-e^{-|g|}+1$ increases with $|g|$, and the function $sp(G_g)=e^{-dif(g)}$ decreases with $|g|$, where $|g|$ is the transaction scale represented by the granule $G_g$.
(2) Furthermore, according to Figure 6, we can see that every curve of Col goes up at the beginning and decreases after the peak. At the beginning, when the transaction scale is very low, although the granule can precisely represent those transactions, the information contained in it is too little. Therefore, when the transaction scale increases, Col goes up. Then, after the transaction scale grows to a certain level, the amount of information is enough. At the same time, the differences among transactions become significant, so Col begins to go down.
(3) According to Figures 6 and 7, we can know that the curves of cov, sp, and Col are different in different datasets, and the distributions of the transaction scales which are represented by granules are also different in different datasets. This phenomenon is caused by the diversity of difference in transactions from different datasets. The larger difference in transactions makes sp fall faster and the Col get to the peak earlier, and vice versa. On the other hand, α and β, which, respectively, decide the weight of cov and sp, also affect the shapes of curves. When α increases, the capability to contain more information becomes more important and cov grows faster, and vice versa. Meanwhile, if β increases, sp drops faster, and Col will get to the peak earlier.
(4) According to Figure 7, it is obvious that, for every dataset, the transaction scales represented by granules vary. This is caused by the variance of the differences among transactions in different parts of a dataset. A large difference results in a faster fall of sp and an earlier peak of Col, which limits the transaction scale, and vice versa.
To sum up, according to Figures 6 and 7, for a granule, the capability to contain more information and the capability to represent the information precisely restrict each other. Meanwhile, when the granularity is applied to different datasets or to different parts of the same dataset, the changes of cov, sp, and Col and the transaction scales of granules are all different. This is caused not only by the diversity of the differences among transactions in different datasets and in different parts of the same dataset but also by the setting of α and β in Table 2, which decides the weights of cov and sp.
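Table 2: Parameters of FI-GF in Experiment 1.

| Dataset | α | β | σ | q |
| --- | --- | --- | --- | --- |
| Kosarak | 0.00034 | 0.000024 | 0 | 500 |
| T10I4D100K | 0.0035 | 0.00018 | 0 | 10 |
| Retail | 0.0069 | 0.00049 | 0 | 5 |
| Connect | 0.0069 | 0.0012 | 0 | 5 |
| Mushroom | 0.0347 | 0.0022 | 0 | 5 |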
Figure 6: The curves of Col, cov, and sp when FI-GF is building the first granules of the datasets in Table 1. Panels: Kosarak, T10I4D100K, Retail, Connect, Mushroom.
Figure 7: The numbers of transactions of granules after the granularities of the datasets in Table 1. Panels: Kosarak, T10I4D100K, Retail, Connect, Mushroom.
Experiment 2.
In Experiment 2, FI-GF is used to mine FIs from the datasets in Table 1 and is compared with the original Apriori to test its reliability and efficiency. To be fair, the dynamical minimum support, $min\_sup\cdot(k/l)^{\mu}$, is also applied to Apriori. The parameters of the granularity in Experiment 2 are the same as in Table 2. In addition, l=3, and the other parameters are shown in Table 3.
In Figure 8, the times taken by Apriori and by FI-GF under γ=1 are recorded in a histogram. Parameter γ represents how strictly the degree to which a granule contains an item set is evaluated.
Table 3 describes the results mined by Apriori and FI-GF, where min_sup and μ, respectively, represent the base number and the parameter controlling the change rate of $min\_sup\cdot(k/l)^{\mu}$. In Table 3, four groups of results are recorded: the results generated by FI-GF when γ is set to 0.9, 0.95, and 1, respectively, and the results generated by the original Apriori.
From Figure 8 and Table 3, the following observations can be made.
(1) According to Figure 8, FI-GF saves a lot of time compared with Apriori. For almost every dataset, FI-GF is at least twice as efficient as Apriori. This advantage is mainly caused by the granularity in FI-GF. To mine FIs from a dataset, Apriori has to scan the dataset repeatedly. If the transaction scale of a dataset is very large, too much time is consumed. FI-GF, however, first puts all the transactions into granules, and the algorithm then only needs to scan the granules rather than the original transactions, so FI-GF is more efficient.
(2) According to Table 3, the FIs mined by the original Apriori are always mined by FI-GF too. This indicates that FI-GF successfully handles the uncertainty brought by the granularity and the calculation of Gsup. The novel algorithm can ensure the reliability of the final results.
(3) Sometimes, the number of results mined by FI-GF is larger than the number of results mined by Apriori. This can be explained as follows. First, the contribution of a granule $G_g$ to the support of an item set $X$ is defined by $Fincl(X_\gamma,G_g)$. This formula describes the degree to which a granule contains $X$ rather than whether it contains $X$, so the criterion is lowered and more results are mined. Furthermore, how strictly the containment of $X$ in $G_g$ is evaluated can be adjusted by γ, which also affects the number of final results. In summary, the extra FIs in FI-GF are caused by the methods that handle the uncertainty brought by the granularity and ensure that the real FIs appear.
(4) Finally, when we check the results generated by FI-GF under γ=0.9, γ=0.95, and γ=1, we find that the higher γ is, the fewer results are mined. Because γ controls how strictly the coverage of an item set by a granule is evaluated, the higher γ is, the fewer results satisfy this criterion, and vice versa.
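Table 3: The frequent item sets mined by FI-GF and Apriori.
| Dataset | min_sup | μ | FI-GF, γ=0.9 | FI-GF, γ=0.95 | FI-GF, γ=1 | Apriori |
|---|---|---|---|---|---|---|
| Kosarak | 0.1 | 0.5 | {1, 3, 6} | {1, 3, 6} | {1, 3, 6} | {1, 3, 6} |
| | | | {1, 3, 11} | {1, 3, 11} | {1, 3, 11} | {1, 3, 11} |
| | | | {1, 6, 11} | {1, 6, 11} | {1, 6, 11} | {1, 6, 11} |
| | | | {3, 6, 11} | {3, 6, 11} | {3, 6, 11} | {3, 6, 11} |
| T10I4D100K | 0.0007 | 0.88 | {354, 368, 529} | {368, 529, 766} | {368, 529, 829} | {368, 529, 829} |
| | | | {354, 368, 722} | {368, 529, 829} | | |
| | | | {354, 368, 766} | {368, 766, 829} | | |
| | | | {354, 368, 829} | {529, 766, 829} | | |
| | | | {354, 529, 722} | | | |
| | | | {354, 529, 766} | | | |
| | | | {354, 529, 829} | | | |
| | | | {354, 722, 766} | | | |
| | | | {354, 722, 829} | | | |
| | | | {354, 766, 829} | | | |
| | | | {368, 529, 722} | | | |
| | | | {368, 529, 766} | | | |
| | | | {368, 529, 829} | | | |
| | | | {368, 722, 766} | | | |
| | | | {368, 722, 829} | | | |
| | | | {368, 766, 829} | | | |
| | | | {529, 722, 766} | | | |
| | | | {529, 722, 829} | | | |
| | | | {529, 766, 829} | | | |
| | | | {722, 766, 829} | | | |
| Retail | 0.07 | 0.1 | {32, 38, 39} | {32, 38, 39} | {32, 38, 39} | {39, 41, 48} |
| | | | {32, 38, 41} | {32, 38, 41} | {32, 38, 41} | |
| | | | {32, 38, 48} | {32, 38, 48} | {32, 38, 48} | |
| | | | {32, 39, 41} | {32, 39, 41} | {32, 39, 41} | |
| | | | {32, 39, 48} | {32, 39, 48} | {32, 39, 48} | |
| | | | {32, 41, 48} | {32, 41, 48} | {32, 41, 48} | |
| | | | {38, 39, 41} | {38, 39, 41} | {38, 39, 41} | |
| | | | {38, 39, 48} | {38, 39, 48} | {38, 39, 48} | |
| | | | {38, 41, 48} | {38, 41, 48} | {38, 41, 48} | |
| | | | {39, 41, 48} | {39, 41, 48} | {39, 41, 48} | |
| Connect | 0.99 | 1 | {55, 75, 91} | {75, 91, 109} | {91, 109, 127} | {91, 109, 127} |
| | | | {55, 75, 109} | {75, 91, 127} | | |
| | | | {55, 75, 127} | {75, 109, 127} | | |
| | | | {55, 91, 109} | {91, 109, 127} | | |
| | | | {55, 91, 127} | | | |
| | | | {55, 109, 127} | | | |
| | | | {75, 91, 109} | | | |
| | | | {75, 91, 127} | | | |
| | | | {75, 109, 127} | | | |
| | | | {91, 109, 127} | | | |
| Mushroom | 0.6 | 1 | {34, 36, 85} | {34, 36, 85} | {34, 85, 86} | {34, 85, 86} |
| | | | {34, 36, 86} | {34, 36, 86} | {34, 85, 90} | {34, 85, 90} |
| | | | {34, 36, 90} | {34, 36, 90} | {34, 86, 90} | {34, 86, 90} |
| | | | {34, 85, 86} | {34, 85, 86} | {85, 86, 90} | {85, 86, 90} |
| | | | {34, 85, 90} | {34, 85, 90} | | |
| | | | {34, 86, 90} | {34, 86, 90} | | |
| | | | {36, 85, 86} | {36, 85, 86} | | |
| | | | {36, 85, 90} | {36, 85, 90} | | |
| | | | {36, 86, 90} | {36, 86, 90} | | |
| | | | {85, 86, 90} | {85, 86, 90} | | |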
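Observations (2) and (4) can be checked mechanically against Table 3. The following minimal Python sketch does so for the Connect rows only, under our reading that each result column lists all 3-item subsets of a small pool of items. The helper `triples` and all variable names are ours and are not part of FI-GF.

```python
from itertools import combinations


def triples(items):
    """All 3-item subsets of `items`, as a set of frozensets."""
    return {frozenset(c) for c in combinations(items, 3)}


# Connect rows of Table 3, written compactly (this compaction is our reading
# of the table, not a statement about how FI-GF produces its results).
figf = {
    0.90: triples([55, 75, 91, 109, 127]),  # 10 item sets
    0.95: triples([75, 91, 109, 127]),      # 4 item sets
    1.00: triples([91, 109, 127]),          # 1 item set
}
apriori = triples([91, 109, 127])

# Observation (2): every item set mined by Apriori is also mined by FI-GF.
covers_apriori = all(apriori <= found for found in figf.values())

# Observation (4): raising gamma never adds results.
gammas = sorted(figf)
shrinks = all(figf[hi] <= figf[lo] for lo, hi in zip(gammas, gammas[1:]))

counts = {g: len(figf[g]) for g in gammas}
print(covers_apriori, shrinks, counts)
```

The same check applies to the other datasets once their columns are entered in the same form.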
Figure 8: The time cost by Apriori and FI-GF.
Experiment 3.
In Experiment 3, the granularity of FI-GF (Algorithm 1) is compared with a classical and widely applied progressive sampling algorithm, RC-SS [10], to demonstrate its advantage.
RC-SS keeps increasing the sample size until the similarity between two consecutive samples reaches a high level. The similarity is evaluated through representative sets. A representative set consists of item sets mined by Apriori from the sample, each of which contains the most frequent FI of that sample. Given two representative sets R1 and R2, mined from two samples S1 and S2, if R1 and R2 are similar, S1 and S2 can also be regarded as similar. The similarity between two representative sets can be computed simply as follows:
$$\mathrm{Sm}(R_1, R_2) = \frac{\sum_{x \in R_1 \cap R_2} \max\{0,\ 1 - |\mathrm{sup}_1(x) - \mathrm{sup}_2(x)|\}}{|R_1 \cup R_2|}, \tag{25}$$
where sup1(x) is the support of item set x computed on the sample S1, and sup2(x) is defined likewise on S2. RC-SS also needs a minimum support, denoted by min_sup, which Apriori uses to generate the representative sets. Moreover, RC-SS needs a sampling step, which decides the growth rate of the sample size, and a lowest sample size, which ensures that the final sample will not be too small.
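As an illustration of (25), the following Python sketch computes the similarity between two representative sets. It assumes that each representative set is given as a dictionary from item sets to their supports in the corresponding sample, and that the difference of supports in (25) is taken in absolute value. The function name `similarity` and the convention of returning 0 for two empty sets are ours.

```python
def similarity(sup1, sup2):
    """Sm(R1, R2) as in (25).

    sup1, sup2: dicts mapping an item set (frozenset) to its support,
    measured on samples S1 and S2 respectively. The keys of each dict
    form the representative sets R1 and R2.
    """
    union = sup1.keys() | sup2.keys()
    if not union:
        return 0.0  # our convention for two empty representative sets
    shared = sup1.keys() & sup2.keys()
    total = sum(max(0.0, 1.0 - abs(sup1[x] - sup2[x])) for x in shared)
    return total / len(union)


# Toy example: one shared item set out of three distinct ones.
r1 = {frozenset("A"): 0.5, frozenset("AB"): 0.3}
r2 = {frozenset("A"): 0.4, frozenset("AC"): 0.2}
sm = similarity(r1, r2)  # (1 - 0.1) / 3, about 0.3
```

Only the shared item set contributes to the numerator, while every item set in either representative set counts in the denominator, so two samples score high only when they agree both on which item sets are frequent and on their supports.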
In this experiment, the representative set of a sample S is defined as the FIs in S that contain the most frequent item. For every dataset, the min_sup of RC-SS is set to 0.1, the sampling step to 100, and the lowest sample size to 300. Sampling stops when the 6 latest consecutive values of Sm(R1,R2) are all higher than 0.9. The parameters of the granularity are the same as in Table 2.
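The stopping rule above can be sketched in Python as follows. The sketch takes the sequence of similarity values as given and only decides when to stop. The function names are ours, and the mapping from a round to a sample size (first sample of the lowest size, one sampling step added per round) is an assumption about how RC-SS schedules its samples, not something stated in this section.

```python
def stopping_round(similarities, threshold=0.9, patience=6):
    """Return the 1-based round at which sampling stops, or None.

    Sampling stops as soon as the `patience` latest consecutive
    similarity values are all strictly higher than `threshold`.
    """
    run = 0
    for i, s in enumerate(similarities, start=1):
        run = run + 1 if s > threshold else 0
        if run >= patience:
            return i
    return None


def final_sample_size(round_index, lowest=300, step=100):
    """Assumed schedule: size `lowest` at the start, plus `step` per round."""
    return lowest + round_index * step


# A dip below the threshold resets the count of consecutive high values.
curve = [0.95, 0.95, 0.5] + [0.95] * 6
stop = stopping_round(curve)    # stops at round 9
size = final_sample_size(stop)  # 300 + 9 * 100 under the assumed schedule
```

Because a single low value resets the run, a learning curve that oscillates around 0.9 can keep RC-SS sampling for a long time.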
Figure 9 plots the time consumed by the granularity and by RC-SS, respectively. Figure 10 shows the transaction scales generated by RC-SS and the granule scales generated by the granularity. Figure 11 shows the learning curves of RC-SS, which describe how Sm(R1,R2) changes when RC-SS is applied to the datasets in Table 1.
From Figures 9–11, the following observations can be made.
(1) Figure 9 shows that, in most cases, the granularity takes less time than RC-SS. There are three reasons. First, RC-SS has to generate samples repeatedly until the similarity between two consecutive samples becomes stable and high. Second, RC-SS has to apply Apriori to mine the representative set from every sample, so the sampling process itself still spends much time scanning transactions. Third, computing the similarity in (25) adds further cost to RC-SS.
(2) However, Figure 9 also shows that on the dataset retail the granularity takes more time than RC-SS, and on the dataset kosarak the granularity has only a slight advantage in efficiency. This is caused by the way FI-GF builds its granules. To build a granule, FI-GF first builds a collection of transactions. To ensure that the information in a collection can be represented precisely, the differences within the collection are computed by (6), which requires the Hamming distance between two transactions. The longer the transactions are, the more expensive the Hamming distance is to compute. Because the average transaction lengths of retail and kosarak are both very long, the Hamming distances on them are slower to compute, and the granularity takes more time.
(3) Figure 10 shows that, in most cases, the number of granules built by the granularity is smaller than the number of transactions chosen by RC-SS. This can be explained as follows. As mentioned in Section 3.1, the average number of transactions represented by a granule can be controlled by α and β, and the more transactions a single granule represents, the fewer granules are generated. Therefore, if α and β are set reasonably, the granule scale can be kept at a low level, which is another advantage of the granularity: it not only makes the simplified dataset as precise as possible but also provides a tool to control the granule scale. Progressive sampling, in contrast, can only do the former, and its final sample size cannot be controlled. For example, Figure 10 shows that, even after sampling, the final sample size for the dataset connect is still close to 10300, which also costs too much time to scan.
(4) Figure 11 shows that the learning curves of RC-SS are not always smooth and convergent. On the datasets kosarak, retail, and mushroom, the learning curves converge well, but on the datasets T10I4D100K and connect, RC-SS does not perform well. The reason is as follows.
The stop condition of RC-SS in this experiment is that 6 consecutive values of Sm(R1,R2) are all higher than 0.9. The more elements R1∩R2 has, the higher Sm(R1,R2) will be. Meanwhile, the representative set of a sample consists of the FIs covering the most frequent item in the sample. If several items have similar, high frequencies, a small change in the sample may change their frequency ranking and thereby change the representative set substantially, so the similarity between the original sample and the slightly changed sample is low. Thus, the larger the variance of the item frequencies, the better RC-SS converges, and the different variances of item frequencies across the datasets in Table 1 lead to different convergence behavior. This reveals another disadvantage of RC-SS compared with FI-GF: RC-SS does not adapt well to every dataset.
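The effect of close item frequencies on the representative set can be made concrete with a small Python sketch. The supports below are invented toy values, and the helpers `fs` and `representative_set` are ours. The sketch only shows which item sets two consecutive samples share, since (25) sums over R1∩R2 and therefore gives Sm(R1,R2)=0 when nothing is shared.

```python
def fs(*items):
    """Shorthand for building an item set."""
    return frozenset(items)


def representative_set(supports):
    """Item sets of a sample that contain its most frequent single item."""
    singles = {s: v for s, v in supports.items() if len(s) == 1}
    (top,) = max(singles, key=singles.get)  # unpack the single item
    return {s for s in supports if top in s}


# Items a and b have nearly equal frequency: a tiny change flips the ranking.
close_s1 = {fs("a"): 0.50, fs("b"): 0.49, fs("a", "c"): 0.30, fs("b", "c"): 0.30}
close_s2 = {fs("a"): 0.49, fs("b"): 0.50, fs("a", "c"): 0.30, fs("b", "c"): 0.30}

# Item a clearly dominates: the same tiny change leaves the ranking intact.
spread_s1 = {fs("a"): 0.80, fs("b"): 0.40, fs("a", "c"): 0.30, fs("b", "c"): 0.30}
spread_s2 = {fs("a"): 0.79, fs("b"): 0.41, fs("a", "c"): 0.30, fs("b", "c"): 0.30}

shared_close = representative_set(close_s1) & representative_set(close_s2)
shared_spread = representative_set(spread_s1) & representative_set(spread_s2)
print(len(shared_close), len(shared_spread))
```

In the first pair the two representative sets are disjoint although the samples differ by only 0.01 in two supports, while in the second pair they coincide.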
Figure 9: The time consumed by RC-SS and the granularity when they are applied to the datasets in Table 1.
Figure 10: The number of transactions sampled by RC-SS and granules generated by the granularity.
Figure 11: The self-similarity of RC-SS when it is applied to the datasets in Table 1. Panels: Kosarak, T10I4D100K, Retail, Connect, Mushroom.
5. Conclusions
The following conclusions can be drawn:
(1) FI-GF is both efficient and reliable. Its granularity not only reduces the number of transactions to scan but also keeps every granule precise, and the support computed with fuzzy set theory successfully handles the uncertainty introduced by the granularity.
(2) In most cases, the granularity costs less time than RC-SS, a representative progressive sampling method. Furthermore, the granularity can control the final size of the simplified dataset to some extent and adapts to more datasets.
(3) The application of granular computing to FI mining has broad prospects. Future work is to reduce the heavy computation caused by long transactions.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
References
[1] Han J., Kamber M., Pei J.
[2] Luo K., Wang L.-L., Tong X.-J. Mining association rules in incomplete information systems.
[3] Agrawal R., Srikant R. Fast algorithms for mining association rules. Proceedings of the International Conference on Very Large Data Bases, 1994, Santiago de Chile, Chile, pp. 487–499.
[4] Han J., Pei J., Yin Y. Mining frequent patterns without candidate generation. Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD '00), May 2000, Dallas, Tex, USA, ACM, pp. 1–12. doi:10.1145/342009.335372.
[5] Zaki M. J. Scalable algorithms for association mining.
[6] Pietracaprina A., Riondato M., Upfal E., Vandin F. Mining top-K frequent itemsets through progressive sampling.
[7] Zaki M. J., Parthasarathy S., Li W., Ogihara M. Evaluation of sampling for data mining of association rules. Proceedings of the 7th International Workshop on Research Issues in Data Engineering (RIDE '97), April 1997, Birmingham, UK, IEEE, pp. 42–50. doi:10.1109/RIDE.1997.583696.
[8] Chakaravarthy V. T., Pandit V., Sabharwal Y. Analysis of sampling techniques for association rule mining. Proceedings of the 12th International Conference on Database Theory (ICDT '09), March 2009, Saint-Petersburg, Russia, ACM, pp. 276–283. doi:10.1145/1514894.1514927.
[9] Toivonen H. Sampling large databases for association rules. Proceedings of the 22nd International Conference on Very Large Data Bases (VLDB '96), September 1996, Mumbai, India, pp. 134–145.
[10] Parthasarathy S. Efficient progressive sampling for association rules. Proceedings of the 2nd IEEE International Conference on Data Mining (ICDM '02), December 2002, Maebashi, Japan, pp. 354–361. doi:10.1109/ICDM.2002.1183923.
[11] Chen B., Haas P., Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. Proceedings of the International Conference on Knowledge Discovery in Databases (KDD '02), 2002, pp. 462–468.
[12] Yao Y. Y. Granular computing: basic issues and possible solutions. Proceedings of the 5th Joint Conference on Information Sciences (JCIS '00), February-March 2000, Atlantic City, NJ, USA, pp. 186–189.
[13] Pedrycz W.
[14] Pedrycz W., Homenda W. Building the fundamentals of granular computing: a principle of justifiable granularity.
[15] Zadeh L. A. Fuzzy sets.
[16] Dubois D., Hüllermeier E., Prade H. A systematic approach to the assessment of fuzzy association rules.
[17] Hong T.-P., Lin K.-Y., Chien B.-C. Mining fuzzy multiple-level association rules from quantitative data.
[18] Pei B., Zhao S., Chen H., Zhou X., Chen D. FRAP: mining fuzzy association rules from a probabilistic quantitative database.
[19] Delgado M., Marín N., Sánchez D., Vila M.-A. Fuzzy association rules: general model and applications.
[20] Salam A., Khayal M. S. H. Mining top-k frequent patterns without minimum support threshold.
[21] Tzvetkov P., Yan X., Han J. TSP: mining top-k closed sequential patterns.
[22] Agrawal R., Imielinski T., Swami A. Mining association rules between sets of items in large databases. Proceedings of the ACM SIGMOD International Conference on Management of Data, May 1993, ACM, pp. 207–216.
[23] Pedrycz W. Granular computing-the emerging paradigm.
[24] Pedrycz W. From numeric models to granular system modeling.
[25] Bargiela A., Pedrycz W. The roots of granular computing. Proceedings of the IEEE International Conference on Granular Computing, May 2006, Atlanta, Ga, USA, pp. 806–809. doi:10.1109/GRC.2006.1635922.
[26] Wang P. Z. From the fuzzy statistics to the falling random subsets.