Association rule hiding plays a vital role in preserving sensitive knowledge when data are shared between enterprises. Its aim is to remove sensitive association rules from the released database while keeping side effects as low as possible. This research proposes an efficient algorithm for hiding a specified set of sensitive association rules based on the intersection lattice of frequent itemsets. We begin by analyzing the theory of the intersection lattice of frequent itemsets and its applicability to the association rule hiding problem. We then formulate two heuristics in order to (a) specify the victim items based on the characteristics of the intersection lattice of frequent itemsets and (b) identify transactions for data sanitization based on the weight of transactions. Next, we propose a new algorithm for hiding a specific set of sensitive association rules with minimum side effects and low complexity. Finally, experiments were carried out to assess the efficiency of the proposed approach. The results show that the proposed algorithm, AARHIL, achieved the lowest side effects and CPU time among the compared state-of-the-art approaches for hiding a specified set of sensitive association rules.
1. Introduction
Data mining has been recently applied in many areas of science and business, such as traffic accident detection [1], engineering asset health and reliability prediction [2], assessment of landslide susceptibility [3], enterprises [4], and supply chain management [5]. The discovery of association rules is one of the major techniques of data mining that extracts correlative patterns from large databases. Such rules create assets that organizations can use to expand their businesses, improve profitability, decrease supply chain costs, increase the efficiencies of collaborative product developments, and support more effective marketing [4, 5]. The competitive environment of global economy forces companies, who engage in the same business, to form an alliance for mutual benefits. In the collaboration, companies have to share information in order to shorten processing time dramatically, eliminate value-depleting activities, and improve quality, accuracy, and asset productivity [6]. However, due to legal constraints and/or competition among companies, they do not want to reveal their sensitive knowledge to other parties. Association rule hiding is an efficient solution that removes the sensitive association rules from the released database. Thus, the sensitive knowledge can be protected when sharing data between parties.
Many studies in the literature have focused on hiding sensitive association rules by reducing their support or confidence below given thresholds. Association rule hiding algorithms can be divided into three main classes [7], namely, border based [8, 9], exact [10, 11], and heuristic [12–22]. The border-based and exact approaches aim to protect the revised positive border of frequent itemsets in order to minimize side effects. Although these approaches achieve good results for itemset hiding, they are not well suited to minimizing the side effects when hiding a specific set of sensitive association rules. The heuristic approach does not guarantee a globally optimal solution, but it usually finds a solution close to the best one in a shorter response time. In 2012, Hai and Somjit [23] introduced a new direction for hiding a specific set of sensitive association rules, named intersection lattice based. This approach concentrates on formulating heuristics for specifying victim items and transactions for data sanitization based on intersection lattice theory.
This study proposes an improvement of the intersection lattice-based approach to association rule hiding [23–25]. We first present in detail the theory of the intersection lattice of frequent itemsets and prove that it is applicable to the association rule hiding problem. Subsequently, we formulate two heuristics for hiding sensitive association rules with the lowest side effects. The first heuristic determines the victim item that needs to be modified and focuses on maintaining itemsets in the generating set in order to restrict lost rules. The second heuristic assigns a weight to each transaction based on its degree of safety, the number of sensitive rules, and the number of nonsensitive association rules contained in that transaction. This study provides evidence that removing the victim item from the transactions that have the highest weight has minimal effect on the nonsensitive association rules and on the intersection lattice of frequent itemsets. An experiment is performed on a real dataset to show the performance of the proposed algorithm in a real application setting, as well as to compare it with previous studies.
The rest of this paper is organized as follows. Section 2 presents a brief review of previous works. The problem formulation is provided in Section 3. Section 4 introduces the basic concepts of lattice theory that are applied in this research. The proposed methodology is presented in Section 5. In Section 6, we present the experimental results in order to show the performance of the proposed approach compared with the state of the art approaches. The main contents presented in this study are concluded in Section 7.
2. Related Work
Association rule hiding approaches are now classified into four classes: heuristic, border based, exact, and intersection lattice based. The heuristic approach provides efficient and fast algorithms that select the appropriate transactions and items for hiding sensitive association rules using a distortion or blocking technique. The distortion technique adds (or removes) selected items of sensitive association rules to (or from) specified transactions, or adds dummy transactions [21], to decrease the support [8–14] or confidence [12, 13, 15–19] of the rules below the given thresholds in order to hide single or multiple rules [20]. Unlike distortion, the blocking technique hides a rule by replacing the existing value of some items with an unknown value so as to reduce the support or confidence of the rule [12, 20, 22].
The border-based approach for association rule hiding was first introduced by Sun and Yu [8]. This approach specifies the revised positive and negative borders of all frequent itemsets. It then focuses on the weight of the positive border [8] or the maxmin set [9] to reduce support of the revised negative border while protecting support of the expected positive border so as to maintain the nonsensitive itemsets.
The exact approach transforms association rule hiding into an optimization problem formulated as a Constraint Satisfaction Problem (CSP). Menon et al. [10] formulated the CSP to specify a minimum number of transactions that need to be modified in order to hide sensitive association rules. Gkoulalas-Divanis and Verykios [11] formulated the CSP based on the revised positive and negative borders to identify candidate items for the hiding process. In this approach, the authors used a process of constraint reduction so that all constraints in the CSP are linear and all variables are binary. This allows the use of binary integer programming instead of integer or linear programming to solve the CSP.
The intersection lattice approach for hiding a specific set of association rules was first introduced by Hai and Somjit [23]. The proposed algorithms, ILARH [23] and HCSRIL [24], aim to hide a specific set of sensitive rules in three steps. The first step specifies a set of itemsets satisfying three conditions: they (i) contain the right-hand side of the sensitive rule, (ii) are maximal sub-itemsets of a maximal itemset, and (iii) have minimal support among the sub-itemsets specified in (ii). An item in the right-hand side of the sensitive rule that is related to the specified maximal support itemset is identified as the victim item. In the second step, a set of transactions supporting the sensitive rule is specified. The third step removes the victim item from the specified transactions until the confidence of the rule is below the minimum confidence threshold. In order to reduce side effects, HCSRIL sorts the set of transactions supporting the sensitive rules in ascending order of their size before sanitizing them. Moreover, HCSRIL updates the released database such that the sanitization has the least impact on the generating set. However, a larger transaction may contain fewer nonsensitive association rules. Thus, sorting transactions based on their size is not enough to restrict the lost rules.
Hai et al. [25] assigned a weight to each transaction in order to measure the impacts of hiding process on the nonsensitive association rules. Moreover, the authors formulated the victim item specification based on the measurement of the distance from sensitive rules to the set of maximal itemsets and the nearest nonsensitive association rule. Modifying the victim item on the high-weight transaction can reduce side effects. On the negative side, the constraints between frequent itemsets are not identified in the distances. Thus, modifying the victim item may avoid impacts on some nonsensitive association rules, but it cannot protect the intersection lattice of frequent itemset from being broken. So it may cause more lost rules.
This research builds on the algorithms proposed in [23–25] and proposes an improvement for hiding a specific set of sensitive association rules with the lowest side effects and CPU time.
3. Problem Formulation
Let I={i1,i2,…,im} be a finite set of m literals. Each member of I is called an item. X is an itemset if X⊆I. A transaction t is defined by a set of items, namely, t={ik∣ik∈I,k≤m}. Let 𝒟 be a finite transaction database, namely, 𝒟={t1,t2,…,tn∣n∈N}. An itemset X⊂I is supported by a transaction t∈𝒟 if X⊆t. The frequency of an itemset X in database is support of X, denoted by α(X), and is defined as
$$\alpha(X) = |X(t)|, \quad \text{where } X(t) = \{t \in \mathcal{D} \mid t \text{ contains } X\}. \tag{1}$$
An itemset X is called a frequent itemset if α(X)≥σ, where σ is the minimum support threshold given by users.
An association rule is the implication X→Y, where X, Y⊂I, and X∩Y=∅.
The support of a rule X→Y is defined to be the support of itemset X∪Y, that is,
$$\alpha(X \to Y) = \alpha(X \cup Y). \tag{2}$$
The confidence of a rule X→Y is defined as
$$\beta(X \to Y) = \frac{\alpha(X \cup Y)}{\alpha(X)}. \tag{3}$$
Example 1.
Let a transaction database be given as in Table 1. Let minimum thresholds be given as σ=3 and δ=70%. Frequent itemsets mined from Table 1 are shown in Table 2, and strong association rules generated from the frequent itemsets are presented in Table 3.
Table 1: Transaction database.

| TID | Itemset |
|-----|---------|
| T1 | ABCD |
| T2 | ABC |
| T3 | ABD |
| T4 | BD |
| T5 | ABCD |
| T6 | AC |
| T7 | ABC |
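Table 2: Frequent itemsets.

| Frequent itemset | α |
|------------------|---|
| ABC | 4 |
| ABD | 3 |
| AB | 5 |
| AC | 5 |
| BC | 4 |
| AD | 3 |
| BD | 4 |
| A | 6 |
| B | 6 |
| C | 5 |
| D | 4 |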
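Table 3: Strong association rules.

| Rule | β |
|------|---|
| AB → C | 80% |
| C → AB | 80% |
| AC → B | 80% |
| BC → A | 100% |
| AD → B | 100% |
| BD → A | 75% |
| D → AB | 75% |
| A → B | 83% |
| B → A | 83% |
| A → C | 83% |
| C → A | 100% |
| C → B | 80% |
| D → A | 75% |
| D → B | 100% |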
Let σ and δ be the minimum support threshold and the minimum confidence threshold given by users. The association rule X→Y is the strong association rule if α(X→Y)≥σ and β(X→Y)≥δ.
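The definitions of support, confidence, and strong rules can be checked against Example 1 with a short script. The following is a minimal Python sketch, not part of the original paper: the names `TRANSACTIONS`, `support`, `frequent_itemsets`, and `strong_rules` are illustrative, and the brute-force enumeration is only practical for a toy database such as Table 1.

```python
from itertools import combinations

# Table 1 of the paper: transaction id -> items (one character per item).
TRANSACTIONS = {
    "T1": "ABCD", "T2": "ABC", "T3": "ABD", "T4": "BD",
    "T5": "ABCD", "T6": "AC", "T7": "ABC",
}

def support(itemset, transactions=TRANSACTIONS):
    """alpha(X): number of transactions that contain every item of X."""
    items = set(itemset)
    return sum(1 for t in transactions.values() if items <= set(t))

def frequent_itemsets(sigma, transactions=TRANSACTIONS):
    """Brute-force enumeration of non-empty itemsets with support >= sigma."""
    universe = sorted(set().union(*map(set, transactions.values())))
    result = {}
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            s = support(combo, transactions)
            if s >= sigma:
                result["".join(combo)] = s
    return result

def strong_rules(sigma, delta, transactions=TRANSACTIONS):
    """All rules X -> Y with support >= sigma and confidence >= delta."""
    rules = {}
    for itemset, s in frequent_itemsets(sigma, transactions).items():
        for size in range(1, len(itemset)):
            for lhs in combinations(itemset, size):
                rhs = "".join(i for i in itemset if i not in lhs)
                confidence = s / support(lhs, transactions)
                if confidence >= delta:
                    rules[("".join(lhs), rhs)] = confidence
    return rules

FREQUENT = frequent_itemsets(sigma=3)
RULES = strong_rules(sigma=3, delta=0.7)
```

With σ=3 and δ=70%, `FREQUENT` contains the 11 itemsets of Table 2 and `RULES` the 14 rules of Table 3, keyed as `(left-hand side, right-hand side)` pairs.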
Lemma 2 (Apriori property [26]).
Assume that X,Y⊆I. If X⊆Y, then α(X)≥α(Y).
The Apriori property shows that if an itemset X is frequent, then all itemsets in the family of subsets of X are frequent.
The association rules discovered from a large database that can be used in the decision-making support process are said to be sensitive association rules [14].
Definition 3 (sensitive association rules).
Let 𝒟 be a transactional database, R be a set of all association rules that are mined from 𝒟, and RulesH be a set of decision support rules that need to be hidden according to some security policies. A set of association rules, denoted by SR, is said to be sensitive if SR⊂R and SR would derive the set RulesH. ~SR is the set of nonsensitive association rules such that ~SR∪SR=R.
A sensitive association rule Il→Ir is hidden if α(Il→Ir)<σ or β(Il→Ir)<δ. The rule can be hidden by
(i) removing an item Ii∈IlIr from some transactions in order to make α(Il→Ir)<σ,
(ii) adding all items Ii∈Il to some transactions until β(Il→Ir)<δ, or
(iii) removing an item Ii∈Ir from some transactions until α(Il→Ir)<σ or β(Il→Ir)<δ.
The modifications of any item always cause, however, side effects which are the impacts of data modification on the quality of association rule mining, including lost rules, ghost rules, false rules, and accuracy.
Lost rule is a nonsensitive association rule that is discovered from the original database but cannot be mined from the released database.
Ghost rule is a nonsensitive association rule that cannot be discovered from the original database but can be mined from the released database.
False rule is the sensitive association rule that cannot be hidden by hiding process.
Accuracy is the ratio of distorted data items to total of data items in the original database.
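The three rule-based side effects reduce to set operations on the rule sets mined before and after sanitization. The sketch below is an illustration under assumptions: rules are represented as `(lhs, rhs)` tuples, the function name `side_effects` is hypothetical, and the example rule sets are invented for demonstration rather than taken from the paper's experiments. Accuracy is not covered here.

```python
def side_effects(original_rules, released_rules, sensitive_rules):
    """Classify rule-level side effects of a hiding process.

    original_rules : rules mined from the original database
    released_rules : rules mined from the released database
    sensitive_rules: rules that were supposed to be hidden
    """
    original = set(original_rules)
    released = set(released_rules)
    sensitive = set(sensitive_rules)
    nonsensitive = original - sensitive
    return {
        # Nonsensitive rules that disappeared after sanitization.
        "lost": nonsensitive - released,
        # Rules that only appear after sanitization.
        "ghost": released - original,
        # Sensitive rules that are still minable.
        "false": sensitive & released,
    }

# Invented example data, for illustration only.
example = side_effects(
    original_rules={("A", "B"), ("B", "A"), ("D", "AB")},
    released_rules={("A", "B"), ("D", "AB"), ("C", "D")},
    sensitive_rules={("D", "AB")},
)
```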
An association rule hiding algorithm is considered better than another if it produces lower side effects (fewer lost rules, ghost rules, and false rules, and higher accuracy) and has lower complexity.
The problem of association rule hiding addressed in this paper can be stated as follows.
Let a transaction database 𝒟, a minimum support threshold σ, and a minimum confidence threshold δ be given. Let us assume that R is a set of association rules mined from 𝒟, whose support and confidence are not less than σ and δ, respectively. Suppose that a set of certain association rules in R regarded as being sensitive, denoted by SR, can be specified. The problem is how to transform 𝒟 into a released database 𝒟’ in such a way that all sensitive association rules in SR are hidden, while nonsensitive association rules can still be mined from 𝒟’ and the side effects are minimal.
We apply method (iii) to a heuristic association rule hiding algorithm based on the intersection lattice of frequent itemsets in order to reduce the side effects.
4. Background
In this section, we recall some concepts in lattice theory that are applied in the present study. Lattice theory was developed by George Grätzer [27]. It singles out a special type of order for details of investigation. The basic concepts of lattice theory that are related to our research are presented as follows.
Let V be a nonempty set. A binary relation θ on V is said to be an order relation if θ satisfies the properties reflexivity, antisymmetry, and transitivity, namely,
reflexivity: aθa,
antisymmetry: aθb and bθa imply that a=b,
transitivity: aθb and bθc imply that aθc.
We usually use ≤ to denote an order and (V;≤) to denote an ordered set.
Let (V;≤) be an ordered set. An element a∈V is an upper bound of H⊆V if a majorizes all h∈H. An upper bound a of H is the least upper bound of H or supremum of H if a is majorized by all upper bounds of H. In this case, we will write a=supH.
The dual concepts of upper bound and least upper bound are the lower bound and the greatest lower bound, respectively, which are defined by duality. The greatest lower bound or the infimum of H is denoted by infH.
Definition 4 (lattice).
An ordered set (L;≤) is said to be a lattice if for all a,b∈L, sup{a,b} and inf{a,b} always exist and are denoted by a∨b and a∧b, respectively.
Definition 5 (semilattice).
Let (A; o) be an algebra with one binary operation o. The algebra (A; o) is a semilattice if o is idempotent, commutative, and associative.
An algebra (L;∧,∨) is said to be a lattice if L is a nonempty set, (L;∧) and (L;∨) are semilattices, and the two absorption identities are satisfied. A lattice as algebra and a lattice as an order are proved “equivalent” concepts [27].
Let U be a finite nonempty set. The power set of U, denoted by Poset(U), is an ordered set under the inclusion relation ⊆. It can be verified that (Poset(U);⊆) forms a lattice, where sup{A,B}=A∪B and inf{A,B}=A∩B. If L⊆Poset(U) and (L;⊆) is a lattice satisfying sup{A,B}=A∪B and inf{A,B}=A∩B for all A and B in L, then (L;⊆) is called a set lattice. Similarly, if the ordered set (L;⊆) is a semilattice under the intersection operation “∩” satisfying inf{A,B}=A∩B for all A and B in L, then (L;⊆) is said to be an intersection lattice.
5. The Proposed Approach for Association Rule Hiding Based on Intersection Lattice
In this section, we specifically introduce the intersection lattice theory applied in association rule hiding that was basically presented in [23–25]. Firstly, we analyze the characteristics of the intersection lattice of frequent itemsets. Then, we improve heuristics for minimizing the side effects of association rule hiding process. Finally, we propose an efficient algorithm for hiding a specific set of sensitive association rules.
5.1. Intersection Lattice of Frequent Itemsets
In this subsection, we formulate intersection lattice theory for the set of frequent itemsets and prove the applicability of this theory to association rule hiding. Let 𝒟 be a given transaction database on a finite set of items I and let σ be a given minimum support threshold. Consider the lattice (Poset(I);⊆) and the set P(σ) of frequent itemsets that are mined from 𝒟 and satisfy the given threshold σ; we have the following statements.
Theorem 6 (intersection lattice of frequent itemset).
Let 𝒟 be a given transaction database on a finite set of items I and σ be a given minimum support threshold. Then, (P(σ);⊆) forms an intersection lattice, denoted by L(𝒟,σ).
Proof.
For all X,Y∈P(σ), assume that Z=X∩Y; then we have Z⊆X. By Lemma 2, we have α(Z)≥α(X)≥σ, so Z∈P(σ). In other words, we have inf{X,Y}=X∩Y.
On the other hand, the ordered set (P(σ);⊆) is a semilattice under the intersection operator ∩. Indeed, for all X,Y∈P(σ), we always have the following.
(i) ∩ is idempotent because X∩X=X.
(ii) ∩ is commutative. Consider an arbitrary item x∈I. Then, by the definition of set intersection, we have
x∈X∩Y ⇔ x∈X∧x∈Y
⇔ x∈Y∧x∈X (by the commutativity of logical conjunction)
⇔ x∈Y∩X.
Hence, by universal generalization, every item that is in X∩Y is also in Y∩X, and conversely.
Hence, X∩Y=Y∩X.
(iii) ∩ is associative. Similar to (ii), we have (X∩Y)∩Z=X∩(Y∩Z).
In other words, the ordered set (P(σ);⊆) is a semilattice under the intersection operation such that for all X,Y∈P(σ), inf(X,Y)=X∩Y. Hence, (P(σ);⊆) is an intersection lattice.
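Theorem 6 can be checked numerically on the frequent itemsets of Table 2. The sketch below is an illustration, not the authors' code. It assumes that the empty itemset counts as frequent (it is supported by every transaction, although Table 2 does not list it), since the intersection of two disjoint frequent itemsets such as A and C is empty. The function name `is_intersection_closed` is hypothetical.

```python
from itertools import combinations

# Frequent itemsets of Table 2, as frozensets of single-character items.
FREQUENT = {frozenset(s) for s in
            ["ABC", "ABD", "AB", "AC", "BC", "AD", "BD", "A", "B", "C", "D"]}

def is_intersection_closed(family, include_empty=True):
    """True if the pairwise intersection of members is again a member.

    include_empty adds the empty itemset, which is trivially frequent
    (assumption: Table 2 omits it only by convention).
    """
    members = set(family)
    if include_empty:
        members.add(frozenset())
    return all((x & y) in members for x, y in combinations(members, 2))

CLOSED = is_intersection_closed(FREQUENT)
# A family that misses AB = ABC ∩ ABD is not closed.
NOT_CLOSED = is_intersection_closed({frozenset("ABC"), frozenset("ABD")})
```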
Definition 7 (the generating set).
The generating set of L(𝒟,σ), denoted by GL, is the smallest subset of L(𝒟,σ) such that each element of L(𝒟,σ) can be represented as the (finite) intersection of some elements of GL, namely,
$$L(\mathcal{D},\sigma) = \Bigl\{ X \Bigm| X = \bigcap_{k \in \mathbb{N}^*} X_k,\ X_k \in G_L \Bigr\}. \tag{4}$$
Definition 7 indicates that each element of L(𝒟,σ) can be generated by an intersection of a finite number of certain elements of GL.
Lemma 8.
For all X,Y,Z∈GL, if X≠Z, Y≠Z, and X≠Y, then X∩Y≠Z.
Proof.
It can easily be seen that the statement “X,Y,Z∈GL, X≠Z, Y≠Z, and X≠Y then X∩Y≠Z” is an immediate consequence of Definition 7. Since in the opposite case, Z=X∩Y, then GL∖{Z} is obviously also a generating set of L(𝒟,σ). This means that Z∉GL, a contradiction.
Theorem 9.
For every L(𝒟,σ), the set GL is unique.
Proof.
It is obvious that if P(σ)=∅, then GL=∅. Assume therefore that P(σ)≠∅. To establish Theorem 9, we prove the following two claims.
L(𝒟,σ) always contains a GL. For all X∈P(σ), we have for all X′∈Poset(X), X′∈P(σ) (Lemma 2). By Definition 7, for all X∈L(𝒟,σ), there is a finite number of itemsets Yk∈GL such that
$$X = \bigcap_{k \in \mathbb{N}^*} Y_k. \tag{5}$$
If k=1, then X=Yk; thus, we imply that X∈GL. By Lemma 8, if k≥2, then X∉GL and X is generated by an intersection of itemsets Yk∈GL. Hence, by universal generalization, for any itemset X∈L(𝒟,σ), there is a set GL such that either GL contains X or GL contains a finite set of itemsets which can generate X by taking an intersection of those itemsets. In other words, GL always exists for every intersection lattice L(𝒟,σ).
GL is unique in L(𝒟,σ). Assume that GL′ is the other generating set of L(𝒟,σ). We show that GL′=GL. First, we prove that GL⊆GL′. Indeed, take any X∈GL, by the definition of GL′,
$$X = \bigcap_{h \in \mathbb{N}^*} Y'_h, \tag{6}$$
for some sets Yh′∈GL′, which implies that X⊆Yh′. By Lemma 8, if X≠Yh′, then Yh′∉GL.
On the other hand, we have
$$Y'_h = \bigcap_{j \in \mathbb{N}^*} X'_j, \tag{7}$$
by the definition of GL. Consequently, we obtain the inclusion
$$X \subseteq Y'_h = \bigcap_{j \in \mathbb{N}^*} X'_j. \tag{8}$$
By Lemma 8, we infer that the set of indexes N* is single and, therefore, X=Yh′; therefore, X∈GL′, which shows that GL⊆GL′.
Similarly, we also have GL′⊆GL. In other words, GL′=GL.
Theorem 10.
The set GL is calculated as follows:
$$G_L = \{X \in L(\mathcal{D},\sigma) \mid d(X) \le 1\}, \quad \text{where } d(X) = |\{Y \in L(\mathcal{D},\sigma) \mid X \subset Y\}|. \tag{9}$$
Proof.
Let X be an itemset in L(𝒟,σ). Assume that X∈GL and d(X)≥2. Then, X can be generated by the intersection of some itemsets in GL, namely,
$$X = \bigcap_{k \in \mathbb{N}^*,\ k \ge 2} V_k, \quad \text{where each } V_k \in G_L. \tag{10}$$
By Lemma 8, X∉GL. This contradicts the assumption X∈GL. Therefore, if X∈GL, then d(X)≤1.
Example 11.
Let a transaction database 𝒟 be given as Table 1 and L(𝒟,σ) be computed as Table 2. The set GL can be computed by applying (9), namely, GL={ABC,ABD,AC,BC,AD,BD}.
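Example 11 can be reproduced directly from (9). The following Python sketch is illustrative only; the names `generating_set` and `is_generated` are not from the paper. It counts, for each frequent itemset, how many frequent proper supersets it has, and it also checks the defining property of Definition 7 on this example.

```python
# Frequent itemsets of Table 2 (the lattice L(D, sigma) for sigma = 3).
LATTICE = [frozenset(s) for s in
           ["ABC", "ABD", "AB", "AC", "BC", "AD", "BD", "A", "B", "C", "D"]]

def generating_set(lattice):
    """Equation (9): keep itemsets with at most one proper superset in the lattice."""
    lattice = list(lattice)

    def d(x):
        # Number of lattice elements that strictly contain x.
        return sum(1 for y in lattice if x < y)

    return {x for x in lattice if d(x) <= 1}

def is_generated(x, generators):
    """True if x equals the intersection of all generators that contain it."""
    containing = [g for g in generators if x <= g]
    return bool(containing) and frozenset.intersection(*containing) == x

GL = generating_set(LATTICE)
```

For instance, AB has two frequent supersets (ABC and ABD), so it is excluded from `GL` but is recovered as their intersection.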
Definition 12 (set of maximal elements).
An element Y of L(𝒟,σ) is said to be a maximal element if, for all X∈L(𝒟,σ), Y⊆X implies Y=X. The set of maximal elements of L(𝒟,σ) is denoted by MAX(L(𝒟,σ)).
Lemma 13 (the maximal set in the intersection lattice and generating set).
Given an intersection lattice L(𝒟,σ), then MAX(GL)=MAX(L(𝒟,σ)).
Proof.
Assuming that X∈MAX(GL), then X∈L(𝒟,σ). Let Y∈L(𝒟,σ); then
$$Y = \bigcap_{k \in \mathbb{N}^*} V_k, \quad \text{where each } V_k \in G_L. \tag{11}$$
Assuming that X⊆Y, we have X⊆Vk, where k∈N*. By Definition 12, X=Vk, k∈N*; hence, X=Y. Therefore, X∈MAX(L(𝒟,σ)). In other words, MAX(GL)⊆MAX(L(𝒟,σ)) (*).
Conversely, assuming that X∈MAX(L(𝒟,σ)), then
$$Y = \bigcap_{i \in \mathbb{N}^*} S_i, \quad \text{where each } S_i \in G_L. \tag{12}$$
Assuming that X⊆Y, we have X⊆Si, i∈N*. By Definition 12, X=Si, i∈N*. Thus, since Si∈GL, we have X∈GL. Then, we imply that X∈MAX(GL). In other words, MAX(L(𝒟,σ))⊆MAX(GL) (**).
By (*) and (**), we imply that MAX(GL)=MAX(L(𝒟,σ)).
Definition 14 (coatom).
Each item of L(𝒟,σ) is called an atom and each element of the set MAX(L(𝒟,σ)) is called a coatom of L(𝒟,σ). A set of all coatoms of L(𝒟,σ) is denoted by CL.
By Lemma 13 and Definition 14, we can infer the property of CL as follows.
Lemma 15 (characteristics of coatom in the intersection lattice).
For every intersection lattice L(𝒟,σ), one always has CL=MAX(GL).
The set CL of L(𝒟,σ) can be calculated by applying Theorem 10 to find GL and then taking the maximal itemsets of GL.
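For the running example, the coatoms can be obtained as the maximal elements of the generating set. The sketch below is illustrative (the helper name `maximal_elements` is an assumption); it also checks Lemma 13 on this example by comparing against the maximal elements of the whole lattice.

```python
LATTICE = {frozenset(s) for s in
           ["ABC", "ABD", "AB", "AC", "BC", "AD", "BD", "A", "B", "C", "D"]}
# Generating set of Example 11.
GL = {frozenset(s) for s in ["ABC", "ABD", "AC", "BC", "AD", "BD"]}

def maximal_elements(family):
    """Members of the family that have no proper superset in the family."""
    members = set(family)
    return {x for x in members if not any(x < y for y in members)}

CL = maximal_elements(GL)

# Every frequent itemset is contained in some coatom.
COVERED = all(any(x <= c for c in CL) for x in LATTICE)
```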
Lemma 16.
For each itemset X∈L(𝒟,σ), Poset(X) forms a lattice and Poset(X)∖{X} has a generating set, denoted by GX, including itemsets in MAX(Poset(X)∖{X}).
Proof.
By Lemma 2, if X∈L(𝒟,σ) and Y⊆X, then Y∈L(𝒟,σ). It is obvious that (Poset(X); ⊆) is a lattice and MAX(Poset(X)∖{X})={X∖{Ik}∣IkisanitemofX}. Moreover, for item Ik∈X, k=1,2,…|X|, every itemset formed by X∖{Ik} lacks only one item, so it has a unique containing itemset in Poset(X) and it has no containing itemset in Poset(X)∖{X}. All remaining subsets in Poset(X) lack more than one item, so they have at least two containing itemsets in Poset(X)∖{X}. By Definition 7, GX includes itemsets in MAX(Poset(X)∖{X}).
For example, for the itemset X=ABE, Poset(ABE) forms a lattice and GABE={AB,AE,BE}.
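The set GX described in Lemma 16 consists of the subsets of X obtained by dropping exactly one item. A minimal Python sketch (the function name is illustrative):

```python
def generating_set_of_subsets(x):
    """G_X: all subsets of X that lack exactly one item of X."""
    items = frozenset(x)
    return {items - {i} for i in items}

G_ABE = generating_set_of_subsets("ABE")
```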
Lemma 17.
If every itemset of CL is not hidden, then no itemset in L(𝒟,σ) is hidden.
Proof.
By Lemma 15, for all X∈L(𝒟,σ), there exists Y∈CL such that X⊆Y, so α(X)≥α(Y) (Lemma 2). Since α(Y)≥σ, we have α(X)≥σ.
In order to hide a sensitive association rule, this study focuses on decreasing the support and confidence of the rule by removing an item belonging to its right-hand side. However, the modification of an item always affects some itemsets in L(𝒟,σ). By (2) and (3), when the support of an itemset is reduced by modifying some items, the support and confidence of association rules that contain these items will change. This may cause those rules to be hidden. Moreover, when an itemset is hidden, all association rules generated from this itemset are also hidden. If the hidden rules are not sensitive rules, then they are lost rules. An efficient way to reduce lost rules is therefore to prevent itemsets in L(𝒟,σ) from being hidden.
By Definition 7, each itemset of intersection lattice L(𝒟,σ) can be created by an intersection of some itemsets in GL. Lemma 17 indicates that all itemsets in L(𝒟,σ) are still frequent if every itemset in CL is maintained. The generating set GL and coatom set CL therefore need to be protected from the hiding process in order to maintain L(𝒟,σ). It is possible to propose a heuristic that hides sensitive association rules with lower side effects based on GL and CL maintenance.
5.2. The Heuristics for Minimizing Side Effects of Association Rule Hiding Algorithm
In this research, we apply method (iii) to hide the rule Il→Ir by removing an item belonging to Ir from some transactions that support the rule until α(Il→Ir)<σ or β(Il→Ir)<δ. The impacts of the hiding process on L(𝒟,σ) depend on the item and transactions selection for the data modifications [24]. This study proposes an efficient improvement of the intersection lattice approach [23–25] based on two heuristics for minimizing the side effects of association rule hiding process. In this study, we prove the correctness and efficiency of the heuristic for specifying victim item that was presented in [23, 24] and propose an improvement heuristic for specifying transactions [25]. These heuristics are presented as follows.
Heuristic 1 (specifying victim item for data modifications).
For each item Ii∈Ir, modifying Ii affects support of |X|-1 itemsets in GX, where X∈CL. It is obvious that the itemset which has the smallest support in GX is the easiest to be hidden. This heuristic aims to protect those itemsets in order to restrict the impacts of the hiding process to L(𝒟,σ). Firstly, it identifies itemsets Y∈GX, where X∈CL and IlIr⊆X, which are the most vulnerable to the modification of each item in Ir.
Definition 18 (victim candidate).
The victim candidate for hiding a sensitive rule Il→Ir, denoted by Mmin(Ii,X), is a set of tuples, where each tuple contains four values: Ii∈Ir, itemset X∈CL such that IlIr⊆X, itemset Y∈GX such that Ii⊆Y and Y has minimum support in GX, and α(Y). It is computed as follows:
$$M_{\min}(I_i, X) = \bigl\{ (I_i, X, Y, \lambda) \bigm| \lambda = \min\{ \alpha(Y) \mid I_i \subseteq I_r \cap Y,\ Y \in G_X,\ I_l I_r \subseteq X,\ X \in C_L \} \bigr\}. \tag{13}$$
In order to maintain the sets GL and CL, the item to be modified is the one that shares a tuple with the itemset of maximum support among the elements of Mmin(Ii,X). Such an item is said to be the victim item and is defined as follows.
Definition 19 (victim item).
The victim item for hiding the sensitive rule Il→Ir, denoted by Ivictim, is an item needed to be modified in order to hide the rule such that the modification causes the lowest impacts on L(𝒟,σ), and it is computed as follows:
$$M_{\max\min}(I_l \to I_r) = \bigl\{ (I_{victim}, X, Z, \mu) \bigm| \mu = \max\{ \lambda \mid (I_i, X, Z, \lambda) \in M_{\min}(I_i, X) \} \bigr\}. \tag{14}$$
The function Mmaxmin(Il→Ir) identifies the item Ivictim that needs to be removed from transactions that support the rule Il→Ir. If Mmaxmin(Il→Ir) contains more than one tuple, then the victim item is selected randomly from those tuples.
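To make Heuristic 1 concrete, the following Python sketch applies (13) and (14) to the running example (supports from Table 2, coatoms ABC and ABD). It is an illustration under assumptions rather than the authors' implementation: itemsets are encoded as sorted strings, the function names are hypothetical, and ties are returned as a list instead of being broken randomly.

```python
# Supports of the frequent itemsets (Table 2) and the coatoms of the lattice.
SUPPORT = {"ABC": 4, "ABD": 3, "AB": 5, "AC": 5, "BC": 4, "AD": 3,
           "BD": 4, "A": 6, "B": 6, "C": 5, "D": 4}
COATOMS = ["ABC", "ABD"]

def victim_candidates(lhs, rhs, coatoms=COATOMS, support=SUPPORT):
    """Equation (13): tuples (item, X, Y, lambda) for each right-hand-side item."""
    rule_items = set(lhs) | set(rhs)
    candidates = []
    for x in coatoms:
        if not rule_items <= set(x):
            continue  # the coatom must contain the whole rule
        # G_X: subsets of X that lack exactly one item (Lemma 16).
        g_x = ["".join(sorted(set(x) - {i})) for i in x]
        for item in rhs:
            containing = [y for y in g_x if item in y]
            if not containing:
                continue
            lam = min(support[y] for y in containing)
            candidates += [(item, x, y, lam)
                           for y in containing if support[y] == lam]
    return candidates

def victim_items(lhs, rhs, coatoms=COATOMS, support=SUPPORT):
    """Equation (14): items whose most vulnerable itemset has maximum support."""
    candidates = victim_candidates(lhs, rhs, coatoms, support)
    mu = max(lam for _, _, _, lam in candidates)
    return sorted({item for item, _, _, lam in candidates if lam == mu})
```

For the rule D→AB, removing A would endanger AD (support 3), whereas removing B endangers BD (support 4), so `victim_items("D", "AB")` selects B.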
Theorem 20.
Equation (14) always returns a victim item for association rule hiding.
Proof.
By Lemmas 13 and 15, for every rule Il→Ir, there is an itemset X∈CL such that IlIr⊆X. Let Z∈GX, by Lemma 16, |Z|=|X|-1. In addition, |Ir|≤|IlIr|-1 so that |Ir|≤|Z|; therefore, there are |X|-1 itemsets Y∈GX such that Ii⊆(Ir∩Y). This indicates that the set Mmin(Ii,X) can always be specified. Obviously, we can find a tuple (Ivictim,X,Y,μ) where μ=max{α(Y)∣Y∈Mmin(Ii,X)}. In other words, the function Mmaxmin(Il→Ir) always returns the victim item Ivictim.
Theorem 21.
Modifying the victim item returned by (14) causes minimal impacts on the intersection lattice of frequent itemsets.
Proof.
According to (13), the set Mmin(Ii,X) contains all items Ii∈Ir and itemset in GL which is the most vulnerable to the modification of item Ii. Obviously, modifying an item which is contained in the same tuple with the itemset that has maximum support in Mmin(Ii,X) produces the lowest impacts on GL. Consequently, modifying Ivictim returned by (14) causes minimal impacts on L(𝒟,σ).
Heuristic 2 (specifying transaction for data modifications).
Assuming that both nonsensitive association rules X→Y and sensitive association rules Il→Ir are supported by transaction t, the rule X→Y is still strong if α(X→Y)≥σ and β(X→Y)≥δ. Let a positive integer k be assigned as the number of transactions required to be modified. To maintain the nonsensitive rule X→Y, k must satisfy the conditions α(X→Y)-k≥σ and (α(X→Y)-k)/α(X)≥δ.
Thus, since k is an integer, we have k≤α(X→Y)-σ and k≤α(X→Y)-⌈α(X)*δ⌉.
The maximal number of transactions that can be modified without hiding the nonsensitive association rule X→Y is
$$N(X \to Y) = \min\{ \alpha(X \cup Y) - \sigma,\ \alpha(X \cup Y) - \lceil \alpha(X) * \delta \rceil \}. \tag{15}$$
Transaction t is safe to the hiding process if no nonsensitive rule supported by t is hidden. We formulate the safety degree of transaction t, denoted by SD(t), as follows:
$$SD(t) = \min\{ N(X \to Y) \mid X \to Y \in R \setminus SR \ \wedge\ I_r \cap Y \ne \emptyset \ \wedge\ XY \subseteq t \}. \tag{16}$$
Accordingly, no nonsensitive rule supported by t is hidden if SD(t) is above zero. In other words, we need to maintain SD(t) during the hiding process in order to restrict the nonsensitive rules from being hidden. As a result, transactions that have a high safety degree should be modified first.
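Equations (15) and (16) can be evaluated on the running example as follows. This is a sketch under assumptions: the bracket in (15) is read as a ceiling, δ is passed as an exact `Fraction` to avoid floating-point rounding, the sensitive rule is taken to be D→AB, and a transaction with no qualifying rule is given an infinite safety degree. The function names are illustrative.

```python
from fractions import Fraction
from math import ceil, inf

SUPPORT = {"ABC": 4, "ABD": 3, "AB": 5, "AC": 5, "BC": 4, "AD": 3,
           "BD": 4, "A": 6, "B": 6, "C": 5, "D": 4}
SIGMA, DELTA = 3, Fraction(7, 10)

# Table 3 without the rule D -> AB, which is assumed to be the sensitive one.
NONSENSITIVE = [("AB", "C"), ("C", "AB"), ("AC", "B"), ("BC", "A"),
                ("AD", "B"), ("BD", "A"), ("A", "B"), ("B", "A"),
                ("A", "C"), ("C", "A"), ("C", "B"), ("D", "A"), ("D", "B")]

def key(items):
    """Canonical dictionary key of an itemset."""
    return "".join(sorted(items))

def max_modifications(lhs, rhs, support=SUPPORT, sigma=SIGMA, delta=DELTA):
    """Equation (15): N(X -> Y)."""
    s_rule = support[key(set(lhs) | set(rhs))]
    return min(s_rule - sigma, s_rule - ceil(support[key(lhs)] * delta))

def safety_degree(transaction, sensitive_rhs, rules=NONSENSITIVE):
    """Equation (16): SD(t) over nonsensitive rules supported by t."""
    t = set(transaction)
    values = [max_modifications(x, y) for x, y in rules
              if set(sensitive_rhs) & set(y) and set(x) | set(y) <= t]
    return min(values, default=inf)  # assumption: no qualifying rule -> infinity
```

Here `max_modifications("C", "A")` is 1, because a single modification leaves C→A with confidence 4/5, while a second one would drop it to 3/5.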
Let n_trans be the minimum number of transactions that need to be modified in order to hide the sensitive rule r. Then, n_trans can be computed as follows:
$$n\_trans = \min\{ \alpha(r) - \sigma + 1,\ \alpha(r) - \lceil \alpha(l_r) * \delta \rceil + 1 \}, \tag{17}$$
where lr is left hand side of r.
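As a quick check of (17), the sketch below computes n_trans for two rules of Table 3. The function name is illustrative, and δ is passed as a `Fraction` so that the ceiling is computed exactly (an implementation choice, not something the paper prescribes).

```python
from fractions import Fraction
from math import ceil

def min_transactions_to_hide(rule_support, lhs_support, sigma, delta):
    """Equation (17): fewest modifications that push support or confidence below threshold."""
    by_support = rule_support - sigma + 1
    by_confidence = rule_support - ceil(lhs_support * delta) + 1
    return min(by_support, by_confidence)

DELTA = Fraction(7, 10)
# D -> AB: alpha(r) = 3, alpha(D) = 4.
N_D_AB = min_transactions_to_hide(3, 4, sigma=3, delta=DELTA)
# C -> A: alpha(r) = 5, alpha(C) = 5.
N_C_A = min_transactions_to_hide(5, 5, sigma=3, delta=DELTA)
```

For C→A, two modifications reduce the confidence from 5/5 to 3/5, which is below 70%, whereas one modification leaves it at 4/5.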
Let Tr be a set of transactions that supports the rule r. Let Rt be a set of nonsensitive association rules supported by transaction t∈Tr, namely, Rt={X→Y∈R∖SR∣XY⊆t}. It is obvious that removing victim item from the transaction t that supports the lowest |Rt| and greatest |SR| and SD(t) causes the lowest impacts on L(𝒟,σ) and nonsensitive association rules.
For each transaction t∈Tr, a weight w(t) was assigned to measure ability of removing victim item from t so as to hide the sensitive rule r, but the modification causes the least impact on Rt:
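$$w(t) = \begin{cases} \infty & \text{if } R_t = \emptyset, \\ \dfrac{|SR| * SD(t)}{|R_t|} & \text{if } R_t \ne \emptyset. \end{cases} \tag{18}$$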
If a transaction t∈Tr does not support any nonsensitive association rule corresponding with r, w(t) is assigned the maximal value, because modifying such a transaction does not affect any nonsensitive rule. As a result, modifying high-weight transactions helps to restrict the lost rules.
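The weight in (18) and the resulting processing order can be sketched as follows. The per-transaction statistics in `STATS` are invented for illustration and do not come from the paper; |SR| is interpreted here as the number of sensitive rules supported by the transaction, which is an assumption based on the description in the introduction.

```python
from math import inf

def weight(n_sensitive, safety_degree, n_nonsensitive):
    """Equation (18): w(t) from |SR|, SD(t), and |R_t|."""
    if n_nonsensitive == 0:
        return inf  # no nonsensitive rule can be affected
    return n_sensitive * safety_degree / n_nonsensitive

def order_by_weight(stats):
    """Transaction ids sorted by descending weight.

    stats maps tid -> (|SR|, SD(t), |R_t|).
    """
    return sorted(stats, key=lambda tid: weight(*stats[tid]), reverse=True)

# Hypothetical statistics for three transactions.
STATS = {"t1": (1, 2, 4), "t2": (2, 3, 3), "t3": (1, 0, 0)}
ORDER = order_by_weight(STATS)
```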
5.3. The Proposed Algorithm
Based on the heuristics that are presented in Section 5.2, we propose a new algorithm, denoted by AARHIL (algorithm of association rule hiding based on intersection lattice), that includes two steps as follows.
Step 1 (initiation). AARHIL computes GL and CL of the intersection lattice of frequent itemsets L(𝒟,σ) using Theorem 10 and Lemma 15, respectively.
Step 2 (hiding process). AARHIL executes three sub-steps for each sensitive association rule r.
Step 2.1. AARHIL specifies a set of transactions, denoted by Tr, that fully support the sensitive rule r. The algorithm computes the weight of each transaction in Tr using (16) and (18). Then, it sorts Tr in descending order of weight.
Step 2.2. AARHIL specifies victim item using (13) and (14).
Step 2.3. The victim item will be changed when the support of the itemset in the same tuple as Ivictim becomes less than
$$\max\{ \alpha(Y) \mid (I_i, X, Y, \lambda) \in M_{\min}(I_i, X),\ \alpha(Y) \ne \alpha(Z),\ X \in C_L \}. \tag{19}$$
Thus, to save the time needed for updating L(𝒟′,σ), α(r), β(r), GL′, and CL′, the victim item is removed from k_trans transactions in Tr, where
$$k\_trans = \alpha(Z) - \max\{ \alpha(Y) \mid (I_i, X, Y, \lambda) \in M_{\min}(I_i, X),\ \alpha(Y) \ne \alpha(Z),\ X \in C_L \} + 1. \tag{20}$$
Next, AARHIL updates itemsets in L(𝒟′,σ), GL, and CL.
Since the victim item Ivictim is removed from transaction t, the support of every itemset that is supported by t and contains Ivictim is decreased one unit. The intersection lattice L(𝒟′,σ) can be updated by removing all itemsets that have support less than σ from L(𝒟,σ). The generating set of L(𝒟′,σ), denoted by GL′, can be updated as follows.
For each itemset X∈GL such that α(X)<σ,
$$GL'=GL\setminus\{X\}\cup\Big\{Y\ \Big|\ Y\in G_X,\ Y\neq\bigcap_{k\in\mathbb{N}^*,\,k\geq 2}Z_k,\ Z_k\in GL\setminus\{X\}\Big\}.\tag{21}$$
Then, CL′ of L(𝒟′,σ) is updated by taking the maximal itemsets of GL′: CL′ = MAX(GL′).
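The last operation, CL′ = MAX(GL′), selects the itemsets of GL′ that are not properly contained in any other itemset of GL′. A minimal Python sketch of this selection follows; the function name `maximal_itemsets` and the toy family are hypothetical, and the quadratic scan is chosen for clarity rather than speed.

```python
def maximal_itemsets(family):
    """Return the members of a family of itemsets that have no proper superset in it."""
    unique = {frozenset(s) for s in family}
    # s < other is the proper-subset test on frozensets.
    return {s for s in unique if not any(s < other for other in unique)}


# Hypothetical generating set GL' over the items a, b, c.
gl_prime = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
cl_prime = maximal_itemsets(gl_prime)
```

Here {a} and {c} are dropped because each is contained in a larger itemset of the family, leaving {a, b} and {b, c}.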
AARHIL then computes α(r) and β(r). The algorithm repeats this step until α(r)<σ or β(r)<δ.
The details of AARHIL algorithm are presented in Algorithm 1.
Algorithm 1: The AARHIL algorithm.
Input: Original database 𝒟, thresholds σ and δ, L(𝒟,σ), association rules R and sensitive
The correctness of AARHIL is proved by Theorem 20. Moreover, by Theorem 21 and the second heuristic, AARHIL hides a set of sensitive association rules with the fewest lost rules while maintaining a high accuracy. The complexity of AARHIL is given in Theorem 22.
Theorem 22.
The computational complexity of the AARHIL algorithm is $O(k_F^2+n+n_r\log n_r+k_{t_{\max}}^2)$, where $k_F$ is the number of frequent itemsets, $k_{t_{\max}}$ is the size of the largest transaction, $n_r$ is the greatest number of transactions supporting the sensitive rule, and $n$ is the size of the database (total number of transactions).
6. Experimental Results and Discussion
In order to measure the efficiency of the proposed model, we compared our algorithm with MaxMin2 [9], WSDA [22], the algorithm proposed by Jain et al. [15], denoted by JA (Jain Algorithm), and HCSRIL proposed by Hai et al. [24]. Moustakides and Verykios [9] showed that MaxMin2 is more efficient than the previous border-based approach [8], which in turn achieved better results than the heuristic Algorithm 2(b) in [13]. The WSDA algorithm applies a heuristic to select the appropriate transactions for modifying an item on the right-hand side of the sensitive rules. The experimental results indicated that WSDA is more efficient than Algorithm 1(b) in [13]. Jain et al. [15] proposed a new algorithm (JA) that improves on the ISL and DSR algorithms [28]. The HCSRIL algorithm applies a heuristic for victim item selection based on intersection lattice theory.
The experiments were run on the Windows 7 operating system with a Pentium Core i5 and 4 GB of RAM. They were executed using the Retail.dat dataset, which was donated by Brijs [29]. This dataset contains the retail market basket data from an anonymous Belgian retail store. It contains 88,162 transactions on 16,469 items. In order to examine the performance of the proposed algorithm compared with the previous works, we started the experiments with 30,000 transactions of the dataset on the 12,142 corresponding items and then extended the dataset up to the maximum. The configurations of the datasets are presented in Table 4.
Table 4: Configuration of datasets and number of association rules satisfying σ = 1% and δ = 10%.

| Number of transactions | Number of items | Largest transaction | Number of association rules |
|---|---|---|---|
| 30 k | 12,142 | 75 | 340 |
| 40 k | 13,462 | 75 | 316 |
| 50 k | 14,413 | 75 | 256 |
| 60 k | 14,997 | 75 | 239 |
| 70 k | 15,780 | 77 | 238 |
| 80 k | 16,014 | 77 | 236 |
| 88.162 k | 16,469 | 77 | 236 |
We selected two sensitive association rules for the experiments. The performances of these algorithms are illustrated in the following figures.
Figure 1 shows that the AARHIL algorithm produced the fewest lost rules on every dataset. In other words, AARHIL achieved the best results in minimizing the lost rules compared with the HCSRIL, WSDA, JA, and MaxMin2 algorithms. By applying the support reduction method (i), MaxMin2 produced many lost rules. JA combines methods (ii) and (iii), but it does not apply a heuristic to select victim items and transactions. Thus, it produced more lost rules than WSDA, which applies a heuristic to select transactions for data modification. AARHIL applies two heuristics to select appropriate victim items and transactions for data modification using the combination of methods (i) and (iii). Moreover, AARHIL computes the weight of transactions and sorts them before modifying them, so it produced fewer lost rules than HCSRIL.
Figure 1: Lost rules comparison.
Figure 2 indicates that these algorithms produce very few ghost rules. The AARHIL, HCSRIL, WSDA, and JA algorithms did not create ghost rules, whereas the number of ghost rules created by MaxMin2 is more than 0.4 percent.
Figure 2: Ghost rules comparison.
None of the algorithms produced false rules when dealing with the selected sensitive association rules on any of the datasets.
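The three rule-level side effects compared above can be computed from the rule sets mined before and after sanitization. The Python sketch below is illustrative only and rests on assumptions that this section does not spell out: lost rules are taken to be nonsensitive rules that disappear, ghost rules to be rules that appear only after sanitization, and false rules to be sensitive rules that remain minable; the denominators of the percentages (nonsensitive, original, and sensitive rule counts, respectively) are likewise assumed. The function name and the string encoding of rules are hypothetical.

```python
def side_effects(original_rules, sanitized_rules, sensitive_rules):
    """Percentages of lost, ghost, and false rules (definitions assumed, see lead-in)."""
    original = set(original_rules)
    sanitized = set(sanitized_rules)
    sensitive = set(sensitive_rules)
    nonsensitive = original - sensitive

    lost = nonsensitive - sanitized    # nonsensitive rules that disappeared
    ghost = sanitized - original       # rules that did not exist before
    false_rules = sensitive & sanitized  # sensitive rules that were not hidden

    def pct(part, whole):
        return 100.0 * len(part) / len(whole) if whole else 0.0

    return {
        "lost_pct": pct(lost, nonsensitive),
        "ghost_pct": pct(ghost, original),
        "false_pct": pct(false_rules, sensitive),
    }


# Toy rule sets; rules are encoded as strings for brevity.
before = {"a->b", "a->c", "b->c", "c->d", "b->d"}
after = {"a->c", "b->c", "c->d", "x->y"}
report = side_effects(before, after, sensitive_rules={"a->b"})
```

In the toy input, one of the four nonsensitive rules is lost, one new rule appears, and the single sensitive rule is hidden.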
Figure 3 compares the algorithms with respect to the accuracy of the released dataset. With two rules selected for hiding, the accuracy of the released dataset was very high. This means that the hiding process caused only a few changes in the released dataset compared with the original dataset. Moreover, because they modify the same number of data items, the AARHIL and HCSRIL algorithms achieved the same accuracy, which is the highest among the compared algorithms on every dataset.
Figure 3: Accuracy comparison.
The execution times of these algorithms are shown in Figure 4. These algorithms required only 2000 seconds to process 88,162 transactions of 16,469 items, whereas the MaxMin2 algorithm required more time than the others. The difference between the execution times of the HCSRIL and JA algorithms is not significant. By reducing the time to access the database and the time to compute GL, AARHIL achieved the lowest CPU-Time.
Figure 4: Required execution time.
Table 5 shows the performance of these algorithms in the average case. Accordingly, AARHIL achieved the best results in side effect minimization. On average, AARHIL produced 4% lost rules compared with 11% for HCSRIL, 19% for WSDA, 24% for JA, and 32% for MaxMin2. The algorithms performed identically on the remaining rule side effects, except that MaxMin2 produced 0.38 percent ghost rules. Moreover, AARHIL achieved the lowest CPU-Time compared with the others.
Table 5: Average side effects and CPU-Time produced by AARHIL, HCSRIL, WSDA, JA, and MaxMin2.

| Algorithm | Lost rule (%) | False rule (%) | Ghost rule (%) | Accuracy (%) | CPU-Time (s) |
|---|---|---|---|---|---|
| AARHIL | 4.20 | 0 | 0 | 99.74 | 55 |
| HCSRIL | 11.09 | 0 | 0 | 99.74 | 228 |
| WSDA | 19.18 | 0 | 0 | 99.59 | 692 |
| JA | 24.48 | 0 | 0 | 99.40 | 191 |
| MaxMin2 | 32.37 | 0 | 0.38 | 99.35 | 1219 |
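To put the averages in Table 5 into relative terms, the following Python sketch computes how many times more CPU-Time and lost rules each baseline produced relative to AARHIL. The numbers are copied from Table 5; the helper name `ratios_to_baseline` is hypothetical.

```python
# Average values copied from Table 5.
cpu_time_s = {"AARHIL": 55, "HCSRIL": 228, "WSDA": 692, "JA": 191, "MaxMin2": 1219}
lost_rule_pct = {"AARHIL": 4.20, "HCSRIL": 11.09, "WSDA": 19.18, "JA": 24.48, "MaxMin2": 32.37}


def ratios_to_baseline(values, baseline="AARHIL"):
    """Ratio of every other algorithm's value to the baseline algorithm's value."""
    base = values[baseline]
    return {name: value / base for name, value in values.items() if name != baseline}


cpu_ratio = ratios_to_baseline(cpu_time_s)
lost_ratio = ratios_to_baseline(lost_rule_pct)

for name in cpu_ratio:
    print(f"{name}: {cpu_ratio[name]:.2f}x CPU-Time, {lost_ratio[name]:.2f}x lost rules")
```

On these averages, the closest competitor in CPU-Time (JA) needs roughly 3.5 times as long as AARHIL, and MaxMin2 roughly 22 times as long.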
In summary, the results show that the AARHIL algorithm outperforms the HCSRIL, JA, MaxMin2, and WSDA algorithms in minimizing both side effects and computational cost. Hence, this algorithm is suitable for real-world applications.
7. Conclusion
This study introduced in detail the theory of the intersection lattice of frequent itemsets, denoted by L(𝒟,σ), and proposed an improvement to minimize the side effects and complexity of the intersection lattice-based approach. In order to minimize side effects, two heuristics are formulated relying on the properties of the generating set GL of L(𝒟,σ). The first heuristic aims at specifying the victim item for data distortion such that the modification causes the least impact on L(𝒟,σ). The improvement is applied in the second heuristic, which computes the weight of each transaction based on its safety degree, the number of sensitive rules, and the number of nonsensitive association rules contained in that transaction. Removing the victim item from the minimum number of specified transactions that have the highest weight contributes to achieving the fewest lost rules and the highest accuracy and to restricting ghost rules. The experimental results showed that the proposed algorithm, AARHIL, achieved minimum side effects and CPU-Time compared with the HCSRIL, MaxMin2, WSDA, and JA algorithms in the context of hiding a specified set of sensitive association rules.
Acknowledgment
The authors wish to acknowledge the support of the Department of Computer Science, Faculty of Science, Khon Kaen University Publication Clinic, Research and Technology Transfer Affairs, Khon Kaen University, for their assistance.
References
[1] J. Xi, Z. Gao, S. Niu, T. Ding, and G. Ning, "A hybrid algorithm of traffic accident data mining on cause analysis," vol. 2013, Article ID 302627, 8 pages, 2013. doi:10.1155/2013/302627
[2] M. Dong, "A tutorial on nonlinear time-series data mining in engineering asset health and reliability prediction: concepts, models, and algorithms," vol. 2010, Article ID 175936, 22 pages, 2010. doi:10.1155/2010/175936
[3] C. Gokceoglu, H. A. Nefeslioglu, E. Sezer, A. S. Bozkir, and T. Y. Duman, "Assessment of landslide susceptibility by decision trees in the metropolitan area of Istanbul, Turkey," vol. 2010, Article ID 901095, 15 pages, 2010. doi:10.1155/2010/901095
[4] Z. R. Adam, DAI-AGILE, 2000.
[5] D. Y. Zhang, Y. Zeng, L. Wang, H. Li, and Y. Geng, "Modeling and evaluating information leakage caused by inferences in supply chains," vol. 62, no. 3, pp. 351-363, 2011. doi:10.1016/j.compind.2010.10.002
[6] K. I. Ronald, J. Ross Publishing, Plantation, Fla, USA, 2005.
[7] A. Gkoulalas-Divanis and V. S. Verykios, LaVergne, Tenn, USA, 2011.
[8] X. Sun and P. S. Yu, "Hiding sensitive frequent itemsets by a border-based approach," vol. 1, no. 1, pp. 74-94, 2007.
[9] G. V. Moustakides and V. S. Verykios, "A MaxMin approach for hiding frequent itemsets," vol. 65, no. 1, pp. 75-89, 2008. doi:10.1016/j.datak.2007.06.012
[10] S. Menon, S. Sarkar, and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," vol. 16, no. 3, pp. 256-270, 2005. doi:10.1287/isre.1050.0056
[11] A. Gkoulalas-Divanis and V. S. Verykios, "An integer programming approach for frequent itemset hiding," in Proceedings of the 15th ACM Conference on Information and Knowledge Management (CIKM '06), pp. 748-757, November 2006. doi:10.1145/1183614.1183721
[12] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," in Proceedings of the Workshop on Knowledge and Data Engineering Exchange, 1999.
[13] V. S. Verykios, A. K. Elmagarmid, E. Bertino, Y. Saygin, and E. Dasseni, "Association rule hiding," vol. 16, no. 4, pp. 434-447, 2004. doi:10.1109/TKDE.2004.1269668
[14] S. R. M. Oliveira and O. R. Zaïane, "A unified framework for protecting sensitive association rules in business collaboration," vol. 1, no. 3, pp. 247-287, 2006.
[15] Y. K. Jain, V. K. Yadav, and G. S. Panday, "An efficient association rule hiding algorithm for privacy preserving data mining hiding," vol. 3, no. 7, pp. 2792-2798, 2011.
[16] E. T. Wang and G. Lee, "An efficient sanitization algorithm for balancing information privacy and knowledge discovery in association patterns mining," vol. 65, no. 3, pp. 463-484, 2008. doi:10.1016/j.datak.2007.12.005
[17] E. D. Pontikakis, A. A. Tsitsonis, and V. S. Verykios, "An experimental study of distortion-based techniques for association rule hiding," vol. 144, pp. 325-339, 2004.
[18] P. Gulwani, "Association rule hiding by positions swapping of support and confidence," vol. 4, pp. 54-61, 2012.
[19] D. Jain, A. Sinhal, and N. Gupta, "Hiding sensitive association rules without altering the support of sensitive item(s)," vol. 3, no. 2, pp. 75-84, 2012.
[20] Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," vol. 30, no. 4, pp. 45-54, 2001.
[21] T.-P. Hong, C.-W. Lin, C.-C. Chang, and S.-L. Wang, "Hiding sensitive itemsets by inserting dummy transactions," in Proceedings of the IEEE International Conference on Granular Computing (GrC '11), pp. 246-249, November 2011. doi:10.1109/GRC.2011.6122602
[22] V. S. Verykios, E. D. Pontikakis, Y. Theodoridis, and L. Chang, "Efficient algorithms for distortion and blocking techniques in association rule hiding," vol. 22, no. 1, pp. 85-104, 2007. doi:10.1007/s10619-007-7013-0
[23] L. Q. Hai and A. Somjit, "A Conceptual framework for privacy preserving of association rule mining in E-commerce," in Proceedings of the 7th IEEE Conference on Industrial Electronics and Applications (ICIEA '12), pp. 1999-2003, 2012.
[24] L. Q. Hai, A. Somjit, N. X. Huy, and A. Ngamnij, "Association rule hiding in risk management for retail supply chain collaboration," vol. 64, pp. 776-784, 2013.
[25] L. Q. Hai, A. Somjit, and A. Ngamnij, "Association rule hiding based on distance and intersection lattice," in Proceedings of the 4th International Conference on Computer Technology and Development (ICCTD '12), pp. 227-231, 2012.
[26] C. Zhang and S. Zhang, vol. 2307 of Lecture Notes in Artificial Intelligence, Springer, New York, NY, USA, 2002.
[27] G. Grätzer, Springer Basel AG, xxx+613 pages, 2011. doi:10.1007/978-3-0348-0018-1. MR2768581, Zbl 1249.06024
[28] S.-L. Wang, Y.-H. Lee, S. Billis, and A. Jafari, "Hiding sensitive items in privacy preserving association rule mining," in Proceedings of the IEEE International Conference on Systems, Man and Cybernetics (SMC '04), pp. 3239-3244, October 2004. doi:10.1109/ICSMC.2004.1400839
[29] T. Brijs, "Retail market basket data set," in Workshop on Frequent Itemset Mining Implementations (FIMI '03), 2003, http://fimi.ua.ac.be/data/