Reducing Side Effects of Hiding Sensitive Itemsets in Privacy Preserving Data Mining

Data mining is traditionally adopted to retrieve and analyze knowledge from large amounts of data. Private or confidential data may be sanitized or suppressed before it is shared or published. Privacy preserving data mining (PPDM) has thus become an important issue in recent years. The most general approach in PPDM is to sanitize the database to hide the sensitive information. In this paper, a novel hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. The transaction with the maximal ratio of sensitive to nonsensitive itemsets is selected to be entirely deleted. Three side effects, namely hiding failures, missing itemsets, and artificial itemsets, are considered to evaluate whether a transaction needs to be deleted to hide the sensitive itemsets. Three weights express the importance of the three factors and can be set according to user requirements. Experiments are then conducted to show the performance of the proposed algorithm in execution time, number of deleted transactions, and number of side effects.


Introduction
With the rapid growth of data mining technologies in recent years, useful information can be easily mined to aid managers or decision-makers in making efficient decisions or strategies. The derived knowledge can be broadly classified into association rules [1][2][3][4][5], sequential patterns [6][7][8], classification [9,10], clustering [11,12], and utility mining [13][14][15][16], among others. Among them, association-rule mining is the most commonly used to determine the relationships of purchased items in large datasets.
Traditional data mining techniques analyze databases to find potential relations among items. Some applications require protection against the disclosure of private, confidential, or secure data. Privacy preserving data mining (PPDM) [17] was thus proposed to reduce privacy threats by hiding sensitive information while allowing required information to be mined from databases. Private information includes personal or confidential business information, such as social security numbers, home addresses, credit card numbers, credit ratings, purchasing behavior, and best-selling commodities. In PPDM, data sanitization is generally used to hide sensitive information with minimal side effects, keeping the sanitized database as close to the original as possible. The intuitive way to sanitize data is to directly delete the sensitive information from the data. Three side effects, namely hiding failure, missing cost, and artificial cost, are generated in the data sanitization process, but most approaches evaluate them only partially. In particular, infrequent itemsets are not considered in the evaluation process, which raises the probability of generating artificial itemsets. Besides, the differences between the minimum support threshold and the frequencies of the itemsets to be hidden are not considered in the above approaches.
In this paper, a hiding-missing-artificial utility (HMAU) algorithm is proposed to evaluate the processed transactions and determine whether they need to be deleted to hide the sensitive itemsets. Three dimensions are considered: the hiding failure dimension (HFD), the missing itemset dimension (MID), and the artificial itemset dimension (AID). The weight of each dimension in the evaluation process can be adjusted by users. Experimental results show that the proposed HMAU algorithm performs well in execution time and in the number of deleted transactions. Besides, the proposed algorithm generates fewer side effects in the three factors than the previous transaction-deletion algorithm for hiding the sensitive itemsets.
This paper is organized as follows. Related works are reviewed in Section 2, including data mining techniques, privacy preserving data mining, and the evaluation criteria of PPDM. The proposed HMAU algorithm for hiding the sensitive itemsets through transaction deletion is stated in Section 3. An example illustrating the proposed HMAU algorithm step by step is given in Section 4. Experiments are reported in Section 5. Conclusions and future works are given in Section 6.

Review of Related Works
In this section, privacy preserving data mining (PPDM) techniques and evaluated criteria of PPDM are respectively reviewed.

Privacy Preserving Data Mining Techniques.
Data mining is used to extract useful rules from large amounts of data. Agrawal and Srikant proposed the Apriori algorithm to mine association rules in two phases, first generating the frequent itemsets and then deriving the association rules [3]. Han et al. then proposed the Frequent-Pattern-tree (FP-tree) structure for efficiently mining association rules without generating candidate itemsets [18]. The FP-tree was used to compress a database into a tree structure which stored only large items. It was condensed and complete for finding all the frequent patterns. The construction process was executed tuple by tuple, from the first transaction to the last one. After that, a recursive mining procedure called FP-Growth was executed to derive frequent patterns from the FP-tree.
Through various data mining techniques, information can thus be efficiently discovered. The misuse of these techniques may, however, lead to privacy concerns and security problems. Privacy preserving data mining (PPDM) has thus become a critical issue for hiding private, confidential, or secure information. Most commonly, the original database is sanitized for hiding sensitive information [19][20][21].
In data sanitization, it is intuitive to directly delete sensitive data for hiding sensitive information. Leary found that data mining techniques can pose security and privacy threats [22]. Amiri proposed the aggregate, disaggregate, and hybrid approaches to, respectively, determine whether the transactions or the items are to be deleted for hiding sensitive information [23]. The approaches considered the ratio of sensitive itemsets to nonsensitive frequent itemsets to evaluate the side effects of hiding failures and missing itemsets. Oliveira and Zaïane designed the sliding window algorithm (SWA) [24], in which the victim item with the highest frequency in the sensitive rules related to the current sensitive transaction is selected. Victim items are removed from the sensitive transaction until the disclosure threshold equals 0. Hong et al. proposed a lattice-based algorithm that hides the sensitive information through itemset deletion and uses a lattice structure to speed up the sanitization process [25]. All the sensitive itemsets are first used to build the lattice structure. The sensitive itemsets are then gradually deleted bottom-up, from the lowest levels to the highest ones, until the frequencies of the sensitive itemsets are lower than the minimum support threshold. Different strategies for hiding sensitive itemsets are still being designed to obtain better results in terms of side effects and database dissimilarity [21][26][27][28][29][30].

Evaluation Criteria.
In data sanitization, the primary goal is to hide the sensitive information with minimal influence on the database. Three side effects, namely hiding failures, missing itemsets, and artificial itemsets, are used to evaluate the performance of data sanitization in terms of the data distortion [28,31,32] caused by hiding sensitive itemsets in PPDM. The relationships between the side effects and the itemsets mined from the original database and the sanitized one are shown in Figure 1.
In Figure 1, $F$ represents the frequent itemsets mined from the original database, $F'$ represents the frequent itemsets mined from the sanitized database, and $S$ represents the sensitive itemsets that should be hidden. The $\alpha$ part corresponds to hiding failures, that is, sensitive itemsets that the sanitization fails to hide. Thus, $\alpha$ is the intersection of $S$ and $F'$ ($\alpha = S \cap F'$). The $\beta$ part corresponds to missing itemsets, that is, nonsensitive frequent itemsets that are mistakenly removed. Thus, $\beta$ is the difference between $F$, $S$, and $F'$ ($\beta = F - S - F'$). The $\gamma$ part corresponds to artificial itemsets, which are unexpectedly generated. Thus, $\gamma$ is the difference between $F'$ and $F$ ($\gamma = F' - F$). In PPDM, it is intuitive to delete transactions with sensitive itemsets in the sanitization process. In this paper, $\alpha$, $\beta$, and $\gamma$ with adjustable weights are considered to evaluate whether the processed transactions are required to be deleted. Besides the above side effects, the number of deleted transactions or items is also a criterion to evaluate the data distortion [32,33].
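The three side effects described above reduce to plain set operations, which the following minimal sketch illustrates. The function name, the variable names, and the toy itemsets are assumptions made for illustration only; they do not come from the paper's tables.

```python
def side_effects(original_frequent, sanitized_frequent, sensitive):
    """Return (hiding_failures, missing_itemsets, artificial_itemsets)."""
    # Sensitive itemsets that are still frequent after sanitization.
    hiding_failures = sensitive & sanitized_frequent
    # Nonsensitive frequent itemsets that are no longer frequent.
    missing = original_frequent - sensitive - sanitized_frequent
    # Itemsets that are frequent only after sanitization.
    artificial = sanitized_frequent - original_frequent
    return hiding_failures, missing, artificial


# Hypothetical itemsets, each represented as a frozenset of item labels.
F = {frozenset("a"), frozenset("b"), frozenset("c"),
     frozenset("ab"), frozenset("bc")}
S = {frozenset("ab")}
F_prime = {frozenset("a"), frozenset("b"), frozenset("d")}

alpha, beta, gamma = side_effects(F, F_prime, S)
print(len(alpha), len(beta), len(gamma))
```

Counting the elements of the three returned sets gives the number of hiding failures, missing itemsets, and artificial itemsets of the kind reported in the experiments.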

Proposed Hiding-Missing-Artificial Utility Algorithm

Definition of Formulas.
Data sanitization is the most common way to protect sensitive knowledge from disclosure in PPDM. To limit the side effects of hiding failures, missing itemsets, and artificial itemsets, minimal distortion of the database is necessary. In this paper, a hiding-missing-artificial utility (HMAU) algorithm is proposed to hide sensitive itemsets through transaction deletion. Three dimensions, the hiding failure dimension (HFD), the missing itemset dimension (MID), and the artificial itemset dimension (AID), are considered to evaluate whether the transactions are required to be deleted for hiding the sensitive itemsets. The transactions containing any of the sensitive itemsets are first evaluated by the designed algorithm to find the minimal HMAU value among them. The transaction with the minimal HMAU value is then directly removed from the database. The procedure is repeated until all sensitive itemsets are hidden. In order to avoid exposing the already hidden sensitive itemsets again, the minimum count is dynamically updated during the deletion procedure. The value of each dimension is set from 0 to 1 (0 < value ≤ 1). In the proposed formulas, the differences between the minimum support threshold and the frequencies of the sensitive itemsets are considered to evaluate whether the transactions are required to be deleted, instead of only the presence of the itemsets in the transactions.
First, the HFD is used to evaluate the hiding failures of each processed transaction in the sanitization process. When a processed transaction $T_k$ contains a sensitive itemset $hs_i$, the HFD value of the processed transaction is calculated by formula (1), where $\lambda$ is the minimum support threshold expressed as a percentage, $hs_i$ is a sensitive itemset from the set of sensitive itemsets HS, $MAX_{HS}$ is the maximal count of the sensitive itemsets in HS, $|D|$ is the number of transactions in the original database $D$, and $freq(hs_i)$ is the occurrence frequency of the sensitive itemset $hs_i$.
Second, the MID is used to evaluate the missing itemsets of each processed transaction in the sanitization process. When a processed transaction $T_k$ contains a frequent itemset $fi_j$, the MID value of the processed transaction is calculated by formula (2), where $fi_j$ is a frequent itemset from the set of large (frequent) itemsets FI, $MAX_{FI}$ is the maximal count of the large itemsets in FI, and $freq(fi_j)$ is the occurrence frequency of the large itemset $fi_j$.
Third, the AID is used to evaluate the artificial itemsets of each processed transaction in the sanitization process. In AID, only the small 1-itemsets are considered, since it is a nontrivial task to keep all infrequent itemsets. When a processed transaction $T_k$ contains a small 1-itemset $si_l$, the AID value of the processed transaction is calculated by formula (3), where $si_l$ is a small 1-itemset from the set of small 1-itemsets $SI^1$, $MIN_{SI^1}$ is the minimal count of the small 1-itemsets in $SI^1$, and $freq(si_l)$ is the occurrence frequency of the small 1-itemset $si_l$.
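The bodies of formulas (1) to (3) are not legible in this text, so the following is only a hedged reconstruction from the quantities the definitions name. The AID form is chosen to agree with the worked value (3 − 3 + 1)/(4 − 3) = 1 in the illustrated example and to stay within (0, 1]; the HFD form mirrors it, and the MID form is assumed to parallel HFD. All function names are hypothetical.

```python
import math
from fractions import Fraction


def min_count(num_transactions, min_support):
    """Minimum count ceil(|D| * threshold); pass the threshold as a Fraction."""
    return math.ceil(num_transactions * min_support)


def hfd_itemset(freq, max_hs, mc):
    # Assumed form: close to 1 when the sensitive itemset barely reaches
    # the minimum count, close to 0 when it is far above it.
    return (max_hs - freq + 1) / (max_hs - mc + 1)


def mid_itemset(freq, max_fi, mc):
    # Assumed to parallel HFD, using the counts of the large itemsets.
    return (max_fi - freq + 1) / (max_fi - mc + 1)


def aid_itemset(freq, min_si1, mc):
    # Matches the worked example: (3 - 3 + 1) / (4 - 3) = 1.
    return (freq - min_si1 + 1) / (mc - min_si1)


mc = min_count(10, Fraction(2, 5))      # 10 transactions, 40% threshold
print(mc)
print(hfd_itemset(6, 6, mc), hfd_itemset(4, 6, mc))
print(aid_itemset(3, 3, mc))
```

Passing the threshold as a `Fraction` avoids floating-point surprises inside the ceiling, which matters when the product is meant to be an exact integer.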
In this paper, a risky bound is designed to reduce the execution time of the proposed HMAU algorithm: instead of evaluating all large itemsets and small 1-itemsets when computing MID and AID, only those close to the minimum support threshold are considered. A parameter $\mu$, the risky bound, is set as the percentage used to find the upper and lower boundaries around the minimum support threshold. Only the large itemsets and small 1-itemsets within the boundaries are used to determine whether the processed transactions are required to be deleted. For the large itemsets, the minimum support threshold is set as the lower boundary, and the upper boundary is set as
$$freq(fi_j) \le \lceil \lceil |D| \times \lambda \rceil \times (1 + \mu) \rceil,$$
where $|D|$ is the number of transactions in the original database $D$, $\lambda$ is the minimum support threshold, $\mu$ is the risky bound, and $freq(fi_j)$ is the occurrence frequency of the large itemset $fi_j$. For small 1-itemsets, the minimum support threshold is set as the upper boundary, and the lower boundary is set as
$$freq(si_l) \ge \lfloor \lceil |D| \times \lambda \rceil \times (1 - \mu) \rfloor,$$
where $freq(si_l)$ is the occurrence frequency of the small 1-itemset $si_l$. The flowchart of the proposed HMAU algorithm is depicted in Figure 2.
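The two boundaries can be checked with a few lines of code. This is a small sketch that follows the conditions stated in Steps 2 and 3 of the algorithm; the function and parameter names (`risky_bounds`, `min_support`, `risky_bound`) are placeholders chosen here, and thresholds are passed as `Fraction` values so the ceiling and floor are exact.

```python
import math
from fractions import Fraction


def risky_bounds(num_transactions, min_support, risky_bound):
    """Return (minimum count, lower bound for small 1-itemsets,
    upper bound for large itemsets)."""
    mc = math.ceil(num_transactions * min_support)
    upper = math.ceil(mc * (1 + risky_bound))
    lower = math.floor(mc * (1 - risky_bound))
    return mc, lower, upper


def large_in_bound(freq, mc, upper):
    # A large itemset is evaluated only if its count is at most the upper bound.
    return mc <= freq <= upper


def small_in_bound(freq, mc, lower):
    # A small 1-itemset is evaluated only if its count is at least the lower bound.
    return lower <= freq < mc


# Settings of the illustrated example: 10 transactions, 40% threshold, 10% bound.
mc, lower, upper = risky_bounds(10, Fraction(2, 5), Fraction(1, 10))
print(mc, lower, upper)
```

With these settings, large itemsets with counts 4 or 5 and small 1-itemsets with count 3 fall inside the bounds, which is consistent with the counts listed in the illustrated example.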

Notation. See Table 1.
Details of the proposed HMAU algorithm are illustrated as follows.

Proposed HMAU Algorithm.
Input. This includes an original database $D$, a minimum support threshold ratio $\lambda$, a risky bound $\mu$, a set of large (frequent) itemsets FI = {$fi_1$, $fi_2$, ...}, a set of small 1-itemsets $SI^1$, and a set of sensitive itemsets HS to be hidden.
Output. This includes a sanitized database $D^*$ with no sensitive information.
Step 1. Select the transactions $T_k$ to form a projected database $D'$, where each transaction $T_k$ in $D'$ contains at least one sensitive itemset $hs_i$ from HS, $1 \le i \le |HS|$.
Step 2. Process each frequent itemset $fi_j$ in the set of FI to determine whether its frequency satisfies the condition $freq(fi_j) \le \lceil \lceil |D| \times \lambda \rceil \times (1 + \mu) \rceil$, where $|D|$ is the number of transactions in the original database $D$, $\lambda$ is the minimum support threshold, $\mu$ is the risky bound, and $freq(fi_j)$ is the occurrence frequency of the large itemset $fi_j$. Put the $fi_j$ that do not satisfy the condition into the set of $FI_{tmp}$.
Step 3. Process each small 1-itemset $si_l$ in the set of $SI^1$ to determine whether its frequency satisfies the condition $freq(si_l) \ge \lfloor \lceil |D| \times \lambda \rceil \times (1 - \mu) \rfloor$, where $freq(si_l)$ is the occurrence frequency of the small 1-itemset $si_l$. Put the $si_l$ that do not satisfy the condition into the set of $SI^1_{tmp}$.
Step 4. Calculate the maximal count ($MAX_{HS}$) of the sensitive itemsets $hs_i$ in the set of HS as
$$MAX_{HS} = \max\{freq(hs_i) \mid hs_i \in HS\},$$
where $freq(hs_i)$ is the occurrence frequency of the sensitive itemset $hs_i$ in the set of HS.
Step 5. Calculate the HFD of each transaction $T_k$. Do the following substeps.
Step 6. Calculate the maximal count ($MAX_{FI}$) of the large itemsets $fi_j$ in the set of FI as
$$MAX_{FI} = \max\{freq(fi_j) \mid fi_j \in FI\}.$$
Table 1: Notation.
$HS_{tmp}$: The temporary set of sensitive itemsets outside the boundary
$FI_{tmp}$: The temporary set of large itemsets outside the boundary
$SI^1_{tmp}$: The temporary set of small 1-itemsets outside the boundary
$freq(fi_j)$: The occurrence frequency of the large itemset $fi_j$
$MIN_{SI^1}$: The minimal count of the small 1-itemsets in the set of $SI^1$
$freq(si_l)$: The occurrence frequency of the small 1-itemset $si_l$
$w_1$, $w_2$, $w_3$: The weights for HFD, MID, and AID, in which $0 < w \le 1$
HMAU: The utility value used to determine whether the processed transactions should be deleted
Step 7. Calculate the MID of each transaction $T_k$. Do the following substeps.
Substep 7.3. Normalize the MID for all transactions in $D'$.
Step 8. Calculate the minimal count ($MIN_{SI^1}$) of the small 1-itemsets $si_l$ in the set of $SI^1$ as
$$MIN_{SI^1} = \min\{freq(si_l) \mid si_l \in SI^1\}.$$
Step 9. Calculate the AID of each transaction $T_k$. Do the following substeps.
Substep 9.1. Calculate the AID of each small 1-itemset within $T_k$ by formula (3).
Substep 9.2. Sum the AIDs of the small 1-itemsets $si_l$ within $T_k$.
Step 10. Calculate the HMAU for HFD, MID, and AID of each transaction $T_k$ as
$$HMAU_k = w_1 \times HFD_k + w_2 \times MID_k + w_3 \times AID_k, \qquad (15)$$
where $w_1$, $w_2$, and $w_3$ are the weights predefined by users.
Step 13. Update the occurrence frequencies of all sensitive itemsets in the sets of HS and $HS_{tmp}$. Put $hs_i$ into the set of $HS_{tmp}$ if $freq(hs_i)$ < minimum count (= $\lceil |D| \times \lambda \rceil$, where $|D|$ is the number of transactions remaining after the deletion), and put $hs_i$ into the set of HS otherwise.
Step 14. Update the occurrence frequencies of all large itemsets in the sets of FI and $FI_{tmp}$. Put $fi_j$ into the set of $FI_{tmp}$ if $freq(fi_j)$ < minimum count (= $\lceil |D| \times \lambda \rceil$), and put $fi_j$ into the set of FI otherwise.
Step 15. Update the occurrence frequencies of all small 1-itemsets in the sets of $SI^1$ and $SI^1_{tmp}$. Put $si_l$ into the set of $SI^1_{tmp}$ if $freq(si_l)$ ≥ minimum count (= $\lceil |D| \times \lambda \rceil$), and put $si_l$ into the set of $SI^1$ otherwise.
Step 16. Repeat Step 2 to Step 15 until the set of HS is empty (|HS| = 0).
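The stopping condition in Steps 13 and 16 depends on a minimum count that shrinks as transactions are deleted. The sketch below illustrates that bookkeeping under the assumption that the minimum count is recomputed from the current number of transactions, as the illustrated example suggests. The labels `hs_1` and `hs_2` and the function names are hypothetical.

```python
import math
from fractions import Fraction


def min_count(num_transactions, min_support):
    """Minimum count ceil(|D| * threshold) for the current database size."""
    return math.ceil(num_transactions * min_support)


def split_sensitive(counts, num_transactions, min_support):
    """Split sensitive itemsets into those still exposed (kept in HS)
    and those already hidden (moved to the temporary set)."""
    mc = min_count(num_transactions, min_support)
    exposed = {name: c for name, c in counts.items() if c >= mc}
    hidden = {name: c for name, c in counts.items() if c < mc}
    return exposed, hidden


threshold = Fraction(2, 5)  # 40% minimum support threshold

# Counts 6 and 4 drop to 5 and 3 once a transaction containing both
# sensitive itemsets is deleted from a 10-transaction database.
exposed, hidden = split_sensitive({"hs_1": 5, "hs_2": 3}, 9, threshold)
print(exposed, hidden)

# After four deletions, six transactions remain and the minimum count drops.
print(min_count(6, threshold))
```

The loop of Step 16 terminates once `exposed` is empty, which corresponds to |HS| = 0.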

An Illustrated Example
In this section, an example is used to illustrate the proposed algorithm step by step. Consider a database with 10 transactions (tuples) and 6 items, shown in Table 2. Each transaction can be considered a set of purchased items in a trade. The minimum support threshold is initially set at 40%, and the risky bound is set at 10%. A set HS of two sensitive itemsets, with counts of 6 and 4, is to be hidden by the sanitization process.
Based on an Apriori-like approach [3], the large (frequent) itemsets and small 1-itemsets are mined. The results are, respectively, shown in Tables 3 and 4.
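The split into large itemsets and small 1-itemsets can be sketched as follows. This is a brute-force enumeration standing in for the Apriori-like approach, suitable only for very small databases; the five transactions are hypothetical and are not the contents of Table 2.

```python
import math
from collections import Counter
from fractions import Fraction
from itertools import combinations


def mine(transactions, min_support):
    """Return (large itemsets, small 1-itemsets) with their counts."""
    mc = math.ceil(len(transactions) * min_support)
    counts = Counter()
    for transaction in transactions:
        items = sorted(transaction)
        # Count every nonempty subset of the transaction once.
        for size in range(1, len(items) + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    large = {k: c for k, c in counts.items() if c >= mc}
    small_1 = {k: c for k, c in counts.items() if len(k) == 1 and c < mc}
    return large, small_1


# Hypothetical database with a 60% minimum support threshold.
database = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}, {"a", "b", "d"}]
large, small_1 = mine(database, Fraction(3, 5))
print(large)
print(small_1)
```

Enumerating all subsets is exponential in the transaction length, so a level-wise method such as Apriori [3] is the realistic choice for databases of practical size.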
The proposed algorithm then proceeds as follows to sanitize the database for hiding all sensitive itemsets in HS.
Step 1. The transactions in $D$ that contain any of the sensitive itemsets in HS are selected. In this example, transactions 1, 3, 6, 7, 8, and 10 are selected to form the projected database $D'$ shown in Table 5.
Step 4. The maximal count ($MAX_{HS}$) among the sensitive itemsets in the set of HS is then calculated. In this example, the counts of the two sensitive itemsets are 6 and 4, so the maximal count is $MAX_{HS} = \max\{6, 4\} = 6$.
Step 5. The HFD of each transaction is calculated to evaluate the side effects of hiding failures of the processed transaction.
The HFDs for all transactions are then normalized as shown in Table 7.
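The substeps that aggregate and normalize the HFD values are not legible in this text, so the following sketch rests on two assumptions: the per-transaction value is 1/(1 + sum of the itemset values), mirroring the computation AID 7 = 1/(1 + 1) shown in Step 9 of this example, and normalization divides by the largest value among the selected transactions. Under these assumptions, a transaction containing both sensitive itemsets obtains 0.57, the HFD value used for transaction 7 in Step 10. The per-itemset form and all names are hypothetical.

```python
def hfd_itemset(freq, max_hs, mc):
    # Assumed per-itemset form; its value lies in (0, 1].
    return (max_hs - freq + 1) / (max_hs - mc + 1)


def hfd_transaction(freqs, max_hs, mc):
    # Assumed aggregation: more (and more fragile) sensitive itemsets
    # in a transaction give a smaller value.
    return 1 / (1 + sum(hfd_itemset(f, max_hs, mc) for f in freqs))


def normalize(values):
    # Assumed normalization: divide by the largest value.
    top = max(values.values())
    return {key: value / top for key, value in values.items()}


MAX_HS, MIN_COUNT = 6, 4  # values taken from the example
raw = {
    "both": hfd_transaction([6, 4], MAX_HS, MIN_COUNT),
    "first_only": hfd_transaction([6], MAX_HS, MIN_COUNT),
}
normalized = normalize(raw)
print({key: round(value, 2) for key, value in normalized.items()})
```

Since six transactions are selected and the sensitive itemsets have counts 6 and 4, four of the selected transactions contain both itemsets and two contain only the first one, which is why only two cases appear here.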
Step 6. The maximal count ($MAX_{FI}$) among the large itemsets in the set of FI is then calculated. In this example, the five large itemsets have counts of 5, 5, 4, 5, and 5, so $MAX_{FI} = \max\{5, 5, 4, 5, 5\} = 5$.
Step 7. The MID of each transaction is calculated to evaluate the side effects of missing itemsets of the processed transaction. The MID of each frequent itemset in a transaction is calculated according to formula (2). The results are shown in Table 8.
The MIDs for all transactions are then normalized as shown in Table 9.
Step 8. The minimal count ($MIN_{SI^1}$) among the small 1-itemsets in the set of $SI^1$ is then calculated. In this example, $SI^1$ contains only one small 1-itemset, whose count is 3, so $MIN_{SI^1} = \min\{3\} = 3$.
Step 9. The AID of each transaction is calculated to evaluate the side effects of artificial itemsets of the processed transaction. The small 1-itemset in transaction 7 is used as an example to illustrate the steps. According to formula (3), the AID of this small 1-itemset is calculated as (3 − 3 + 1)/(4 − 3) = 1; since there is only one itemset in the set of $SI^1$, no other calculations are necessary. The AID of transaction 7 is calculated as $AID_7 = 1/(1 + 1) = 0.5$. The other transactions are processed in the same way. The results are shown in Table 10.
The AIDs for all transactions are then normalized as shown in Table 11.
Step 10. The three dimensions for evaluating the selected transactions are then organized as in Table 12. The weights of hiding failures, missing itemsets, and artificial itemsets are, respectively, set to 0.5, 0.4, and 0.1. Note that these values can be defined by users to decide the relative importance of the dimensions. In this example, the HMAU of transaction 7 is calculated as
$$HMAU_7 = 0.5 \times 0.57 + 0.4 \times 1 + 0.1 \times 0.5 = 0.735. \qquad (16)$$
The other transactions are processed in the same way. The results are shown in the last column of Table 12.
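The weighted sum of Step 10 and the selection of Step 11 can be written in a few lines. The weights and the three values for transaction 7 come from the example; the function names and the two scores passed to `pick_victim` are hypothetical, since the remaining rows of Table 12 are not available in this text.

```python
def hmau(hfd, mid, aid, weights=(0.5, 0.4, 0.1)):
    """Weighted sum of the three normalized dimensions of one transaction."""
    w1, w2, w3 = weights
    return w1 * hfd + w2 * mid + w3 * aid


def pick_victim(scores):
    """Return the identifier of the transaction with the minimal HMAU value."""
    return min(scores, key=scores.get)


hmau_7 = hmau(0.57, 1, 0.5)
print(round(hmau_7, 3))

# Hypothetical scores used only to show the selection step.
print(pick_victim({"t_a": 0.735, "t_b": 0.61}))
```

Because the weights are ordinary parameters, a user who cares more about artificial itemsets than about missing itemsets could, for instance, pass `weights=(0.5, 0.1, 0.4)` instead.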
Step 11. The selected transactions in Table 12 are then evaluated to find a transaction with the minimal HMAU value.

In this example, transaction 8 has the minimal value and is directly removed from Table 12.
Step 13. The occurrence frequencies of all sensitive itemsets in the sets of HS and $HS_{tmp}$ are, respectively, updated. Since transaction 8, which was deleted in Step 11, contained both sensitive itemsets, their counts in the set of HS are, respectively, updated to 6 − 1 = 5 and 4 − 1 = 3. The set of $HS_{tmp}$ is empty at this point, so no counts need to be updated there. After the updating process, the sensitive itemset with a count of 3 is put into the set of $HS_{tmp}$ since its count is below the minimum count (3 < 4).
Step 14. The occurrence frequencies of all large itemsets in the sets of FI and $FI_{tmp}$ are, respectively, updated. After the updating process, a large itemset whose count has dropped to 3 is put into the set of $FI_{tmp}$ since its count is below the minimum count (3 < 4).
Step 15. The occurrence frequencies of all small 1-itemsets in the sets of $SI^1$ and $SI^1_{tmp}$ are, respectively, updated. Since transaction 8 did not contain any of the small 1-itemsets in $SI^1$ and $SI^1_{tmp}$, nothing is done in this step.
Step 16. In this example, one of the two sensitive itemsets is already hidden, but the occurrence frequency of the other is still larger than the minimum count. Steps 2 to 15 are repeated until the set of sensitive itemsets HS is empty (|HS| = 0). After all steps are processed, the sanitized database is obtained as shown in Table 13.
Comparing the original database and the sanitized one, transactions 1, 3, 6, and 8 are removed from the original database, and the minimum count is updated as 3. The updated frequent itemsets of the sanitized database are shown in Table 14.
Compared with the large itemsets in Table 3, both sensitive itemsets are hidden and no artificial itemset is generated. Three nonsensitive frequent itemsets are, however, missing from the sanitized database. In this example, the numbers of hiding failures, missing itemsets, and artificial itemsets are 0, 3, and 0, respectively.

Experimental Results
Experiments are conducted to compare the performance of the proposed HMAU algorithm with that of the aggregate algorithm [23] for hiding sensitive itemsets through transaction deletion. The experiments were coded in C++ and performed on a personal computer with an Intel Core i7-2600 processor at 3.40 GHz and 4 GB of RAM running 64-bit Microsoft Windows 7. The real database BMS-WebView-1 [34] and a synthetic database (T7I7N200D20K) [35] from the IBM data generator were used in the experiments. In the name of the synthetic database, T symbolizes the average length of the transactions, I symbolizes the average maximum size of frequent itemsets, N symbolizes the number of different items, and D symbolizes the size of the database. The details of the two databases are shown in Table 15.
For the BMS-WebView-1 database, the minimum support thresholds were, respectively, set at 1% and 2% to evaluate the performance of the proposed approach, and the percentages of sensitive itemsets were sequentially set from 5% to 25% of the number of frequent itemsets in 5% increments. In the experiments, the weights of HFD, MID, and AID in the proposed algorithm were, respectively, set at 0.5, 0.4, and 0.1.
For the T7I7N200D20K database, the minimum support thresholds were, respectively, set at 1.5% and 3%, and the percentages of sensitive itemsets were sequentially set from 2.5% to 12.5% of the number of frequent itemsets in 2.5% increments. In the experiments, the weights of HFD, MID, and AID in the proposed algorithm were, respectively, set at 0.5, 0.4, and 0.1.

Comparisons of Execution Time.
Figure 3 shows the execution time of the two algorithms on the BMS-WebView-1 database. The two algorithms are compared under different minimum support thresholds and various sensitivity percentages of the frequent itemsets. The proposed HMAU algorithm runs faster than the aggregate algorithm whether the minimum support threshold is set at 1% or 2%. The experiment is then conducted on the T7I7N200D20K database, and the results are shown in Figure 4.
From Figures 3 and 4, it can be seen that the proposed HMAU algorithm is faster than the aggregate algorithm on both databases.

Comparisons of Number of Deleted Transactions.
Experiments were also conducted to evaluate the number of deleted transactions of the proposed algorithm in two different databases. For the BMS-WebView-1 database, the results are shown in Figure 5.
From Figure 5, it can be seen that the proposed HMAU algorithm deletes fewer transactions than the aggregate algorithm on the BMS-WebView-1 database whether the minimum support threshold is set at 1% or 2%, thus achieving lower data distortion. For the T7I7N200D20K database, the results are shown in Figure 6.
From Figure 6, it can be seen that when the sensitive itemsets were set at 10% of the frequent itemsets with a 1.5% minimum support threshold on the T7I7N200D20K database, the proposed HMAU algorithm deleted more transactions to hide the sensitive itemsets. Since the proposed HMAU algorithm considers the three dimensions together, a transaction may be selected for deletion because it contains few large itemsets rather than because it contains many sensitive itemsets.

Comparisons of Side Effects.
Three side effects are then compared to show the performance of the proposed algorithm on the two databases. The side effects of hiding failures, missing itemsets, and artificial itemsets are, respectively, symbolized as $\alpha$, $\beta$, and $\gamma$. In Table 16, it can be seen that when the minimum support threshold was set at 1%, the proposed HMAU algorithm produced no side effects, whereas the aggregate algorithm produced some artificial itemsets because artificial itemsets are not considered in the aggregate algorithm. Both algorithms produced no side effects when the minimum support threshold was set at 2%. The side effects of the proposed HMAU algorithm on the T7I7N200D20K database are shown in Table 17.
From Table 17, it can be seen that when the minimum support threshold was set at 1.5%, the proposed HMAU algorithm produced fewer artificial itemsets and missing itemsets than the aggregate algorithm for various sensitivity percentages of the frequent itemsets. The proposed HMAU algorithm produced no side effects at the 3% minimum support threshold, whereas the aggregate algorithm produced some artificial itemsets.
To summarize the above results for BMS-WebView-1 and T7I7N200D20K databases, the proposed HMAU algorithm outperforms the aggregate algorithm in terms of the execution time, the number of deleted transactions, and the number of side effects.

Conclusion and Future Works
In this paper, the HMAU algorithm is proposed for hiding sensitive itemsets in the data sanitization process by reducing the side effects through transaction deletion. The formulas of the three dimensions, HFD, MID, and AID, are defined to evaluate whether the processed transactions are required to be deleted.
In the future, the sensitive itemsets to be hidden can be extended to sensitive association rules. Further consideration is then needed to decrease not only the supports of the sensitive itemsets but also the confidences of the sensitive association rules. Other distortion approaches, such as noise addition and data modification, are also important issues for hiding sensitive information in PPDM.