Privacy Protection Method for Multiple Sensitive Attributes Based on Strong Rule

At present, most studies on data publishing consider only a single sensitive attribute, and work on multiple sensitive attributes is still scarce. Almost all existing studies on multiple sensitive attributes have not taken the inherent relationship between sensitive attributes into account, so an adversary can use background knowledge about this relationship to attack the privacy of users. This paper presents an attack model based on the association rules between sensitive attributes and, accordingly, presents a data publication model for multiple sensitive attributes. Proof and analysis show that the new model prevents an adversary from using background knowledge about association rules to attack privacy and that it yields high-quality released information. Finally, experiments verify these conclusions.


Introduction
Data publishing is widely used in information sharing and scientific research, and the core problem is how to ensure both the availability of data and the security of users' privacy. Data tables usually contain three types of attribute: the identifier, which identifies an individual uniquely, for example, the social security number (SSN); the quasi-identifier (QI), which cannot identify an individual uniquely but provides individual information, for example, the country and age attributes; and the sensitive attribute ($S$), which is usually related to the privacy of users, for example, the disease attribute. The sensitive attribute needs to be protected in the published table.

A series of data publishing methods [1-11] have been presented to prevent an adversary from linking quasi-identifying attributes with publicly available datasets to reveal personal identity. Papers [1,2] present the $k$-anonymity method, which partitions the table into equivalence groups (EG). Each equivalence group consists of at least $k$ different records, and $k$-anonymity generalizes the quasi-identifier attributes of records in the same equivalence group. But $k$-anonymity faces the risk of sensitive attribute disclosure due to lack of diversity. To solve this problem, [4] proposes $l$-diversity, which not only satisfies $k$-anonymity but also requires at least $l$ different sensitive attribute values in each equivalence group. In addition, the privacy protection methods in [5-11] improve $k$-anonymity from different angles.

Most of these methods consider only a single sensitive attribute, so some privacy protection methods for multiple sensitive attributes have been presented. Papers [12,13] attempt to use $l$-diversity directly for multiple sensitive attributes, which results in a lot of information loss. Paper [14] protects users' privacy by disturbing the order of sensitive attribute values within the same equivalence group. But this method needs to add fake sensitive attribute values to the EG, and it breaks the relationship between sensitive attributes, so useful relationships cannot be provided. The publication method in paper [15] can prevent an adversary from using nonmembership knowledge to attack the data table, but its strict grouping condition results in excessive information loss. According to the theory of paper [16], the publication methods of [12,13,15] cannot ensure good diversity and are vulnerable to the background-join attack, so paper [16] divides the raw data table into several projected tables, puts the sensitive attributes that have strong dependency into the same projected table, and finally makes each projected table satisfy $t$-closeness. But this method ignores the association rules with high confidence, and the adversary can use knowledge of these rules to obtain the privacy of users. To avoid the suppression of records, paper [17] presents a new publication method, which generalizes each sensitive attribute separately. But like other privacy protection methods for multiple sensitive attributes, paper [17] ignores the inherent relationship between sensitive attributes, so the adversary can use the related background knowledge to attack privacy, and it is difficult for users to find valuable relationships in its released tables. To resolve this problem, this paper introduces association rules into the design of the privacy protection method and presents an improved data publishing model for multiple sensitive attributes based on the work of [17].

The Main Work of This Paper
Most existing research on privacy-preserving technology for multiple sensitive attributes has not taken the inherent relationship between sensitive attributes into account, so an adversary can sometimes use the related background knowledge to attack the privacy of users, and some valuable relationships cannot be provided by the released tables. Faced with this situation, we introduce association rules into the research on data publishing. The main contributions of this paper are as follows.
(1) It analyses the data publishing model Rating of paper [17], points out its weakness, and presents an attack method based on strong rules (Section 3).
(2) It takes the relationship between different sensitive attributes into account, presents a mixed data publishing model based on Rating, improves the algorithm of Rating to make it more efficient, and finally analyses and proves the correctness of the algorithm and the security of the mixed data publication model (Section 4).
(3) It proves that the new model has better quality of released information than Rating in theory (Section 5).
(4) Through the experiments, it verifies that the new data publishing model can provide better privacy, and it is able to preserve valuable relationships between sensitive attributes in released tables (Section 6).

The Analysis of Rating
This section briefly introduces the Rating model [17] and presents an attack method that uses strong rules to obtain users' privacy from the tables released by Rating. In [17], the attribute table (AT) displays the name of each SID, and users use the SID name to look up the corresponding SID value in the ID table (IDT). For convenience of description, AT displays the SID value directly in this paper.

The Weakness of Rating.
Rating adopts a generalization strategy that is suitable for multiple sensitive attributes and can improve the quality of the released information. But Rating ignores the relationship between different sensitive attributes, so disclosure of users' privacy may sometimes happen.
Compared with data publishing for a single sensitive attribute, publishing multiple sensitive attributes not only increases the number of sensitive attributes but also requires more effective methods to decrease information loss, and an adversary may use the relationship between sensitive attributes to attack the privacy of users. So when designing a data publishing model for multiple sensitive attributes, association rules should be taken into account. An attack method is presented as follows.
Assume the original table basically reflects the real world. If Bob knows that his neighbor Alice is in Table 3 and that Alice is 23 years old and Chinese, Bob is sure that $t_1$ is Alice according to the quasi-identifier. Through common sense or previous investigations, Bob knows that if event $a_1$ happens to someone, the probability of occurrence of $b_1$ is usually not less than 75%; namely, $P(b_1 \mid a_1) \ge 75\%$. From the ID table Bob learns that there are four people whose $S_1$ values are $a_1$, so among these four people there are at least three whose $S_2$ values must be $b_1$. That is to say, there are at least three people whose $S_1$ values are $a_1$ while their $S_2$ values are $b_1$.
So Bob begins to analyze the attribute table. He finds that there are 8 records whose $S_1$ value may be $a_1$, namely $t_1, t_5, t_6, t_7, t_8, t_9, t_{10}, t_{12}$, but among these records only 3 have $S_2$ values that may be $b_1$, namely $t_1, t_5, t_6$. Bob can therefore be sure that $t_1.S_1 = t_5.S_1 = t_6.S_1 = a_1$ and $t_1.S_2 = t_5.S_2 = t_6.S_2 = b_1$. Since $t_1$ is Alice, the privacy of Alice is disclosed. This is an example of an attack with association rules. Because Rating has not taken the correlation between sensitive attributes into account, the privacy of users may be disclosed if the adversary masters the corresponding background knowledge. So in the next section, an improved model is presented.
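The reasoning Bob follows can be sketched in a few lines of Python. This is only an illustration: the function name `strong_rule_attack`, the dictionary layout, and the concrete candidate sets are hypothetical and are not taken from Table 3; only the counts (eight records that may hold $a_1$, three of which may also hold $b_1$, and four occurrences of $a_1$ in the ID table) follow the example above.

```python
import math

def strong_rule_attack(candidates, rule, confidence, count_a):
    """Return the records an adversary can pin down with the rule a => b.

    candidates: record id -> (possible S1 values, possible S2 values),
                as read from a generalized attribute table (hypothetical layout).
    count_a:    number of records whose S1 value is a, read from the ID table.
    """
    a, b = rule
    # At least this many records must hold both a and b (small epsilon guards
    # against floating-point noise in confidence * count_a).
    must_hold_both = math.ceil(confidence * count_a - 1e-9)
    # Records whose published candidate sets allow both values.
    possible = {rid for rid, (s1, s2) in candidates.items() if a in s1 and b in s2}
    # If exactly as many records can hold the pair as must hold it, all are exposed.
    return possible if len(possible) == must_hold_both else set()

# Hypothetical candidate sets: 8 records may hold a1, only 3 of them may hold b1.
candidates = {
    "t1": ({"a1", "a2"}, {"b1", "b2"}),
    "t5": ({"a1", "a3"}, {"b1", "b3"}),
    "t6": ({"a1", "a2"}, {"b1", "b2"}),
    "t7": ({"a1", "a3"}, {"b2", "b3"}),
    "t8": ({"a1", "a2"}, {"b2", "b4"}),
    "t9": ({"a1", "a4"}, {"b3", "b4"}),
    "t10": ({"a1", "a3"}, {"b2", "b3"}),
    "t12": ({"a1", "a4"}, {"b2", "b4"}),
    "t2": ({"a2", "a3"}, {"b1", "b4"}),
}
exposed = strong_rule_attack(candidates, ("a1", "b1"), 0.75, count_a=4)
print(sorted(exposed))
```

The sketch treats the rule confidence as a hard lower bound on the count, which is the same simplification Bob makes in the example.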

The Data Publishing for Multiple Sensitive Attributes Based on Strong Rules
Usually we set a minimum support threshold (min support) and a minimum confidence threshold (min confidence); if an association rule satisfies both thresholds, the rule is meaningful and is called a strong rule. In this paper, as long as a sensitive attribute value $v$ appears once, the adversary may use it to attack the privacy of users, so min support is set to 1, and min confidence is set by users.
The confidence of a strong rule is relatively high, and an adversary who has the related background knowledge may use the strong rule to attack users' privacy. On the other hand, we hope to preserve the information of strong rules in the released data, because it is valuable. So we put the records containing strong rules into table SAC and process table SAC with consideration for strong rules. A record $t$ contains a strong rule if $\exists\, t.S_i, t.S_j$ ($1 \le i, j \le m$) such that $t.S_i \Rightarrow t.S_j$ is a strong rule. For a value $v$, if there is a strong rule $A \Rightarrow B$ with $v = A$ or $v = B$, we call $v$ a strong value.

Partition Table.
Assuming the association rules in the original data table are close to the situation of the real world, first use the classic association rule mining algorithm Apriori [18] to find all the strong rules in the table, according to the min confidence set by the user (line 1). Then find the sensitive attribute $S$ that has the largest number of strong values, put all other sensitive attributes into $S_{set}^{\sim}$, and if a record contains a strong value in $S_{set}^{\sim}$, add it to table SAC (line 2). Finally, delete the records contained in SAC from the original table to get table IR (line 3), so the original table $T$ is divided into two tables: SAC and IR. All records that contain strong association rules are in SAC, and all the strong values in IR belong to the same sensitive attribute, so there are no probability tilts caused by strong association rules in IR (see Section 4.2.2).

Algorithm 7 (partition table).
Input: original data table $T$, min confidence
Output: table SAC, table IR
(1) According to min confidence, find all strong rules with Apriori.
(2) Find the records that contain strong values in $S_{set}^{\sim}$, and put them into table SAC.
(3) Delete the records contained in SAC from $T$ to get table IR.
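The partition step can be sketched as follows. This is a minimal illustration under stated assumptions: it replaces Apriori with a brute-force count over value pairs of two different sensitive attributes (sufficient for single-value rules with min support 1), breaks ties between attributes by taking the first one, and uses hypothetical names (`find_strong_rules`, `partition_table`) and toy data.

```python
from collections import Counter, defaultdict
from itertools import permutations

def find_strong_rules(table, sens, min_conf):
    """Brute-force stand-in for Apriori: single-value rules a => b across attributes."""
    single, pair = Counter(), Counter()
    for rec in table:
        for s in sens:
            single[(s, rec[s])] += 1
        for si, sj in permutations(sens, 2):
            pair[((si, rec[si]), (sj, rec[sj]))] += 1
    # min support is 1, so every co-occurring pair qualifies on support.
    return {rule for rule, n in pair.items() if n / single[rule[0]] >= min_conf}

def partition_table(table, sens, min_conf):
    rules = find_strong_rules(table, sens, min_conf)
    strong_values = defaultdict(set)
    for (attr_a, a), (attr_b, b) in rules:
        strong_values[attr_a].add(a)
        strong_values[attr_b].add(b)
    # Attribute with the most strong values is left out (ties: first wins, an assumption).
    kept = max(sens, key=lambda s: len(strong_values[s]))
    s_set_tilde = [s for s in sens if s != kept]
    in_sac = [any(r[s] in strong_values[s] for s in s_set_tilde) for r in table]
    sac = [r for r, flag in zip(table, in_sac) if flag]
    ir = [r for r, flag in zip(table, in_sac) if not flag]
    return sac, ir, rules

# Hypothetical toy table with two sensitive attributes.
rows = [("a1", "b1"), ("a1", "b1"), ("a1", "b1"), ("a1", "b2"),
        ("a2", "b2"), ("a2", "b3"), ("a3", "b3"), ("a3", "b2")]
table = [{"S1": a, "S2": b} for a, b in rows]
sac, ir, rules = partition_table(table, ["S1", "S2"], min_conf=0.75)
print(len(sac), len(ir), sorted(rules))
```

In this toy table the only strong rules are between $a_1$ and $b_1$; the record $(a_1, b_2)$ stays in IR because its only strong value belongs to the attribute that was left out of $S_{set}^{\sim}$.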

Partition Sensitive Attributes. After partitioning the table, we begin to process table SAC.
In order to preserve the information of strong rules, we cluster the sensitive attributes. First, we need to define the distance between sensitive attributes.
Definition 8 (distance between sensitive attributes). Given two sensitive attributes $S_i$, $S_j$ ($1 \le i, j \le m$), the distance between the two sensitive attributes is defined as
$$\text{distance}(S_i, S_j) = \begin{cases} 1, & \text{there are no strong rules between } S_i \text{ and } S_j, \\ 0, & \text{there are strong rules between } S_i \text{ and } S_j. \end{cases}$$
Here, if $\exists v_i \in \text{domain}(S_i)$, $\exists v_j \in \text{domain}(S_j)$ such that $v_i \Rightarrow v_j$ or $v_j \Rightarrow v_i$ is a strong rule, we say there are strong rules between $S_i$ and $S_j$; otherwise, there are no strong rules between $S_i$ and $S_j$.

Definition 9 (distance between sensitive attribute and cluster).
Assuming $C$ is a cluster and $S_i$ ($1 \le i \le m$) is a sensitive attribute, the distance between $C$ and $S_i$ is written distance($S_i$, $C$).
In this method, similar sensitive attributes are put into the same cluster. As long as the set of sensitive attributes is not empty, new clusters are generated (line 1). For each new empty cluster $C$, pick a sensitive attribute $S_i$ ($1 \le i \le m$) from the sensitive attribute set $S_{set}$ in order and put $S_i$ into $C$; $S_i$ is the first attribute of cluster $C$ (line 2). Find all sensitive attributes $S_j \in S_{set}$ ($1 \le j \le m$) that satisfy distance($S_j$, $C$) = 0, add $S_j$ to $C$, and delete $S_j$ from $S_{set}$ (lines 3 to 4). The other clusters are generated in the same way.
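The clustering procedure can be sketched as below. The sketch makes two assumptions that the text does not confirm: the distance between an attribute and a cluster is taken as the minimum pairwise distance to the cluster's members (single linkage), and the scan over the remaining attributes is repeated until no attribute at distance 0 is left. The function name `cluster_attributes` and the input encoding are hypothetical.

```python
def cluster_attributes(attrs, strong_pairs):
    """Cluster sensitive attributes that are linked by strong rules.

    attrs:        ordered list of attribute names (the set S_set).
    strong_pairs: set of frozenset({Si, Sj}) for which strong rules exist.
    """
    def distance(si, sj):
        # Definition 8: 0 if strong rules exist between the two attributes, else 1.
        return 0 if frozenset((si, sj)) in strong_pairs else 1

    def distance_to_cluster(si, cluster):
        # Assumption: single linkage (minimum distance to any member).
        return min(distance(si, sj) for sj in cluster)

    remaining = list(attrs)
    clusters = []
    while remaining:                      # generate clusters until S_set is empty
        cluster = [remaining.pop(0)]      # first attribute of the new cluster
        changed = True
        while changed:                    # assumption: repeat until nothing more joins
            changed = False
            for s in list(remaining):
                if distance_to_cluster(s, cluster) == 0:
                    cluster.append(s)
                    remaining.remove(s)
                    changed = True
        clusters.append(cluster)
    return clusters

# Strong rules exist only between S1 and S2.
print(cluster_attributes(["S1", "S2", "S3"], {frozenset(("S1", "S2"))}))
```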
Lemma 12. For a record $t$ in the released tables and a strong rule $A \Rightarrow B \in SR(T)$, the probability $P(t : A \Rightarrow B)$ that the adversary can attribute $A \Rightarrow B$ to $t$ is not more than $1/l$. Here, $SR(T)$ represents the set of strong rules in table $T$.
Proof. If $t \in$ IR, there are no records containing strong rules in table IR, so $P(t : A \Rightarrow B) = 0$. If $t \in$ SAC$^{\sim}$, each group $G$ in SAC$^{\sim}$ has at least $l$ records; if there is a record containing $A \Rightarrow B$ in $G$, then by the nature of random permutation each record in $G$ contains $A \Rightarrow B$ with equal probability $1/l$. So even if the adversary knows, through related background knowledge, how many records contain $A \Rightarrow B$, $P(t : A \Rightarrow B)$ is not more than $1/l$.
In the example of Section 3.3, the adversary can be sure with probability 100% that some records contain strong rules. The SAC$^{\sim}$ table prevents the adversary from using strong rules to attack users' privacy.

Improved Rating (IR).
The table IR is processed by the improved Rating. For each sensitive attribute $S$, Rating [17] hashes $t.S$ in $T$ by value (each bucket corresponds to one value). If $S$ has $n$ different values $(v_1, v_2, \ldots, v_n)$, this gives sequence = {bucket$_1$, bucket$_2$, ..., bucket$_n$}, where bucket$_i$ ($1 \le i \le n$) contains as many entries as $v_i$ has occurrences in $T$. Every time, Rating chooses the $l$ buckets that have the largest size and takes one value from each of these $l$ buckets to make up an SID. The SIDs replace the corresponding sensitive attribute values in the original table to give the attribute table, and the SIDs themselves make up the ID table. Every time Rating generates an SID, it needs to reselect the $l$ buckets, because the sizes of the buckets have been updated by the last SID generation. Paper [17] does not present an algorithm for choosing the $l$ largest buckets, so this paper presents a heuristic algorithm for choosing buckets.

In the best case the time complexity of the heuristic algorithm is $O(1)$. The worst situation is that sequence[1].size$^{\sim}$ < sequence[$n$].size$^{\sim}$; the algorithm then needs $n - l + 1$ comparisons, and the time complexity is $O(n - l + 1)$. On this input the algorithm therefore needs fewer comparisons than re-sorting the whole sequence with most existing sorting algorithms.

SID Creation.
This part introduces the algorithm for creating SIDs. The definition of a dangerous bucket is introduced first.
Definition 18 (dangerous bucket). Assuming sequence belongs to sensitive attribute $S$, for each bucket $b \in$ sequence, if $b.\text{size}/N \ge 1/l$, $b$ is a dangerous bucket in $S$. Here, $N$ is the size of domain($S$).
The SID creation is given in Algorithm 19. For each sensitive attribute $S_i$, generate its sequence and move the dangerous buckets from sequence to SDB. Every time a new SID is generated, one value is removed from each bucket in SDB and added to the SID (line 6), and one value is removed from each of the buckets sequence[1], sequence[2], ..., sequence[$l - |SDB|$], which have the largest sizes, and added to the SID (line 7). When an SID is completed, call Algorithm 13 to sort the sequence. In the step of processing residual values, each value $v$ in a nonempty bucket is moved to an SID that does not contain value $v$ (line 9).
(3) Find the dangerous buckets, and put them in SDB.
(4) Remove the dangerous buckets from sequence;
(5) While there are at least $l - |SDB|$ nonempty buckets in sequence, generate a new SID and repeat (6) to (8).
The proof of Lemma 21 reaches a contradiction, so the dangerous bucket is always one of the $l$ largest buckets. Each new SID should therefore take one value from every dangerous bucket, and the dangerous buckets need not be considered when sorting sequence. This further improves the efficiency of the algorithm.
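The SID creation steps described above can be sketched for one sensitive attribute as follows. This is an illustrative sketch, not the implementation of Algorithm 19: it re-sorts the non-dangerous buckets with Python's built-in stable sort instead of Algorithm 13, it stops as soon as a dangerous bucket runs empty (an added guard), and the name `create_sids` and the toy input are hypothetical.

```python
from collections import Counter

def create_sids(values, l):
    """Build SIDs of l distinct values each for one sensitive attribute."""
    counts = Counter(values)              # one bucket per distinct value
    total = len(values)                   # N in Definition 18
    # Dangerous bucket: size / N >= 1 / l.
    sdb = [v for v, c in counts.items() if c * l >= total]
    seq = sorted((v for v in counts if v not in sdb), key=lambda v: -counts[v])
    need = l - len(sdb)
    sids = []
    # Guard (assumption): every dangerous bucket must still hold a value.
    while all(counts[v] > 0 for v in sdb) and sum(counts[v] > 0 for v in seq) >= need:
        chosen = sdb + seq[:need]         # one value from each chosen bucket
        for v in chosen:
            counts[v] -= 1
        sids.append(list(chosen))
        seq.sort(key=lambda v: -counts[v])  # stable re-sort (stand-in for Algorithm 13)
    # Residual values: place each into an SID that does not contain it yet.
    ungrouped = []
    for v in list(counts):
        while counts[v] > 0:
            target = next((s for s in sids if v not in s), None)
            if target is None:
                ungrouped.extend([v] * counts[v])
                counts[v] = 0
            else:
                target.append(v)
                counts[v] -= 1
    return sids, ungrouped

sids, ungrouped = create_sids(["a"] * 4 + ["b"] * 3 + ["c"] * 2 + ["d"], l=3)
print(sids, ungrouped)
```

In the toy input, value `a` has frequency 4/10, which is at least 1/3, so its bucket is dangerous and contributes to every SID.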
Besides, in this improved Rating, the algorithm for AT and IDT creation is the same as in Rating: it uses each SID to generalize the corresponding values in the IR table, obtains AT after processing IR, and then uses the set of SIDs to make up IDT. Finally, AT and IDT are released together with the SAC$^{\sim}$ table.
Here we assume the adversary masters strong rules and summarize the security of the mixed model. For the released table SAC$^{\sim}$, because each group contains at least $l$ records without the same values, and by Lemma 12, SAC$^{\sim}$ satisfies $l$-diversity. For the released AT, we first discuss the problem of probability tilts. For a record $t$ after generalization, if a possible value of $t.S_i$ and a possible value of $t.S_j$ form a strong rule, there will be a probability tilt between $t.S_i$ and $t.S_j$. According to the method of partitioning the table, all the strong values in AT belong to the same sensitive attribute, so probability tilts will not happen in AT. Besides, each SID consists of at least $l$ different values, so AT also satisfies $l$-diversity. Through the above analysis, the mixed model satisfies $l$-diversity.

Analysis and Proof of Information Availability
This section analyzes the information loss of the new model in terms of the availability of association rules and the quality of the published data table. The mixed model preserves all the strong association rules, and users can obtain the confidence of strong association rules from the released tables, whereas the Rating model breaks all the relationships between sensitive attributes and generates unnecessary information loss. This part uses the reconstruction error (RCE) [9,17] to measure the quality of the published tables. Assume the original table $T = (\text{QI}_1, \text{QI}_2, \ldots, \text{QI}_d, S_1, S_2, \ldots, S_m)$ defines a $d + m$ dimensional space $DS^{d+m}$; each record $t$ in table $T$ has a probability density function (pdf) over this space.

The Quality of Published Data Table
Here, $x$ is a $d + m$ dimensional variable in $DS^{d+m}$. A record $t$ in the released table of Rating has its own pdf, written $G_t^{rating}(x)$.

Assume the cluster set is $\{C_1, C_2, \ldots, C_k\}$ in the mixed model; then $(\text{QI}_1, \text{QI}_2, \ldots, \text{QI}_d, C_1, C_2, \ldots, C_k)$ defines a $d + k$ dimensional space $DS^{d+k}$. In the released tables of the mixed model, if $t \in$ SAC$^{\sim}$, the pdf of $t$ is defined over this space. Here, $x$ is a $d + k$ dimensional variable in $DS^{d+k}$, $t.C_i^{\sim}$ represents the set of the possible values of $t.C_i$, and $|t.C_i^{\sim}|$ represents the number of the possible values. For example, in Table 6, a user wants to reconstruct the pdf of $t_1$; in his view, $t_1.C_1$ can be $(a_4, b_2)$ or $(a_1, b_1)$ with equal probability 1/2, and $t_1.C_2$ can be $c_2$ or $c_1$ with equal probability 1/2, so, treating the two clusters independently, each of the four combinations has probability 1/4. If $t \in$ AT, the pdf of $t$ is $G_t^{rating}(x)$. So in the released tables of the mixed model, the pdf of $t$ takes the SAC$^{\sim}$ form when $t \in$ SAC$^{\sim}$ and equals $G_t^{rating}(x)$ when $t \in$ AT.

We measure the distance between the released tables of the mixed model and the original table by the reconstruction error between $t$ and $t_{mm}$, where $t$ is a record in the original table and $t_{mm}$ is the form of $t$ in the released tables of the mixed model. Similarly, $t_{rating}$ is the form of $t$ in the released table of Rating, and the distance between the released table of the Rating model and the original table is measured between $t$ and $t_{rating}$. The smaller the distance, the higher the quality of the released table. Taking all the records of the table into account, the mixed model has lower RCE than Rating, which means the released tables of the mixed model are closer to the original table. The linking between sensitive attributes in the same cluster is preserved, so the mixed model has higher quality than Rating.

Experiment
The experiment uses the real dataset Adult (http://archive.ics.uci.edu/ml/datasets/Adult); after deleting the incomplete records we obtain 30718 records. The experiment consists of four parts: (1) execution time, (2) additional information loss, (3) accuracy rate of mining strong association rules, and (4) probability of privacy disclosure. We choose {education, occupation, age, relationship} as sensitive attributes and use three settings: 2 sensitive attributes {education, occupation}, 3 sensitive attributes {education, occupation, age}, and 4 sensitive attributes {education, occupation, age, relationship}. Unless stated otherwise, the experiments use the default parameters: the number of records is 30718, and the min confidence is 80%.
6.1. Execution Time. This paper presents an improved algorithm of Rating that mainly improves the SID creation; the AT and IDT creation is the same as in Rating. So this part compares the improved SID creation with the old one. Here, the old SID creation uses the stable bubble sort algorithm when choosing the $l$ largest buckets. We set $l = 3$ with 2 sensitive attributes and choose a certain number of records randomly. Figure 1 shows that the improved SID creation runs much faster than the old one because of the heuristic search, and that the improved SID creation is more suitable for large datasets.
Figure 2 shows the effect of sensitive attributes number on execution time.Because age has much more different values than other 3 sensitive attributes, Bubble sort needs more time to compare.After adding age attribute, the running time of SID creation increases rapidly.So the running time of SID creation is influenced seriously by the number of different values of sensitive attribute.

Additional Information Loss.
This part compares the additional information loss (AIL) [12] of the mixed model with that of Rating. To make AIL more suitable for the mixed model and Rating, we slightly change its definition. Assume sensitive attribute $S$ has several SIDs in IDT; the additional information loss of $S$ is computed from the sizes $|\text{SID}_j|$ of these SIDs, where $|\text{SID}_j|$ represents the number of values in $\text{SID}_j$. The additional information loss of table $T$ aggregates the losses of its sensitive attributes $\{S_1, S_2, \ldots, S_m\}$.
Figure 3 shows the AIL of the mixed model and Rating; when $l$ increases, the additional information losses of both models generally increase. The additional information loss of the mixed model is slightly higher than that of Rating but is always less than 0.03%, while the security of the mixed model is enhanced. Here, Rating uses the stable bubble sort algorithm in SID creation. This part of the experiment also finds an interesting phenomenon: if the sorting algorithm in SID creation is unstable, the additional information loss is much higher than with a stable sorting algorithm. This phenomenon needs to be studied further.
Figure 4 shows the effect of the minimum confidence on additional information loss.When the minimum confidence decreases, the additional information loss increases.Because more records are put in SAC table, the available sensitive attribute values in the IR table will be less, and additional information loss increases.
Figure 5 shows the effect of the number of sensitive attributes on additional information loss. Because all the sensitive attributes in Rating are processed independently, the AIL of Rating is almost not influenced by the number of sensitive attributes. For the mixed model, when the number of sensitive attributes increases, more records are moved to the SAC table and fewer values are available for grouping in the IR table, so the AIL of the mixed model grows with the number of sensitive attributes but remains within an acceptable range.

The Accuracy of Mining Strong Association Rules.
Strong rules tend to be valuable in practice, so this experiment analyzes the ability of the publication models to provide strong rules. We use the method of Lemma 22 to mine strong association rules from the released tables of the mixed model, and in the released tables of Rating and the raw data table we use Apriori to calculate the confidence of strong association rules.
Figures 6 and 7 show the average confidence of strong rules in the three tables. If all the records in SAC and all values in IR can be grouped, users can accurately calculate the confidence of strong rules from the released tables of the mixed model, which also verifies the conclusion of Lemma 22. When $l = 5$ in Figure 7, because the sensitive attribute relationship has only 6 different values and $l$ is very close to 6, some records cannot be grouped in SAC and have to be deleted; the average confidence in the mixed model then deviates from the one in the raw data table, but the difference is small. The average confidence of strong rules in Rating deviates greatly from the one in the raw data table; because Rating breaks all the relationships between sensitive attributes, it is difficult for users to calculate the confidence of strong rules from the released tables of Rating. Figures 8 and 9 show similar results. Because the mixed model considers association rules between sensitive attributes, it provides more valuable relationships than Rating in the released tables.

6.4. The Probability of Privacy Disclosure. We refer to the method of paper [15] and analyze the probability of disclosure in this experiment. Assume the adversary has background knowledge about strong rules. Because the records containing no strong rules satisfy $l$-diversity according to [17] and the previous analysis, we study the safety of records that contain strong rules in the mixed model and Rating and analyze the probability that these records can be determined to contain known strong rules from the released tables. In Figure 10, one axis represents the probability of disclosure and the other represents the number of records. The probability that records contain strong rules is 1/3 in the released tables of the mixed model; the mixed model ensures a maximum disclosure probability of $1/l$ for records, and the conclusion of Lemma 12 is verified. On the other hand, because Rating has not considered the relationship between sensitive attributes, the disclosure probabilities for records in the released tables of Rating are more than 80%.
Figure 11 shows a similar result: the disclosure probabilities for many records are more than $1/l$ ($l = 2$) in Rating, while the mixed model ensures a maximum probability of 1/2 for records. Here we discuss an extreme situation. Assume $A \Rightarrow B$ is a strong rule, and in the released tables of Rating the number of records that actually contain $A \Rightarrow B$ is $p$ and the number of records that may contain $A \Rightarrow B$ is $q$; obviously, $q \ge p$. But if $A$ or $B$ has very low frequency in the raw data table, $q$ may be very small or even equal to $p$. When $q = p$, the adversary can be sure which records contain $A \Rightarrow B$ in the released tables of Rating. This is why the disclosure probabilities for several records of Rating are 100% in Figure 11. These experiments show that the mixed model prevents the adversary from attacking the data table with related background knowledge and provides better protection for privacy.

6.5. Summary of Experiment. Section 6.1 verifies the efficiency of the improved SID creation. The results of Section 6.2 show that the additional information loss of the mixed model is acceptable. The analysis of Sections 6.3 and 6.4 shows that Rating cannot preserve strong rules in released tables, and as long as the adversary masters background knowledge about these strong rules, Rating is unsafe. On the other hand, the mixed model provides strong rules to users directly and ensures the security of privacy at the same time.

Summary
Most existing privacy protection methods for multiple sensitive attributes have not taken the inherent relationship between different sensitive attributes into account. In view of this, the paper presents an attack method that uses association rules to obtain users' privacy and, accordingly, presents a protection model. Through theoretical and experimental analysis, we show that the new protection model provides better protection for privacy and preserves useful relationships in released tables. Besides, to improve the efficiency of the algorithm, we present an improved SID creation method and show by experiment that it is more efficient.

4.2.1. Heuristic Algorithm for Choosing Buckets. This heuristic algorithm (Algorithm 13) is actually a stable sorting algorithm: it sorts sequence in descending order and chooses the first $l$ buckets from sequence. The algorithm is called after the sizes of the buckets have been updated. We first introduce some parameters, where sequence[$i$] refers to the $i$th ($1 \le i \le n$) bucket in the sequence:
sequence[$i$].value: the attribute value in sequence[$i$];
sequence[$i$].size: the size of sequence[$i$];
sequence[$i$].size$^{\sim}$: the size of sequence[$i$] after updating; if $1 \le i \le l$, sequence[$i$].size$^{\sim}$ = sequence[$i$].size − 1; if $l < i \le n$, sequence[$i$].size$^{\sim}$ = sequence[$i$].size;
sequence[$i$].position: the position of sequence[$i$], sequence[$i$].position = $i$;
sequence[$i$].position$^{\sim}$: the new position of sequence[$i$] after sorting.
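The update rule for size$^{\sim}$ means that, after an SID is created, the sequence consists of two runs that are each still sorted in descending order: the first $l$ buckets (each reduced by one) and the remaining buckets (unchanged). The sketch below restores a stable descending order by merging these two runs. It is a stand-in written under that observation, not a transcription of Algorithm 13, whose exact steps are not given in this text; the function name `resort_after_sid` is hypothetical.

```python
def resort_after_sid(sizes, l):
    """Re-sort bucket sizes after the first l buckets each lost one value.

    sizes: bucket sizes in descending order *before* the update.
    Returns a list of (original_position, updated_size) in stable descending order.
    """
    head = [(i, sizes[i] - 1) for i in range(l)]           # size~ = size - 1
    tail = [(i, sizes[i]) for i in range(l, len(sizes))]   # size~ = size
    merged, h, t = [], 0, 0
    while h < len(head) and t < len(tail):
        # Stable: on a tie the bucket with the earlier position stays first.
        if head[h][1] >= tail[t][1]:
            merged.append(head[h])
            h += 1
        else:
            merged.append(tail[t])
            t += 1
    merged.extend(head[h:])
    merged.extend(tail[t:])
    return merged

order = resort_after_sid([5, 4, 4, 4, 2], l=2)
print(order)
```

Only buckets that were tied with one of the first $l$ buckets can change places, which is why a merge of the two runs is enough here.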

Figure 3: The comparison of additional information loss, 2 sensitive attributes.

Figure 4: The effect of minimum confidence on additional information loss, $l = 3$, 2 sensitive attributes.

Figure 5: The effect of sensitive attribute number on additional information loss, $l = 4$.

Figure 6: The average confidence of strong rules with varying $l$, 2 sensitive attributes.

Figure 7: The average confidence of strong rules with varying $l$, 4 sensitive attributes.

Figure 8: The average confidence of strong rules with varying the number of records, $l = 3$, 2 sensitive attributes.

Figure 9: The average confidence of strong rules with varying the number of records, $l = 3$, 4 sensitive attributes.

Table 2: Rating published IDT for Table

Table 4: The SAC table.

This part divides the table SAC into several groups and anonymizes the records in the same group. In the algorithm of partition records, while table SAC is not empty, groups are generated (lines 1-2). For each empty group $G$, choose a record $t$ from table SAC as the first record of $G$ (line 3). Then choose $t^{\sim} \in$ SAC or $t^{\sim} \in$ IR such that $t^{\sim}$ does not have the same sensitive attribute values as $t$, and add $t^{\sim}$ to $G$, until the number of records in $G$ is not less than $l$ (lines 4-6). Within each group, the sensitive attribute values are permutated randomly in each cluster to break the linking between different clusters (line 8). That is to say, the positions of the cluster values are adjusted randomly. Finally, release SAC$^{\sim}$.

Now take Table 1 as an example; the process has three steps. Here, assume that $l = 2$ and min confidence = 0.75.
(1) Partition table: both $S_1$ and $S_2$ have four strong values, and $S_3$ has no strong values, so $S_{set}^{\sim} = \{S_1, S_3\}$. There are 4 records containing strong values in $S_{set}^{\sim}$, namely $t_1$, $t_5$, $t_6$, and $t_7$. These 4 records make up table SAC (Table 4) and are deleted from Table 1.
(2) Partition sensitive attributes: generate a new cluster $C_1$ and add $S_1$ to $C_1$; two sensitive attributes now remain in the sensitive attribute set. Because there is a strong rule $a_1 \Rightarrow b_1$ between $S_1$ and $S_2$, distance($S_1$, $S_2$) = 0, so $S_2$ is added to $C_1$. But both distance($S_1$, $S_3$) and distance($S_2$, $S_3$) are 1, so distance($C_1$, $S_3$) = 1 and $S_3$ cannot be added to $C_1$. The only remaining attribute $S_3$ makes up a cluster $C_2$ alone. Clustering is then over, and we get two clusters $C_1 = \{S_1, S_2\}$ and $C_2 = \{S_3\}$.
(3) Partition records: according to the grouping condition, a group cannot contain the same sensitive attribute values. $t_1$ and $t_2$ make up a group; similarly, $t_5$ and $t_9$, $t_6$ and $t_4$, and $t_7$ and $t_8$ make up groups (Table 5). After grouping, randomly permutate the cluster values within each group and release the table SAC$^{\sim}$ (Table 6). For example, in group 1, permutate the $C_1$ values $(a_1, b_1)$, $(a_4, b_2)$ randomly. Each group has two records, so after the random permutation $(a_1, b_1)$ may swap position with $(a_4, b_2)$, or both may remain in their original positions. Similarly, for $C_2$, permutate the $C_2$ values $c_1$, $c_2$ randomly.

After anonymization, the relationship between $S_1$ and $S_2$ is still preserved. On the other hand, the linking between $C_1$ and $C_2$ has been broken. So this method preserves the links between sensitive attributes in the same cluster and breaks the links between sensitive attributes from different clusters.
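The grouping and permutation steps can be sketched as follows. The sketch is illustrative only: it checks a candidate record against every record already in the group (the worked example's grouping condition), silently drops a seed record that cannot reach $l$ records, and uses hypothetical names (`partition_records`) and hypothetical record values rather than the contents of Table 1.

```python
import random

def partition_records(sac, ir, sens, clusters, l, rng):
    """Group SAC records (borrowing from IR if needed) and permute cluster values."""
    sac, ir = list(sac), list(ir)

    def compatible(rec, group):
        # No sensitive value may be shared with any record already in the group.
        return all(rec[s] != g[s] for g in group for s in sens)

    groups = []
    while sac:
        group = [sac.pop(0)]                  # first record of the new group
        for pool in (sac, ir):
            for rec in list(pool):
                if len(group) >= l:
                    break
                if compatible(rec, group):
                    group.append(rec)
                    pool.remove(rec)
        if len(group) >= l:                   # otherwise the record is suppressed
            groups.append(group)

    released = []
    for group in groups:
        out = [dict(r) for r in group]
        for cluster in clusters:
            # Values of one cluster move together, so links inside a cluster survive.
            vals = [tuple(r[s] for s in cluster) for r in group]
            rng.shuffle(vals)
            for r, tup in zip(out, vals):
                r.update(zip(cluster, tup))
        released.append(out)
    return released

def rec(i, a, b, c):
    return {"id": i, "S1": a, "S2": b, "S3": c}

sac = [rec("t1", "a1", "b1", "c1"), rec("t5", "a1", "b1", "c2"),
       rec("t6", "a1", "b1", "c3"), rec("t7", "a1", "b2", "c1")]
ir = [rec("t2", "a4", "b2", "c2"), rec("t4", "a2", "b3", "c1"),
      rec("t8", "a3", "b3", "c1"), rec("t9", "a2", "b3", "c3")]
released = partition_records(sac, ir, ["S1", "S2", "S3"],
                             [["S1", "S2"], ["S3"]], l=2, rng=random.Random(0))
for group in released:
    print(group)
```

Passing a seeded `random.Random` instance keeps the illustration reproducible; a real release would use an unpredictable source of randomness.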
(6) For each bucket in SDB, remove a value to the SID.
(7) For each bucket from {sequence[1], sequence[2], ..., sequence[$l - |SDB|$]}, remove a value to the SID;
(8) Call Algorithm 13 with sequence and $l - |SDB|$ as input;
//processes the residual attribute values
(9) For each value $v$ in nonempty buckets, find an SID which contains no value $v$, and remove $v$ to that SID. If no such SID can be found, value $v$ cannot be grouped.

Lemma 20. If bucket $b$ is a dangerous bucket in sensitive attribute $S_i$, then after completing a new SID, $b$ is still a dangerous bucket.
Proof. After completing a new SID, the frequency of $b$.value is $(b.\text{size} - 1)/(N - l)$, where $N$ is the size of domain($S_i$). Before generating the new SID, the frequency of $b$.value was $b.\text{size}/N$. If $(b.\text{size} - 1)/(N - l) \ge b.\text{size}/N$, Lemma 20 is proved. The left side of the inequality is equal to $(b.\text{size} - 1) \cdot N / ((N - l) \cdot N)$, and the right side is equal to $b.\text{size} \cdot (N - l) / (N \cdot (N - l))$; one only needs to prove that $(b.\text{size} - 1) \cdot N \ge b.\text{size} \cdot (N - l)$. Because $b$ was a dangerous bucket before generating the new SID, it satisfied $b.\text{size}/N \ge 1/l$, which gives $b.\text{size} \cdot l \ge N$ and hence $b.\text{size} \cdot N - N \ge b.\text{size} \cdot N - b.\text{size} \cdot l$, so we get $(b.\text{size} - 1) \cdot N \ge b.\text{size} \cdot (N - l)$.

Lemma 21. If bucket $b$ is a dangerous bucket, then after completing a new SID, $b$ is still one of the $l$ buckets which have the largest size.
Proof. Using proof by contradiction, suppose there is Set = {bucket$_1$, bucket$_2$, ..., bucket$_l$} such that each bucket $\in$ Set has larger size than $b$. According to Lemma 20, after generating a new SID, $b.\text{size}^{\sim}/(N - l) \ge 1/l$, so each bucket $\in$ Set also satisfies bucket.size$^{\sim}/(N - l) \ge 1/l$. The $l$ buckets of Set together with $b$ would then hold more than $N - l$ values, which contradicts the fact that only $N - l$ values remain.
Availability of Association Rules. In the Rating model, all the relationships between different sensitive attributes are broken. The new model presented in this paper improves on this, and all the strong rules are preserved in the released tables.
Lemma 22. If the association rule $A \Rightarrow B$ [confidence = $c$ ($0 \le c \le 1$)] satisfies $c \ge$ min confidence in the original data table, then in the released data tables the confidence of $A \Rightarrow B$ is still $c$.
Proof. The released data tables contain the SAC$^{\sim}$ table, the ID table, and the attribute table. The user can get the support degree of $A$, namely support($A$), from the SAC$^{\sim}$ table and the ID table. Because support($A$, $B$) = 0 in the attribute table, one only needs to get support($A$, $B$) from the SAC$^{\sim}$ table. So the confidence $c$ of $A \Rightarrow B$ is support($A$, $B$)/support($A$).
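The computation in the proof can be sketched as follows, assuming $A$ and $B$ belong to attributes of the same cluster so that their pairing survives the permutation in SAC$^{\sim}$. The function name `released_confidence`, the table encodings, and the toy data are hypothetical.

```python
def released_confidence(sac_released, idt, attr_a, a, attr_b, b):
    """Confidence of a => b recovered from the released tables.

    sac_released: records (dicts) of the SAC~ table.
    idt:          attribute name -> list of SIDs (each SID is a list of values).
    """
    # support(A): occurrences of a in SAC~ plus occurrences inside the SIDs.
    support_a = sum(r[attr_a] == a for r in sac_released)
    support_a += sum(sid.count(a) for sid in idt.get(attr_a, []))
    # support(A, B): only SAC~ can contain the pair, by construction of IR.
    support_ab = sum(r[attr_a] == a and r[attr_b] == b for r in sac_released)
    return support_ab / support_a if support_a else 0.0

# Toy release: three (a1, b1) records in SAC~, one more a1 hidden in an SID.
sac_released = [{"S1": "a1", "S2": "b1"}] * 3 + [{"S1": "a2", "S2": "b2"}]
idt = {"S1": [["a1", "a2", "a3"], ["a2", "a3", "a4"]]}
conf = released_confidence(sac_released, idt, "S1", "a1", "S2", "b1")
print(conf)
```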