A Clustering K-Anonymity Privacy-Preserving Method for Wearable IoT Devices

Wearable technology is one of the greatest applications of the Internet of Things. The popularity of wearable devices has produced massive amounts of personal (user-specific) data. Generally, data holders (manufacturers) of wearable devices are willing to share these data with others to obtain benefits. However, significant privacy concerns arise when the data are shared with third parties in an improper manner. In this paper, we first propose a specific threat model for the sharing of wearable devices' data. We then propose a clustering-based K-anonymity method that preserves the privacy of wearable IoT devices' data while guaranteeing the usability of the collected data. Experimental results demonstrate the effectiveness of the proposed method.


Introduction
Wearable technology is one of the greatest applications of the Internet of Things. For the past few years, wearable devices have seen explosive growth in popularity [1]. Along with such advancement, more sensors are available to record various aspects of our daily lives [2], influencing our lives in an imperceptible way.
However, security problems have appeared alongside the wide deployment of wearable devices. The most severe threat is the privacy leakage of wearable device data. After collecting data from smart terminals, data holders (manufacturers) of wearable devices are willing to share the data with application developers to enrich their services or obtain monetary benefits. Typically, the data collected by these devices contain abundant private information [3, 4]. In addition, when sharing the data recorded by human-carried wearable sensors, some personal information, such as age, height, and weight, may also be submitted under warrant [5]. Therefore, although the original intention of data sharing is positive, uncontrolled personal information may raise the risk of privacy disclosure.
To balance the benefit of data sharing against the risk of privacy disclosure, we emphasize that it is critical to share data in a privacy-preserving way. The privacy issues of wearable devices have already been raised. Previous research attempts to limit privacy disclosure mainly by establishing rigid laws [6]. Further, researchers have formulated strict standards for sharing personal data collected from wearable devices. These rules advocate that users are authorized to determine with whom their data will be shared and to supervise the use of their contributed data; the third party must be supervised during the disposal of the collected data. Although these laws play an important role in preserving privacy, vulnerabilities remain: it is hard to expect rigid laws to prevent tricky adversaries. On the other hand, encryption and identity authentication are the usual ways to preserve privacy. Although these methods have proved effective in many cases, they are impractical for sharing data with a third party whose identity is uncertain [4].
Sharing these data anonymously seems to be a better choice. K-anonymity is a successful method for data sharing owing to its simplicity and effectiveness. However, the data collected by wearable devices are often identifiable [7], which poses a severe threat to K-anonymity. Fortunately, certain characteristics of the dataset can be exploited by K-anonymity to enhance the security level of users' privacy. In this paper, we introduce a clustering-based K-anonymity method as the building block of privacy preserving for data contributed by wearable devices. Clustering K-anonymity assigns similar records to the same equivalent set, and the similarity among these records makes it harder to discriminate different identities than before. The notable contributions of this paper can be summarized as follows: (1) We analyze the potential vulnerabilities of existing privacy-preserving methods for shared wearable device data. (2) We propose a threat model that achieves deanonymity against the K-anonymity technique; besides, we point out the vulnerabilities of the technique and improve it by exploiting the inherent characteristics of wearable device data. (3) We evaluate the effectiveness of the proposed method with simulation experiments.
The rest of this paper is organized as follows. In Section 2, we review current research on privacy preserving of wearable device data. In Section 3, we introduce the attack model that exploits the vulnerabilities of previous anonymity methods. In Section 4, we describe the clustering K-anonymity designed to solve the privacy problem. In Section 5, we evaluate the performance of our improved method. Finally, we discuss and conclude the paper in Sections 5.3 and 6.

Related Work
The rapid growth of wearable devices provides massive amounts of personal data, which are usually gathered by the data holder. In some cases, data holders need to share data with others without compromising privacy while keeping the data usable. In this section, we summarize existing methods for data sharing from two aspects: privacy preserving for wearable device data and anonymous sharing.

Privacy Preserving for Wearable Devices Data.
There have been several studies on privacy preserving of wearable device data. Current data holders of wearable devices protect users' privacy mainly through rigid rules. As Figure 1 shows, the data collected by wearable devices are typically stored in a database owned by the data holder. A third party who wants to acquire users' data must first obtain the users' permission. Users of wearable devices determine whether to share their personal data, and they are authorized to trace the use of their personal data. The third party must conform to the users' will, must not violate these rules, and must act with honest intentions.
These rules play an important role in preserving privacy, but several vulnerabilities remain. On the one hand, we cannot guarantee that authentication works well: if access control is bypassed by a malicious actor, users' privacy is disclosed. On the other hand, most users of wearable devices are non-experts who do not understand the significance of their data, and their awareness of privacy preserving is poor, making them vulnerable to potential attacks.

Anonymous Sharing.
Encryption is a traditional and widely adopted way to preserve privacy, but it is not designed for data sharing. The drawback of encryption in data sharing is secret key distribution, as we cannot guarantee the reliability of the third party. Hence, preserving privacy by encryption alone is impractical. It is critical to find a way to preserve privacy even after a malicious third party has acquired the data.
Differential privacy is an excellent method to preserve privacy even when the overall background knowledge has been disclosed. The idea of this method is to preserve privacy by adding moderate noise [8]. However, we note that the added noise inevitably reduces the usability of the data.
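As a minimal illustration of this noise-addition idea, consider a sketch of the Laplace mechanism; the sensitivity and epsilon values below are arbitrary assumptions, not taken from [8]:

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a value with Laplace noise calibrated to sensitivity/epsilon."""
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_value + noise

# The stronger the privacy guarantee (smaller epsilon), the larger the noise,
# and hence the lower the usability of the released statistic.
print(laplace_mechanism(true_value=42.0, sensitivity=1.0, epsilon=0.5))
```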
In 2002, Sweeney proposed K-anonymity [9]. K-anonymity requires each record to be indistinguishable from at least K − 1 other records in the quasi-identifier domain. In the K-anonymity model, three categories of attributes are defined for each record: (i) attributes that clearly identify individuals, such as Social Security Number and ID number, are defined as identifiers; (ii) insensitive attributes that can be combined to jointly identify individuals, such as name, sex, age, and Zip code, are defined as quasi-identifiers; (iii) attributes that are considered sensitive, such as salary and illness, are called sensitive attributes. In this paper, the sensitive attribute is the time series collected by wearable devices. For convenience, we use ID, QI, and SD to represent identity, quasi-identity, and sensitive data, respectively.
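As a concrete illustration, a record with these three attribute categories might be modeled as follows (a minimal sketch; the field names are hypothetical and not taken from the paper):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class WearableRecord:
    # Identifier (ID): removed before sharing.
    user_id: str
    # Quasi-identifiers (QI): individually harmless, jointly identifying.
    age: int
    height_cm: int
    weight_kg: int
    # Sensitive data (SD): the time series collected by the device.
    accel_series: List[float]
```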
K-anonymity is an appropriate approach to sharing data anonymously. According to the principle of K-anonymity, data holders cut the linkage between ID and SD before sharing, so that ID information cannot be matched accurately with SD information. Moreover, such processing causes no information loss in the SD domain. However, an inherent vulnerability of K-anonymity means that naive K-anonymity cannot meet our requirements. In the next section, we discuss the threat model against anonymous data sharing of wearable devices.

Threat Model
To preserve the privacy of wearable devices' data, we must first understand the threats against it. In this section, we first discuss the link attack and then introduce the process of deanonymizing the sensitive data. We present the detailed process of privacy disclosure in Section 3.3 and illustrate the whole threat model with an example in Section 3.4.

Link Attack.
Setting aside the privacy requirement, datasets that need to be shared are always composed of several quasi-identifiers and sensitive data, without the ID information. Therefore, we can define the structure of all records in the form {QI_1, QI_2, ..., QI_m, SD}, while the background knowledge can be denoted as {ID, QI_1, QI_2, ..., QI_m}. Such information can be acquired by gathering other insensitive information. Combining these pieces of information through the quasi-identifier domain {QI_1, QI_2, ..., QI_m} yields information in the form {ID, QI_1, QI_2, ..., QI_m, SD}, which indicates that privacy is disclosed. Figure 2 gives an example of a link attack with m = 3. In Figure 2, every identifier points to a sensitive data unit, so private information within the sensitive data can be relinked back to a specific identity. As a result, the sensitive data of specific identities are disclosed.
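A minimal sketch of this join (using pandas; the column names and toy values are assumptions for illustration):

```python
import pandas as pd

# Shared dataset: quasi-identifiers plus sensitive data, no explicit ID.
shared = pd.DataFrame({
    "age": [24, 31], "height": [181, 165], "weight": [71, 52],
    "sd": ["trace_a", "trace_b"],  # sensitive time series (placeholders)
})

# Background knowledge gathered elsewhere: identities plus the same QIs.
background = pd.DataFrame({
    "id": ["Alice", "Bob"],
    "age": [24, 31], "height": [181, 165], "weight": [71, 52],
})

# Joining on the quasi-identifier domain relinks ID to SD.
linked = background.merge(shared, on=["age", "height", "weight"])
print(linked)  # each id now points to its sensitive data
```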

Deanonymity.
In Section 3.1, we briefly introduced the link attack, which can be well addressed by K-anonymity. However, the threat is more severe when the sensitive data themselves are identifiable. In the link attack, the adversary combines the background knowledge {ID, QI_1, QI_2, ..., QI_m} with the shared dataset {QI_1, QI_2, ..., QI_m, SD} through the quasi-identifiers {QI_1, QI_2, ..., QI_m}. Data holders can generalize the quasi-identifiers {QI_1, QI_2, ..., QI_m} according to the principle of K-anonymity to prevent privacy disclosure by link attack. But when the sensitive data are identifiable, K-anonymity can hardly preserve privacy, since it was designed with little consideration of this form of privacy disclosure.
Wearable devices' data may be identifiable (e.g., GPS data or data collected by tri-axis accelerometers). GPS data are obviously identifiable: given the different traces of people, it is easy to infer a user's identity. The data collected from tri-axis accelerometers seem insensitive, but they can be used to discriminate identities by means of machine learning; several studies have recognized identities from tri-axis accelerometer data [9-11].
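A minimal sketch of such an identity classifier (using scikit-learn; the windowing, features, and random data are simplified assumptions, not the exact pipelines of [9-11]):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def gait_features(window: np.ndarray) -> np.ndarray:
    """Simple statistical features per axis of a (n_samples, 3) window."""
    return np.concatenate([window.mean(axis=0), window.std(axis=0),
                           np.abs(np.diff(window, axis=0)).mean(axis=0)])

# Hypothetical training data: accelerometer windows labeled with identities.
rng = np.random.default_rng(0)
windows = rng.normal(size=(200, 128, 3))   # 200 windows, 128 samples, 3 axes
labels = rng.integers(0, 14, size=200)     # 14 subjects, as in Section 3.4

X = np.stack([gait_features(w) for w in windows])
clf = RandomForestClassifier(n_estimators=100).fit(X, labels)

# Within a small equivalent set, the adversary only needs to tell a handful
# of candidates apart, which is far easier than 1-of-N recognition.
print(clf.predict(X[:5]))
```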
Recognizing identities with machine learning methods is a critical threat to privacy. One may argue that, given a large number of people, such an attempt would be too complex to be practical. However, the link attack can be used here to shrink the data scope.

Whole Threat Model.
In Sections 3.1 and 3.2, we introduced the link attack and the deanonymity of wearable device data. The whole process can be described as follows:
(1) Collect the shared data {QI_1, QI_2, ..., QI_m, SD}.
(2) Gather the background knowledge {ID, QI_1, QI_2, ..., QI_m} from other insensitive sources.
(3) Launch a link attack on the quasi-identifier domain to narrow each target down to an equivalent set ES.
(4) Recognize the identity of each person in ES by ID through machine learning methods.
(5) Rebuild the correspondence between ID and SD.
After this processing, the correspondence between users' identities and their sensitive data is rebuilt, unavoidably resulting in the disclosure of privacy. The threat model is shown in Figure 3.

An Example of Privacy Disclosure.
For example, as Table 1 shows, Alice owns a wearable device, and the manufacturer of the device collects the data produced by the device together with information about her age, height, and weight. The data holder then shares a dataset (shown in Table 1) that contains Alice's data. The adversary, Evil, obtains this dataset and knows that Alice is 181 cm tall, weighs 71 kg, and is 24 years old, so Evil can readily obtain Alice's sensitive data by combining the dataset with this background knowledge.
Following K-anonymity, the data holder cuts the linkage between identity and sensitive data by generalizing the quasi-identifiers before sharing. Table 2 shows the 2-anonymity result of Table 1. With Table 2, it is hard to recognize Alice's identity by a link attack. However, the data contained in SD can still disclose Alice's identity: if we extract proper features from these data and feed them into a suitable classifier, the identity can be recognized.
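A minimal sketch of this generalization step (the range widths and helper function are illustrative assumptions):

```python
def generalize(value: int, width: int) -> str:
    """Replace an exact QI value with the range it falls into."""
    low = (value // width) * width
    return f"[{low}-{low + width - 1}]"

# Alice's exact quasi-identifiers...
age, height, weight = 24, 181, 71
# ...become coarse ranges, so a link attack can no longer single her out.
print(generalize(age, 10), generalize(height, 10), generalize(weight, 10))
# -> [20-29] [180-189] [70-79]
```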
Figure 4 shows the discriminating rate. We divide the 14 subjects into equivalent sets directly according to their sequence, with each equivalent set containing 2 records. The discriminating rate clearly indicates a severe threat of privacy disclosure.

K-Anonymity Scheme
In our work, we adjust the division of records so that it is hard to discriminate identities within each equivalent set, thereby preserving privacy. We find that, in wearable device datasets, the quasi-identities are always relevant to the sensitive data. For example, a GPS dataset contains quasi-identities about address, and a gait dataset contains quasi-identities such as height, age, and weight. In this section, we assign records with similar quasi-identifiers to the same equivalent set. Because of the relevance between SD and QI, it becomes harder than before to recognize a specific identity within an equivalent set. We clarify the meaning of clustering K-anonymity in Section 4.1 and describe its details in Sections 4.2 and 4.3.

K-anonymity is a general concept for sharing data in a privacy-preserving way: a dataset is divided into several equivalent sets according to K-anonymity, and each set contains at least K and no more than 2K records. However, different divisions produce different effects on security.

Meaning of Clustering K-Anonymity.
In this paper, to preserve users' privacy, we expect the records in the same equivalent set to be as similar as possible. We observe that the QIs of shared datasets are usually closely related to the SD. For example, a dataset that contains GPS data may share address information, and a dataset of tri-axis accelerometer data shares information such as age, height, and weight. In such datasets, the quasi-identifiers are Zip code, age, height, and weight. We process the dataset with clustering and group records with similar quasi-identifiers into the same equivalent class. The rationale is that it is easy to discriminate people's identities because of the huge trace differences among different people, whereas clustering similar records by quasi-identifiers (e.g., address information) reduces these differences. The more similar the records in one equivalent class, the lower the risk of privacy disclosure.

Distance Metric.
The similarity between two records directly determines the division of the dataset. Detailed descriptions of the various kinds of data in datasets are given in [12-14]; all of these works transform nonnumeric data into numerical values for further processing. Without loss of generality, we consider the case where all data are numerical values.
In this paper, the similarity of two records is calculated by measuring the distance between them. Intuitively, a larger distance indicates a smaller similarity, and vice versa. Let the quasi-identity domains of records x and y be {x_1, x_2, ..., x_m} and {y_1, y_2, ..., y_m}, respectively, where x_i and y_i denote the i-th quasi-identity, and let {w_1, w_2, ..., w_m} denote the weights, where w_i is the weight of the i-th quasi-identifier. The distance d(x, y) between records x and y can then be defined as

d(x, y) = Σ_{i=1}^{m} w_i |x_i − y_i|. (1)

Details of the Clustering K-Anonymity.

In this section, we discuss the details of clustering K-anonymity in Algorithm 1 and then analyze its time complexity. First, we cluster the records of the private table to be published and assign similar records to the same equivalent set. Then, we unify the quasi-identifiers within each cluster by generalization and suppression operations. The output of the algorithm is a table that satisfies the principle of K-anonymity, in which all records in the same equivalent set are similar to one another. In this way, it is harder to recognize users' identities within one equivalent set, so the privacy of these subjects is more secure. We show the effectiveness of our method in Section 5.2.
The process of the clustering K-anonymity algorithm is shown in Algorithm 1.
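Since Algorithm 1 itself appears as a figure, the following is a minimal sketch of one plausible reading of its clustering phase, based on the description above and the complexity analysis below (the seed selection, tie-breaking, and leftover handling are assumptions):

```python
import numpy as np

def distance(x: np.ndarray, y: np.ndarray, w: np.ndarray) -> float:
    """Weighted distance over the quasi-identifier domain, as in (1)."""
    return float(np.sum(w * np.abs(x - y)))

def cluster_k_anonymity(qi: np.ndarray, k: int, w: np.ndarray) -> list:
    """Greedily group records into equivalent sets of at least k members.

    qi: (n, m) array of numeric quasi-identifiers, one row per record;
    assumes n >= k. Returns a list of index lists, one per equivalent set.
    """
    remaining = set(range(len(qi)))
    clusters = []
    while len(remaining) >= k:
        # Seed each cluster with the tuple farthest from the dataset centroid.
        centroid = qi[list(remaining)].mean(axis=0)
        seed = max(remaining, key=lambda i: distance(qi[i], centroid, w))
        cluster = [seed]
        remaining.remove(seed)
        # Fill the cluster with the k-1 tuples nearest to the seed.
        while len(cluster) < k and remaining:
            nearest = min(remaining, key=lambda i: distance(qi[i], qi[seed], w))
            cluster.append(nearest)
            remaining.remove(nearest)
        clusters.append(cluster)
    # Assign leftover records (< k of them) to their nearest existing cluster.
    for i in remaining:
        best = min(clusters, key=lambda c: distance(qi[i], qi[c[0]], w))
        best.append(i)
    return clusters
```

After this clustering phase, the unifying phase would generalize or suppress the quasi-identifiers within each returned cluster so that they become identical.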

Time Complexity Analysis.
Although clustering K-anonymity preserves users' privacy well, its feasibility should be further verified.
In the clustering phase, the time complexity of selecting the nearest tuple is O(n), and the time complexity of selecting the farthest tuple is also O(n), so the overall complexity of the clustering phase is O(n²). In the unifying phase, we first check all equivalent sets in the dataset and then check each tuple in each equivalent set; the overall time complexity of this phase is therefore also O(n²). Hence, the time complexity of the whole algorithm is O(n²), which demonstrates that clustering K-anonymity can be achieved within finite time.

Performance Evaluation
In this section, we evaluate the performance of clustering K-anonymity, focusing mainly on security. We concentrate on data collected by tri-axis accelerometers because of their popularity [9-11, 15] and their seemingly insensitive nature. The experimental results verify the effectiveness of clustering K-anonymity.

Experiment Settings.
In this experiment, to show the effectiveness of clustering K-anonymity, we compare 4 kinds of K-anonymity: Partial Datafly K-anonymity, Overall Datafly K-anonymity [16], μ-Argus K-anonymity [17], and clustering K-anonymity, which differ in how they divide the dataset. We aim to demonstrate that the division of the dataset influences the security of privacy.
The measurement of the distance d(x, y) between two records x and y is a critical factor influencing the final result. Here, we define the distance d(x, y) as

d(x, y) = w_1 |x_1 − y_1| + w_2 |x_2 − y_2| + w_3 |x_3 − y_3|, (2)

where the attributes indexed 1, 2, and 3 denote the age, height, and weight information, respectively; these attributes are the quasi-identifiers of the dataset. We determined the weights over several rounds of experiments, and this group of parameters is effective in influencing the final result. Note that adopting a more accurate model would yield more accurate results.
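For instance, a toy computation of this distance under the assumption of equal weights (not the tuned values used in the experiments):

```python
# Quasi-identifiers: (age, height, weight); weights are illustrative only.
w = (1.0, 1.0, 1.0)
alice = (24, 181, 71)
other = (25, 176, 69)

d = sum(wi * abs(a - b) for wi, a, b in zip(w, alice, other))
print(d)  # -> 8.0: a small distance, so the records suit one equivalent set
```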
In this experiment, we perform deanonymity on the data collected from tri-axis accelerometer sensors. Since the goal of our method is to preserve privacy, a lower discriminating rate within an equivalent set indicates higher security. We present the experimental results in Section 5.2.

Comparative Results and Analysis.
Figure 5 shows the discriminating rate of the identities in each equivalent set, with the dataset divided according to the principle of 2-anonymity. The discriminating rate of clustering 2-anonymity is clearly lower than that of the other 2-anonymity methods, so we can claim that clustering 2-anonymity is the most secure of the four methods considered. Figures 6, 7, 8, and 9 show the results of 3-anonymity achieved by the 4 methods mentioned above, respectively, again in terms of the discriminating rate of the identities. The discriminating-rate distribution of clustering 3-anonymity tends to be lower than those of the other methods: more than half of the discriminating rates of clustering 3-anonymity are below 60%, while for the other methods most discriminating rates exceed 60%. This result demonstrates that clustering 3-anonymity is more secure than the others.
On the other hand, the clustering -anonymity brings no change to the sensitive data domain, so the usability of sensitive data could be guaranteed.
Analysis. In this experiment, the SD of all records stays invariant. Because the equivalent sets are composed differently, the discriminating rate in each equivalent set differs. Reasonable assignment of records improves the security level of clustering K-anonymity.

Discussion.
In this section, we discuss some interesting open research issues.

Figure 1: An overview of the privacy-preserving rules.

Figure 3: An overview of the threat model.

Figure 4: The discriminating rate of identity.

Table 2: 2-Anonymity result of the original data.