Towards Self-Awareness Privacy Protection for Internet of Things Data Collection

The Internet of Things (IoT) is an emerging global Internet-based information architecture used to facilitate the exchange of goods and services. IoT-related applications aim to bring technology to people anytime and anywhere, with any device. However, the use of IoT raises a privacy concern because data are collected automatically from network devices and objects embedded with IoT technologies. In current applications, the data collector is a dominant player who enforces a security protocol that the data owners cannot verify. In view of this, some respondents might refuse to contribute their personal data or might submit inaccurate data. In this paper, we study a self-awareness data collection protocol that raises the confidence of the respondents when they submit their personal data to the data collector. Our self-awareness protocol requires each respondent to help others in order to preserve his own privacy. The communication (between the respondents and the data collector) and the collaboration (among respondents) in our solution are performed automatically.


Introduction
The Internet of Things (IoT) is an emerging global Internet-based information architecture used to facilitate the exchange of goods and services. The concept of IoT is to allow living objects (humans or animals), devices (sensors), or objects with embedded technologies to automatically transfer data over communication networks (wired or wireless) without human-to-human or human-to-computer interaction. IoT aims to utilize and extend the benefits of the Internet, such as always-on connectivity, data sharing, and remote access capabilities [1].
IoT enables data collection in every aspect of our life. Data collected from a smart metering application allow the utility provider to analyze and improve its services. These data can also help users to be aware of their energy consumption and of possible energy saving strategies. In an underwater environment, a smart meter is particularly important because information can be detected, gathered, and sent to the sensor [2].
Let us consider the following scenario. A practitioner (data collector) would like to collect medical data from his patients (respondents) with implanted medical devices. Since medical data are highly sensitive, respondents must be aware of the data to be collected. There are two main paradigms to protect the patient's privacy in this scenario. The first paradigm relies on the respondent's trust in the data collector, while the second paradigm depends on the respondent's anonymity. If the respondents do not have confidence in the data collector, they may refuse to submit data or may provide inaccurate data. If the submitted data are not genuine, the data collector will face a data utility problem because the results of analyzing the collected data will not be accurate. In the second paradigm, we should prevent the reidentification problem. For instance, if the collected data are used for research purposes, the data collector should not be able to link any of the collected data to the real identity of any patient.
1.1. Challenges of IoT.
Wireless sensor networks have created a significant impact throughout society [3]. Advances in wireless communication technology (e.g., efficient resource management [4] and performance improvement [5] in wireless networks) enable the development and implementation of IoT applications. IoT-related applications include traffic congestion detection and waste management in smart cities, remote diagnostics in patient surveillance systems (e.g., ubiquitous healthcare [6,7]), and storage condition monitoring in supply chain control.
Along with its potential benefits, the usage of IoT also raises privacy concerns for the data owners. In particular, real-time data collection and data analysis in IoT applications may compromise the privacy of the data owner. In practice, new data arrive continuously and up-to-date data should be used for analysis. Data collected at different times allow malicious providers to learn extra knowledge by cross-examining the data within a targeted timeframe. Therefore, a secure and privacy-aware protocol should be implemented in IoT when data are collected automatically. Some new security and privacy challenges can be found in [8].
The development of radio frequency identification (RFID) technologies and the advances in network communication technologies motivated the formation of IoT [9]. Physical objects called u-things, which are embedded in or connected to communication networks, sensors, and computers, are commonly found in our daily life [10]. In the context of IoT, u-things should be able to act automatically (e.g., autodetection and data transfer) and adaptively. The construction of smart u-things involves the following 7 challenges [11,12]: (i) surrounding situations (context), (ii) users' needs, (iii) things' relations, (iv) common knowledge, (v) self-awareness, (vi) looped decisions, and (vii) ubiquitous safety (UbiSafe).
The ultimate goal of any ubiquitous intelligence is to make the u-things behave trustworthily in both other-aware and self-aware manners, to some degree and under some circumstances [13]. Therefore, it is important to design a self-awareness protocol that helps data owners to protect their privacy.
In this paper, we focus on the self-awareness challenge. In particular, we design a self-awareness protocol to increase the confidence of the data owner when the smart u-things automatically submit their data to the data collector.

Problem Statement.
There are two challenges we aim to address in this work. Firstly, we want to protect the identity of each data owner from the data collector before and after the data collection process. Secondly, and more importantly, we want to guarantee the usefulness of the collected data by increasing the confidence of the data owner.
The first challenge can be solved by using anonymity technologies such as the onion routing (Tor) [14], anonymous proxy servers [15], and mix networks [16,17]. These technologies are still under active investigation, and they focus mainly on network traffic analysis, anonymous communication channels, and private information retrieval. Since our aim in this paper is not to design any specific anonymity technology, we refer readers to [15,18] for the usage of these technologies.
The second challenge requires each respondent to help others in order to preserve his own privacy. This idea is motivated by the coprivacy concept in [19,20]. Coprivacy (or cooperative privacy) holds that the best option for a party to achieve its own privacy protection is to help another party achieve its privacy. The formal definition of coprivacy and its generalizations can be found in [19].

Our Contributions.
In this paper, we propose a self-awareness protocol to facilitate data collection in IoT-related applications. Instead of placing full trust in the utility provider (data collector), we allow each data owner (respondent) to learn the protection level provided by the data collector before the data submission process. We summarize our contributions as follows.
(i) We propose a privacy-preserving approach that enables the respondents to learn the anonymity protection level they will receive from the data collector before the data submission.
(ii) Our notion of self-awareness protection can be used to increase the confidence of respondents in the data collection process. Hence, respondents will feel comfortable submitting their genuine data, while the data collector can ensure the usefulness of the collected data.
1.4. Organization.
The rest of this paper is organized as follows. The background and related work for this research are presented in Section 2. We describe the technical preliminaries of our solution in Section 3. We present our solution in Section 4, followed by the analysis of correctness, privacy, and efficiency and a discussion in Section 5. Our conclusion is in Section 6.

Background and Related Work
In particular, the Fair Information Practice Principles (FIPPs) guideline aims to protect consumer rights, such as how online entities should collect and use personal data [21]. The five principles of FIPPs are as follows [22].
(1) There must be no personal data record-keeping systems whose very existence is secret.
(2) There must be a way for a person to find out what information about the person is in a record and how it is used.
(3) There must be a way for a person to prevent information about the person that was obtained for one purpose from being used or made available for other purposes without the person's consent.
(4) There must be a way for a person to correct or amend a record of identifiable information about the person.
(5) Any organization creating, maintaining, using, or disseminating records of identifiable personal data must assure the reliability of the data for their intended use and must take precautions to prevent misuses of the data.
Based on the above principles, we now analyze the privacy protection in current IoT. Since data are collected automatically, it is hard for the data owners to ensure that their privacy is protected. In most cases, utility providers design a series of mechanisms to guarantee the privacy protection of the collected data. However, we found that data owners are generally not able to verify the mechanisms offered by the provider. Therefore, a self-awareness protocol should be available for the automatic data collection process.

Anonymous Data Collection.
In general, online data collection is a process that involves collaboration between a trusted party (data collector) and a number of data owners (respondents). Due to privacy concerns, respondents might refuse to contribute their personal data or might submit inaccurate data to the data collector. Therefore, the data collector needs to ensure the privacy of the submitted data through a series of secure mechanisms. However, the protection level provided by the data collector is hard for the respondents to verify.
Often, data collected from the respondents will be used for research or data analysis. The release of the collected data causes a privacy issue in data publishing, in particular when it involves the republication of the same data in a given period [23]. Two settings can be observed when the data are released to the data recipient. If the data recipient is a third party, data must be released in an anonymous form without compromising the privacy of the respondents. Let us consider a scenario where a hospital (data collector) wishes to publish patients' records to a research institute (data recipient) for data analysis. In common practice, all the explicit personal identity information (PII), such as name and social security number, will be removed from the original dataset before it is released to the data recipient. However, removing PII alone does not preserve privacy.
Data anonymization is an interesting solution to protect the privacy of the respondents in this setting. Sweeney proposed the $k$-anonymity model to address the linking attack [24]. The concept of $k$-anonymity [25] is that each released record is indistinguishable from at least $(k - 1)$ other records. However, Machanavajjhala et al. [26] found $k$-anonymity to be vulnerable to background knowledge attacks.
In the literature, techniques such as $(\alpha, k)$-anonymity [27,28], $\ell$-diversity [26], and $t$-closeness [29] have been proposed to enhance the $k$-anonymity model. We note that these techniques assume that $k$-anonymity has been achieved before additional techniques are applied to enhance the anonymity protection of the released data. For instance, the $(\alpha, k)$-anonymity model assumes that all the released data adhere to $k$-anonymity. In addition, it requires that the frequency of the sensitive value in any quasi-identifier is less than $\alpha$ after the anonymization [27]. In the $\ell$-diversity model, the sensitive attribute in the $k$-anonymous table is well represented by $\ell$ values such that the frequency of each sensitive value is at most $1/\ell$. A survey of recent attacks and privacy models in data publishing can be found in [30].
In this paper, we consider the second setting, where the data analysis is performed by the data collector. This scenario is more complex to deal with because the data collector has full access to all raw data from the respondents. Therefore, we need to design a protocol that increases the confidence of the respondents before they submit their records to the data collector. In other words, the respondents should be aware of the protection level that they will receive from the data collector after the data submission.

Related Works
Various self-oriented privacy protections have been proposed in the literature. Self-enforcing privacy (SEP) for e-polling was proposed in [31]. The idea of SEP is to force the pollster to protect the respondents' privacy by allowing the respondents to trace their data after the submission. If the pollster releases the poll results, the respondents can indict the pollster by using the evidence they obtained during the data collection process. A fair indictment scheme for SEP can be found in [32].
The research most closely related to our work is the respondent-defined privacy protection (RDPP) for anonymous data collection proposed in [33]. The basic idea of RDPP is to allow the respondents to specify the level of protection they require before providing any data to the data collector. For instance, a number of respondents (minimum threshold) must satisfy the constraint chosen by respondent $i$ before he agrees to submit the data. In their protocol, respondents are aware of the minimum level of privacy protection they will receive before submitting their dataset to the data collector. Instead of relying on the data collector to guarantee the privacy protection, the respondents are free to define their preferred protection level.
In this paper, we do not consider indictment for our protocol because the data analysis is done by the data collector. Instead of allowing the respondents to freely define their own privacy levels, we assume that respondents are willing to submit their data if they can verify the protection level offered by the data collector.

Technical Preliminaries
4.1. Homomorphic Encryption Scheme.
We use a homomorphic encryption scheme (i.e., Paillier [34]) as our primary cryptographic tool. Let $\mathrm{Enc}_{pk}(x)$ denote the encryption of $x$ with the public key $pk$. Given two ciphertexts, $\mathrm{Enc}_{pk}(x_1)$ and $\mathrm{Enc}_{pk}(x_2)$, there exists an efficient algorithm $+_h$ to compute $\mathrm{Enc}_{pk}(x_1 + x_2)$.
A quasi-identifier (QI) is a minimal set of attributes in the collected dataset $T$ that can be joined with external information to uniquely distinguish individual records [24]. Note that the quasi-identifier can be either categorical or continuous data, while the sensitive attribute is categorical data from its domain (e.g., disease).
Definition 1 (quasi-identifier). A quasi-identifier (QI) is a minimal set of attributes that can uniquely distinguish tuples in $T$. The QI for Table 1 is {Gender, Age, Zip}, and it can be generalized as {Male, 10-16, 278**}.
Definition 2 ($k$-anonymity). $T$ is said to satisfy $k$-anonymity with respect to QI if and only if each combination of QI attribute values appears in at least $k$ records of $T$.
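Definition 2 can be checked mechanically. The following is a minimal sketch, not part of the original protocol: it assumes records are Python dictionaries of already generalized values, and the function name `satisfies_k_anonymity` and the sample rows are hypothetical.

```python
from collections import Counter


def satisfies_k_anonymity(records, quasi_identifier, k):
    """Return True if every QI value combination occurs in at least k records."""
    groups = Counter(
        tuple(rec[attr] for attr in quasi_identifier) for rec in records
    )
    return all(count >= k for count in groups.values())


# Hypothetical generalized records (illustrative values only).
records = [
    {"Gender": "Male", "Age": "10-16", "Zip": "278**", "Disease": "Flu"},
    {"Gender": "Male", "Age": "10-16", "Zip": "278**", "Disease": "Asthma"},
    {"Gender": "Female", "Age": "20-30", "Zip": "279**", "Disease": "Flu"},
]
QI = ("Gender", "Age", "Zip")

print(satisfies_k_anonymity(records, QI, 2))      # False: one group has a single record
print(satisfies_k_anonymity(records[:2], QI, 2))  # True: both records share one QI group
```

The sensitive attribute (here `Disease`) is deliberately ignored, since the definition only constrains the quasi-identifier columns.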
Definition 3 (self-awareness privacy). Each respondent $i$ is said to achieve self-awareness privacy if he learns the protection level (e.g., $k$-anonymity) provided by the data collector. At the end of the protocol execution, each respondent remains anonymous to others, and the data collector is not able to identify any of the respondents with probability more than 0.5.

Components.
Our self-awareness data collection protocol consists of the following three components.
(i) Data collector: an authorized party who wants to collect data from a group of respondents via wired or wireless network.
(ii) Respondent: participant in the data collection process who is also a candidate to submit his/her record to the data collector.
(iii) The onion router (Tor): an anonymous network used to protect the respondent's privacy such that the data collector cannot monitor the activity flows of any respondent.
We show the interactions among the components in our solution in Figure 1.We assume that the respondents and the data collector are equipped with ubiquitous sensors to detect, communicate, and execute the protocol.

Adversary Model.
We assume that both the data collector and the respondents are semihonest players (also known as honest-but-curious).Semihonest players follow the protocol faithfully but may try to discover extra information during the protocol execution.
In our protocol design, the data collector must follow the protocol faithfully in order to ensure that all respondents are willing to participate in the data collection process.For the same reason, all respondents should be semihonest in order to ensure that the privacy protection level offered by the data collector can be achieved.

Notations Used.
The notations used hereafter in this paper are summarized in Notations section.

Self-Awareness Data Collection Protocol
5.1. Protocol Idea.
The basic idea of our protocol is to allow the respondents to know the protection level they will receive from the data collector before the data submission process [35]. In our design, the data collector releases a set of quasi-identifiers $\mathrm{QID} = \{\mathrm{QI}_1, \mathrm{QI}_2, \ldots, \mathrm{QI}_q\}$ for $T$ and defines the protection level it wants to provide to the respondents (e.g., a threshold $k$). Note that a larger $k$ will make the respondents feel more comfortable about submitting their records. We also require the respondents to collaborate to find the number of records in $(D_1 \cup D_2 \cup \cdots \cup D_n)$ that match the quasi-identifiers determined by the data collector. We assume that the communication between the data collector and the respondents goes through a mix network such as Tor [14]. Note that the communication (between the respondents and the data collector) and the collaboration (among respondents) in our solution run automatically. We show the overview of our proposed solution in Figure 1.
In the following sections, we describe our self-awareness data collection protocol in detail.

Our Protocol.
In order to participate in the data collection process, all players can precompute some information to be used during the protocol execution. For example, each respondent $i$ can generate a cryptographic key pair $(pk_i, pr_i)$, where $pk_i$ is the public key and $pr_i$ is the corresponding private key. Next, the respondents encrypt their personally identifiable information (PII), such as name or social security number, by using $pk_i$. The encrypted PII will be used as the public identity $I_i$ of respondent $i$. This public identity is important for other respondents to identify the owner of a given public key. Each respondent then submits his public identity and encryption key to the data collector via a Tor network. Let us assume that $n$ respondents participate in the data collection process; hence, the data collector will receive $n$ submissions $(I_1, pk_1), (I_2, pk_2), \ldots, (I_n, pk_n)$ from the respondents.
Before the data collection begins, the data collector is required to define a set of $q$ quasi-identifiers denoted as $\mathrm{QID} = \{\mathrm{QI}_1, \mathrm{QI}_2, \ldots, \mathrm{QI}_q\}$ for the dataset $T$ to be collected and to determine the protection level (e.g., the $k$ value) for the respondents.

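Table 2: Outcome table released by the data collector.

| | $V_1$ | $V_2$ | $\cdots$ | $V_n$ |
|---|---|---|---|---|
| $u_1$ | $\mathrm{Enc}_{pk_1}(s_1^1)$ | $\mathrm{Enc}_{pk_2}(s_1^2)$ | $\cdots$ | $\mathrm{Enc}_{pk_n}(s_1^n)$ |
| $u_2$ | $\mathrm{Enc}_{pk_1}(s_2^1)$ | $\mathrm{Enc}_{pk_2}(s_2^2)$ | $\cdots$ | $\mathrm{Enc}_{pk_n}(s_2^n)$ |
| $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
| $u_n$ | $\mathrm{Enc}_{pk_1}(s_n^1)$ | $\mathrm{Enc}_{pk_2}(s_n^2)$ | $\cdots$ | $\mathrm{Enc}_{pk_n}(s_n^n)$ |
| Aggregate | $\mathrm{Enc}_{pk_1}(S_1)$ | $\mathrm{Enc}_{pk_2}(S_2)$ | $\cdots$ | $\mathrm{Enc}_{pk_n}(S_n)$ |

To initiate the protocol, the data collector first randomly assigns a public key $pk_j$ to each $\mathrm{QI}_j \in \mathrm{QID}$. If $|\mathrm{QID}| > n$, the same public key can be assigned to more than one quasi-identifier. Otherwise, the data collector selects $|\mathrm{QID}|$ of the $n$ public keys for the assignment. For simplicity, we will assume that the number of quasi-identifiers equals the number of public keys (i.e., $|\mathrm{QID}| = n$) and $\ell = \{(pk_1, \mathrm{QI}_1), (pk_2, \mathrm{QI}_2), \ldots, (pk_n, \mathrm{QI}_n)\}$. Next, the data collector publishes $(I, \ell)$ to a shared location (e.g., a webpage):
$$(I, \ell) = \{(I_1, (pk_1, \mathrm{QI}_1)), (I_2, (pk_2, \mathrm{QI}_2)), \ldots, (I_n, (pk_n, \mathrm{QI}_n))\}. \quad (1)$$
Based on the information from (1), each respondent $i$ retrieves $\ell$ to examine whether his records in $D_i$ match any of the quasi-identifiers $\mathrm{QI}_j \in \mathrm{QID}$. At this phase, each respondent $i$ maintains a scores list for QID, $\{s_i^1, s_i^2, \ldots, s_i^n\}$. We denote $s_i^j$ as the score determined by respondent $i$ for $\mathrm{QI}_j$. The respondent raises the score $s_i^j$ by 1 whenever a record in $D_i$ matches the quasi-identifier $\mathrm{QI}_j$. Upon completion, respondent $i$ encrypts each $s_i^j$ by using the public key $pk_j$ assigned to the quasi-identifier $\mathrm{QI}_j$. The encrypted scores list computed by each respondent $i$ can be represented as
$$u_i = \{\mathrm{Enc}_{pk_j}(s_i^j) \mid j = 1, 2, \ldots, n\}.$$
Then, all the respondents send $u_i$ to the data collector and to a shared location. Note that this location can be a separate space that is not shared with the data collector.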
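The scoring step performed by each respondent can be sketched as follows. This is a minimal illustration rather than the authors' implementation: records and quasi-identifiers are modeled as hypothetical Python dictionaries of already generalized attribute values, `score_list` is an illustrative name, and the encryption of the scores is left out.

```python
def score_list(records, qid):
    """Return one score per quasi-identifier: the number of matching records."""
    scores = []
    for qi in qid:
        # A record matches a quasi-identifier when it agrees on every listed attribute.
        matches = sum(
            1
            for rec in records
            if all(rec.get(attr) == value for attr, value in qi.items())
        )
        scores.append(matches)
    return scores


# Hypothetical generalized data, for illustration only.
qid = [
    {"Gender": "Male", "Age": "10-16", "Zip": "278**"},
    {"Gender": "Female", "Age": "20-30", "Zip": "279**"},
]
d_i = [
    {"Gender": "Male", "Age": "10-16", "Zip": "278**", "Disease": "Flu"},
    {"Gender": "Male", "Age": "10-16", "Zip": "278**", "Disease": "Asthma"},
]

print(score_list(d_i, qid))  # [2, 0]
```

In the protocol, each entry of this list would then be encrypted under the public key assigned to the corresponding quasi-identifier before it leaves the respondent.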
Upon receiving $u_i$ from all the respondents, the data collector performs the following tasks.
(1) Aggregates the scores determined by all respondents for each $\mathrm{QI}_j$. The data collector performs this computation in encrypted form by using the additive property of the Paillier cryptosystem. The output of the aggregation can be represented as
$$\mathrm{Enc}_{pk_j}(S_j) = \mathrm{Enc}_{pk_j}(s_1^j) +_h \mathrm{Enc}_{pk_j}(s_2^j) +_h \cdots +_h \mathrm{Enc}_{pk_j}(s_n^j), \qquad S_j = \sum_{i=1}^{n} s_i^j.$$
(2) Publishes an outcome table. The data collector publishes the scores for each $\mathrm{QI}_j$ in an outcome table as shown in Table 2. In Table 2, each row ($u_i$) represents the encrypted scores received from respondent $i$, while each column ($V_j$) shows the encrypted scores for the quasi-identifier $\mathrm{QI}_j$. Note that all the data in $V_j$ are encrypted by using the same public key $pk_j$. Therefore, only the respondent whose public key has been assigned to $\mathrm{QI}_j$ can decrypt $\mathrm{Enc}_{pk_j}(S_j)$ to learn the number of matched records ($S_j$) for $\mathrm{QI}_j$.
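The encrypted aggregation in task (1) relies on the standard additive property of Paillier encryption. As a brief worked derivation, assume the textbook form of the scheme with modulus $N$, generator $g$, and fresh randomness $r$ per ciphertext; these symbols are the usual Paillier notation and are not taken from the paper.

```latex
\begin{align*}
\mathrm{Enc}(x; r) &= g^{x} \, r^{N} \bmod N^{2}, \\
\mathrm{Enc}(x_1; r_1) \cdot \mathrm{Enc}(x_2; r_2)
  &= g^{x_1 + x_2} \, (r_1 r_2)^{N} \bmod N^{2}
   = \mathrm{Enc}(x_1 + x_2;\; r_1 r_2), \\
\prod_{i=1}^{n} \mathrm{Enc}\bigl(s_i^{j}; r_i\bigr) \bmod N^{2}
  &= \mathrm{Enc}\Bigl(\sum_{i=1}^{n} s_i^{j};\; \prod_{i=1}^{n} r_i\Bigr).
\end{align*}
```

Under this assumption, homomorphic addition amounts to multiplying ciphertexts modulo $N^2$, so the data collector can form the encrypted column total without holding any private key.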
After the data collector releases the outcome table, the respondents need to verify that the released data are genuine. For instance, each respondent $i$ verifies that the encrypted scores list $u_i$ submitted to the data collector appears as one of the rows in Table 2. If the respondent fails to verify the data, he or she then issues a decision message $m_i$ with a random value.
Self-Awareness Data Collection Protocol
Phase 1: Public Key and Public Identity Submissions
The data collector broadcasts a submission request to $n$ respondents. Each $R_i$ generates a cryptographic key pair $(pk_i, pr_i)$ and a public identity $I_i$ by encrypting its personally identifiable information (PII). Note that the respondents can precompute the cryptographic key pair and the public identity in an offline mode. Next, each $R_i$ sends $(I_i, pk_i)$ to $C$ via the Tor network.
Phase 2: Satisfaction Scores Computation
The data collector $C$ generates QID, decides a threshold $k$, and assigns a public key to each $\mathrm{QI}_j$. Next, it broadcasts the information to all respondents. Each $R_i$ examines whether the records in $D_i$ satisfy QID. For each match, $R_i$ increases the constraint score $s_i^j$ by 1.

Let us assume all the respondents successfully verify the data in Table 2. Next, each respondent $i$ retrieves $V_j$ (based on his public identity $I_i$) and decrypts all encrypted data by using the private key $pr_i$. After the decryption, the respondents must ensure that the aggregated score $\mathrm{Enc}_{pk_j}(S_j)$ computed by the data collector is correct. The respondents can verify this by computing $S_j = \sum_{i=1}^{n} s_i^j$ from the decrypted scores and then comparing it with the decrypted result of $\mathrm{Enc}_{pk_j}(S_j)$. Lastly, each respondent $i$ compares $S_j$ with the threshold $k$ determined by the data collector. If the number of matched records $S_j$ is at least the threshold value (i.e., $S_j \ge k$), we assume that the respondent will submit his records to the data collector. Otherwise, the respondent will abort the data collection process.
At the final phase, each respondent $i$ sends a decision message $m_i$ to the shared location. If the decision message $m_i$ is set to 1, this indicates that $S_j \ge k$. Therefore, the respondents should submit their records to the data collector. Otherwise, if $m_i$ is set to 0, the respondents should not reveal any record to the data collector.
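The final check can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the function name `decision_message` is hypothetical, the inputs are the already decrypted values, and treating a mismatch between the recomputed total and the published aggregate as a 0 decision is our simplification, not a rule stated by the protocol.

```python
def decision_message(public_identity, decrypted_scores, decrypted_total, k):
    """Build (public identity, decision bit) from the decrypted column of scores."""
    s_j = sum(decrypted_scores)            # recomputed satisfaction score
    aggregate_ok = s_j == decrypted_total  # does it match the published aggregate?
    # Assumption: a failed aggregate check is reported as decision 0.
    return (public_identity, 1 if aggregate_ok and s_j >= k else 0)


print(decision_message("I_1", [1, 1, 0], 2, k=2))  # ('I_1', 1)
print(decision_message("I_1", [1, 1, 0], 2, k=3))  # ('I_1', 0)
```

Keeping the aggregate check and the threshold check in one place makes it explicit that a respondent only signals readiness to submit when both hold.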
We summarize our self-awareness data collection protocol in Algorithm 1.

Analysis and Discussion
6.1. Analysis of Correctness.
In this paper, we assume that both the data collector and the respondents are semihonest players. The semihonest model is realistic in our solution. If both players follow the protocol faithfully, each respondent can ensure that he will achieve the protection level offered by the data collector (e.g., $k$-anonymity). At the same time, the data collector can guarantee that the collected datasets are useful for analysis.
During the protocol execution, all respondents are required to verify that (1) the encrypted scores released by the data collector are genuine and (2) the aggregated score for each $\mathrm{QI}_j$ computed by the data collector is correct. The first verification ensures that the data collector has correctly received all data computed by the respondents, while the second verification helps the respondents to detect a malicious data collector.
In our protocol design, the data collector needs to define a protection level (e.g., the $k$ value) before the data collection begins. The data collector can define the same protection level for all $\mathrm{QI}_j$ or define different anonymity levels $k_j$ for each $\mathrm{QI}_j \in \mathrm{QID}$. In the latter case, the respondents can perform the same steps to verify each value of $k_j$.

Analysis of Privacy.
The privacy analysis of our protocol depends on how much information has been revealed during the protocol execution.In general, our solution should protect the privacy of the respondents.This leads to the following two requirements: (1) the data collector should not be able to infer any sensitive information of the respondents from the data collected and (2) the respondents are aware of the data they submit and the protection level they will receive from the data collector.
In our protocol design, we utilize the Tor network to prevent direct communication between the data collector and the respondents. This approach prevents the data collector from tracking the identity of any respondent. Also, we assume that each respondent has no knowledge about the profile of other respondents, but the number of respondents in the protocol is publicly known.
The unique identity $I_i$ of each respondent will not leak the profile of any respondent because it is in encrypted form. The data collector is not able to decrypt $I_i$ in the absence of the private keys of the respondents. Further, our protocol ensures that no party (including the data collector) can learn the encrypted scores in the outcome table before the decryption. Note that only the respondent who has the private key can perform the decryption.
To prevent possible collusions between the data collector and other respondents, we assume that all data transmissions are performed via an anonymous communication channel (e.g., Tor network).This can ensure that the profile of each respondent remains anonymous from others.
The shared location (e.g., a web page or web folder) used in our protocol allows the respondents to learn the decisions made by others and to detect a malicious data collector. Each respondent notifies others about the verification result by using a decision message $m_i$. Since the decision message only reveals the public identity of the respondent, we can assume that the profile of the respondents remains hidden from others.

Analysis of Efficiency.
The complexity of our protocol is dominated by the cryptographic operations (encryption and decryption) performed by the respondents. We implemented our protocol in Java and ran it on a single computer with a 2 GHz CPU and 2 GB of RAM. The performance evaluation is shown in Figure 2. Each respondent performs the same number of cryptographic operations in our experiment.
6.4. Discussion.
In this paper, we assume that the number of public keys (i.e., the number of respondents) equals the number of quasi-identifiers (i.e., $|R| = |\mathrm{QID}| = n$). However, our protocol also works correctly for unequal cases. The owner of the public key only performs the decryption and computes $S_j$ at the end of the protocol execution. A respondent may not be involved in the final phase if his public key is not selected by the data collector (for cases when $|R| > |\mathrm{QID}|$). Conversely, a respondent needs to repeat the final phase several times if his public key is assigned to more than one $\mathrm{QI}_j$.
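One possible way to handle the unequal cases is sketched below. The paper only states that keys are assigned randomly, so the details here are assumptions: the function name `assign_public_keys` is hypothetical, and using every key at least once when there are more quasi-identifiers than keys is our own choice.

```python
import random


def assign_public_keys(public_keys, qid, rng=random):
    """Pair every quasi-identifier with one respondent public key."""
    n, q = len(public_keys), len(qid)
    if q <= n:
        # Enough keys: pick q distinct keys; unselected respondents skip the final phase.
        chosen = rng.sample(list(public_keys), q)
    else:
        # More quasi-identifiers than keys: keys are reused.
        # Assumption: every key is used at least once before any key repeats.
        chosen = list(public_keys)
        chosen += [rng.choice(list(public_keys)) for _ in range(q - n)]
        rng.shuffle(chosen)
    return list(zip(chosen, qid))


# Hypothetical key and quasi-identifier labels.
print(assign_public_keys(["pk1", "pk2", "pk3"], ["QI1", "QI2"]))
print(assign_public_keys(["pk1", "pk2"], ["QI1", "QI2", "QI3", "QI4"]))
```

A respondent whose key appears several times in the result would decrypt and check one column per assigned quasi-identifier.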

Conclusion and Future Work
In this paper, we presented a self-awareness protocol for IoT data collection. Since releasing raw data to the data collector carries a high risk of compromising the privacy of the respondents, we aim to increase the confidence of the respondents before they submit their records to the data collector. Our self-awareness protocol allows each respondent to help others in order to preserve his own privacy. At the same time, the final collected data should adhere to the protection level promised by the data collector before the data collection begins. Also, our solution can be extended to support an indictment scheme (when the data are released to a third party) because the respondents have evidence (e.g., the value of $k$) to indict a malicious data collector.

Figure 1: Overview of the proposed solution.

Phase 2 (continued): each match raises $s_i^j$ by 1. We denote $s_i^j$ as the score determined by $R_i$ for $\mathrm{QI}_j$. Next, each $R_i$ encrypts $\{s_i^j \mid j = 1, 2, \ldots, q\}$ by using the public key $pk_j$ to produce $u_i = \{\mathrm{Enc}_{pk_j}(s_i^j) \mid j = 1, 2, \ldots, q\}$. Each $R_i$ then anonymously sends $u_i$ to $C$ and a shared location.
Phase 3: Scores List Verification
The data collector $C$ computes and publishes an outcome table. Each $R_i$ examines whether the published scores list is the same as the original list he sent to $C$. If the list has been modified, the respondent will not participate in the next phase.
Phase 4: Satisfaction Score Checking
Each $R_i$ retrieves and decrypts $\{\mathrm{Enc}_{pk_j}(s_i^j) \mid i = 1, 2, \ldots, n\}$. Next, it computes $S_j = \sum_{i=1}^{n} s_i^j$ as the satisfaction score for $\mathrm{QI}_j$. If the satisfaction score $S_j$ reaches at least $k_j$ occurrences (i.e., $S_j \ge k_j$), $R_i$ sends $m_i = (I_i, 1)$ to $C$. Otherwise, $m_i = (I_i, 0)$ is sent to $C$.
Phase 5: Data Submission
The respondents submit their records to $C$ with the confidence that their privacy protection is achieved at the $k$-anonymity level.
Algorithm 1: Self-awareness data collection protocol.
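The phases of Algorithm 1 can be simulated end to end in a short script. This is only a sketch under several assumptions: it uses a toy Paillier implementation with tiny hard-coded primes (insecure, for illustration only), it omits Tor and the public identities, it assumes as many quasi-identifiers as respondents, and all names (`keygen`, `enc`, `dec`, `add`, `run_protocol`, `PRIMES`) are hypothetical rather than taken from the authors' Java implementation.

```python
import math
import secrets


def keygen(p, q):
    """Toy Paillier key pair from two small primes, with generator g = n + 1."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    return n, (lam, pow(lam, -1, n))  # public key n, private key (lam, mu)


def enc(n, x):
    n2 = n * n
    r = secrets.randbelow(n - 2) + 2
    while math.gcd(r, n) != 1:
        r = secrets.randbelow(n - 2) + 2
    return (1 + x * n) * pow(r, n, n2) % n2  # (1 + n)^x = 1 + x*n (mod n^2)


def dec(n, sk, c):
    lam, mu = sk
    return (pow(c, lam, n * n) - 1) // n * mu % n


def add(n, c1, c2):
    return c1 * c2 % (n * n)  # homomorphic addition = ciphertext multiplication


def run_protocol(databases, qid, k, primes):
    n_resp = len(databases)
    assert len(qid) == n_resp == len(primes)  # simplifying assumption
    # Phase 1: every respondent creates a key pair; key j is assigned to QI_j.
    keys = [keygen(p, q) for p, q in primes]
    pks = [pk for pk, _ in keys]
    # Phase 2: every respondent scores its records and encrypts score j under pk_j.
    u = []
    for db in databases:
        scores = [sum(1 for rec in db if rec == qi) for qi in qid]
        u.append([enc(pks[j], s) for j, s in enumerate(scores)])
    # Phase 3: the collector aggregates every column without decrypting it.
    totals = []
    for j in range(len(qid)):
        acc = u[0][j]
        for i in range(1, n_resp):
            acc = add(pks[j], acc, u[i][j])
        totals.append(acc)
    # Phase 4: the owner of pk_j decrypts column j, rechecks the total, compares with k.
    decisions = []
    for j in range(len(qid)):
        n_j, sk_j = keys[j]
        s_j = sum(dec(n_j, sk_j, u[i][j]) for i in range(n_resp))
        ok = s_j == dec(n_j, sk_j, totals[j]) and s_j >= k
        decisions.append(1 if ok else 0)
    return decisions


# Hypothetical toy inputs: records are tuples of generalized QI values.
PRIMES = [(1789, 1861), (1871, 1873), (1877, 1879)]
qid = [("Male", "10-16"), ("Female", "20-30"), ("Male", "20-30")]
databases = [[("Male", "10-16")], [("Male", "10-16")], [("Female", "20-30")]]

print(run_protocol(databases, qid, k=2, primes=PRIMES))  # [1, 0, 0]
```

In this toy run only the first quasi-identifier is matched by at least two records, so only its key owner would issue a positive decision before Phase 5.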

Figure 2: Performance of the proposed solution.

Table 1: Sample medical dataset.
The ciphertext $\mathrm{Enc}_{pk}(x_1 + x_2)$ is obtained through this additive property, which can be evaluated without the decryption key.
4.2. Definitions.
Let us assume that there are $n$ respondents $R = \{R_1, R_2, \ldots, R_n\}$ and a data collector $C$. Each respondent $i$ has a database $D_i$ containing his records. We denote $T$ as the dataset collected by the data collector. The dataset $T$ consists of $q$ quasi-identifiers $\mathrm{QID} = \{\mathrm{QI}_1, \mathrm{QI}_2, \ldots, \mathrm{QI}_q\}$ and a sensitive attribute.
Notations
$q$: Size of the quasi-identifier set QID
$\mathrm{QI}_j$: $j$th quasi-identifier in QID
$I_i$: Public identity of respondent $i$
$s_i^j$: Score determined by respondent $i$ for $\mathrm{QI}_j$
$S_j$: Satisfaction score of $\mathrm{QI}_j$
$pk_i$: Public key of respondent $i$
$pr_i$: Private key of respondent $i$
$\mathrm{Enc}_{pk_i}(\cdot)$: Encryption operation by using $pk_i$
$\mathrm{Dec}_{pr_i}(\cdot)$: Decryption operation by using $pr_i$
$m_i$: Decision message from respondent $i$.