Frequent Symptom Sets Identification from Uncertain Medical Data in Differentially Private Way



Introduction
The Internet of Things (IoT) involves many different base technologies, such as wireless sensors, data management, and cloud computing [1]. Today, IoT technology is successfully applied in the field of eHealth [2][3][4]. Medical personnel can use IoT technology to collect large amounts of patient data that can help them provide better medical services to patients [5,6].
Frequent itemsets mining is applied in fields such as eHealth and bioinformatics. Traditional algorithms for mining frequent itemsets from medical data are based on certain data [7] and can be applied to discover hidden symptom patterns in large volumes of patient symptom data. These patterns can be used by health managers to provide better healthcare for users [8]. For example, in [9,10], the Apriori algorithm was applied to identify prevalent diseases and analyze medical billing. However, the Apriori algorithm mines frequent itemsets from certain data. In medicine, because patients differ in physical condition, the same physiological index corresponds to a different symptom association probability for each patient. As a result, there is uncertainty in patient data. Therefore, traditional algorithms for mining frequent itemsets from certain data cannot be directly applied to patient data.
Another important factor is that medical records contain sensitive patient information. An adversary with sufficient background information can make use of frequent patterns mined from patient data to obtain the sensitive information of patients. Hence, it is very important to protect patient privacy when mining frequent itemsets from medical data [11].
The set of symptoms that a patient suffers from constitutes the patient's data. Because of the probabilities associated with these symptoms, there is uncertainty in patient data. A large amount of patient data constitutes uncertain data. In the field of medicine, there are many studies on symptom association probability. For example, one study monitored oesophageal pH over a 24 h period to obtain the symptom association probability, which was then used to evaluate the association between a patient's symptoms and gastroesophageal reflux [12]. By analyzing large amounts of patient data, Beglinger et al. determined the probability that a patient suffering from Huntington's disease also had obsessive and compulsive symptoms [13]. By analyzing the data of patients suffering from irritable bowel syndrome, Arsiè et al. determined the probability that indicates the association between meal ingestion and abdominal pain symptoms for these patients [14]. In this paper, based on the symptom association probability obtained by medical technology, we focus on how to mine frequent itemsets from uncertain medical data while also protecting data privacy.
In this paper, a new algorithm, denoted as U-PrivMining (uncertain medical data differentially private frequent itemsets mining), is proposed to mine the top $k$ most frequent itemsets from uncertain medical data in a differentially private way. In uncertain medical data, each item corresponds to a symptom of patients. U-PrivMining has two phases. In the first phase, based on traditional algorithms for mining frequent itemsets from uncertain data, the sparse vector algorithm and the Laplace mechanism are applied to ensure differential privacy for all the frequent itemsets mined from uncertain medical data. In the second phase, based on these frequent itemsets, the Laplace mechanism is applied to ensure differential privacy for the top $k$ most frequent itemsets for uncertain data, as well as the expected supports of these frequent itemsets. We use the sparse vector algorithm to improve the efficiency of our algorithm. In [15], the sparse vector algorithm was used to mine the top $k$ most frequent itemsets from certain data with a differential privacy guarantee. One major advantage of the sparse vector algorithm is that information disclosure affecting differential privacy occurs only for count queries above the threshold; negative answers do not count against the "privacy budget" [15]. The sparse vector algorithm is also suitable for guaranteeing differential privacy when mining frequent itemsets from uncertain data. For certain data, the occurrence count of an itemset is used to determine whether the itemset is frequent. For mining frequent itemsets based on expected support from uncertain data, the expected support of an itemset is used to judge whether the itemset is frequent [16].
To summarize, our key contributions are the following: (i) A new algorithm is proposed to mine the top $k$ most frequent itemsets from uncertain medical data and ensure differential privacy. Traditional algorithms for mining frequent itemsets in differentially private ways are based on certain data and thus cannot be directly applied to process uncertain medical data. (ii) Through privacy analysis, we prove that U-PrivMining guarantees differential privacy in theory. Our experimental results on four real-world scenario datasets and two synthetic datasets illustrate the efficiency of U-PrivMining.
This paper is organized as follows. Section 2 presents an overview of related work on eHealth, IoT, frequent itemsets mining for uncertain data, and differential privacy. In Section 3, some notations used in this paper are introduced. The U-PrivMining algorithm and the proof that U-PrivMining satisfies differential privacy in theory are presented in Section 4. In Section 5, the performance of U-PrivMining is evaluated with six datasets. In the last section, we conclude our work.

Related Works
eHealth applies IoT technology to provide better healthcare services to users. In 2009, Niyato et al. proposed a remote and mobile patient monitoring system that applies heterogeneous wireless access to monitor the biosignals of patient mobility [17]. In 2015, based on the limitations of traditional cellular networks for eHealth services, Yi et al. designed a transmission scheduling mechanism for delay-sensitive medical packets in an eHealth network [18]. IoT-based eHealth systems use monitoring devices to collect large amounts of patient data. Data mining can find hidden patterns in these data, which can assist medical personnel in providing improved medical services to patients. In 2009, Karaolis et al. proposed an algorithm that used association rule mining to assess the risk of coronary events [19]. When traditional data mining technologies are applied to medical data, many useless patterns are discovered. In 2013, Lee et al. proposed a novel algorithm for mining association rules to determine the relationship between blood factors and disease history [20]. This algorithm reduced the number of useless patterns mined from medical data. In 2014, Park et al. used association rules mined from medical data to identify risk behaviors in daily life [21].
The phenomenon of data uncertainty is very common. Traditional algorithms for mining frequent itemsets based on certain data cannot be directly applied to mine frequent itemsets from uncertain data. There are two categories of research on mining frequent itemsets from uncertain data [22]. The first category is mining frequent itemsets based on expected support. In 2007, Chui et al. proposed the notion of expected support and proposed the U-Apriori algorithm based on the Apriori algorithm [23]. The second category is probabilistic frequent itemsets mining. In 2012, the characteristics of the Poisson binomial distribution were introduced to mine probabilistic frequent itemsets [24]. In 2012, Bernecker et al. proposed an algorithm based on the frequent pattern tree to mine probabilistic frequent itemsets from uncertain data [25].
Protecting the privacy of patient data is a challenge for eHealth, and many studies have been conducted on eHealth security [26][27][28][29][30][31][32][33]. Differential privacy ensures that when one record in the input database of a mechanism is changed, the output of the mechanism is insensitive to the change [34]. In 2006, Dwork et al. proposed the Laplace mechanism to ensure differential privacy for real-valued output [35]. In 2010, Bhaskar et al. proposed an algorithm based on truncated frequencies to ensure differential privacy for the top $k$ most frequent itemsets for certain data [36]. In 2012, Li et al. introduced the notion of a basis set to ensure differential privacy for mining the top $k$ most frequent itemsets from certain data [37]. In 2014, Lee et al. applied the sparse vector algorithm and the Laplace mechanism to guarantee differential privacy for the top $k$ most frequent itemsets mined from certain data [15]. In 2015, Su et al. introduced a smart splitting

Preliminaries
The fundamental notions of mining frequent itemsets from uncertain data [23] and differential privacy [34,35] are reviewed in this section. These notions are used throughout this paper. The terms "item" and "symptom" are used interchangeably, as are "itemset" and "symptom set".

Frequent Itemsets Mining for Uncertain Data.
In uncertain data, each item in a record is associated with an existential probability, which indicates the likelihood that the item appears in the record. For example, let $I$ = {hypotension, eating disorder, anemia, neurasthenia} be the set of items. The uncertain data $D$ is shown in Table 1, from which we can obtain the following information: $D = \{t_1, t_2\}$ and $t_1$ = {(hypotension: 0.3), (eating disorder: 0.1)}, which means that user $u_1$ may be suffering from hypotension and eating disorder. The probability of {hypotension} existing in $t_1$ is equal to 0.3; in other words, $P(\text{hypotension} \in t_1) = 0.3$. This means that the probability of user $u_1$ suffering from hypotension is equal to 0.3.
A set of possible worlds (possible certain databases), denoted as $W = \{W_1, W_2, \ldots, W_{|W|}\}$, can be inferred from uncertain data $D$. According to the existential probabilities $P(x \in t_i)$, each possible world $W_j$ ($1 \le j \le |W|$) is generated by instantiating every record $t_i \in D$. Table 2 shows a set of possible worlds inferred from the uncertain data shown in Table 1. For instance, the possible world $W_2$ = {{hypotension}, {anemia, hypotension}} in Table 2 means that user $u_1$ is suffering from hypotension and user $u_2$ is suffering from anemia and hypotension.
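We assume that all the records in the uncertain data and all the uncertain items in the same record are mutually independent. The probability of a possible world $W_j$, denoted as $P(W_j)$, can be obtained by the following [23]:
$$P(W_j) = \prod_{i=1}^{|D|} \Bigl( \prod_{x \in T(W_j, t_i)} P(x \in t_i) \times \prod_{y \in t_i,\ y \notin T(W_j, t_i)} \bigl(1 - P(y \in t_i)\bigr) \Bigr) \tag{1}$$
where $T(W_j, t_i)$ denotes the set of items contained in record $t_i$ and belonging to $W_j$. The expected support of itemset $X$, denoted as $\mathit{esup}(X)$, can be obtained by the following [23]:
$$\mathit{esup}(X) = \sum_{j=1}^{|W|} P(W_j) \times S(X, W_j) \tag{2}$$
where $S(X, W_j)$ is the support count of itemset $X$ in possible world $W_j$. For Table 2, in $W_2$, we can obtain $P(W_2) = 1 \times (1 - 0.3) \times 1 \times 0.7 \times (1 - 0.6) = 0.196$, $T(W_2, t_1)$ = {hypotension}, and $T(W_2, t_2)$ = {anemia, hypotension}.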
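The two routes to the expected support can be compared on a toy example. The sketch below is illustrative only: the first record follows the example in the text, while the second record, the variable names, and both function names are assumptions made for the illustration. It computes the expected support directly as a sum over records of products of existential probabilities, and checks it against a brute-force enumeration of every possible world.

```python
from itertools import product
from math import isclose, prod

# Toy uncertain database: each record maps item -> existential probability.
# The second record is a hypothetical example, not taken from Table 1.
uncertain_db = [
    {"hypotension": 0.3, "eating disorder": 0.1},
    {"hypotension": 0.7, "anemia": 0.6},
]


def expected_support(itemset, db):
    """Sum over records of the product of item probabilities (independence assumed)."""
    return sum(prod(record.get(item, 0.0) for item in itemset) for record in db)


def expected_support_by_worlds(itemset, db):
    """Enumerate every possible world and weight its support count by its probability."""
    slots = [(i, item, p) for i, record in enumerate(db) for item, p in record.items()]
    total = 0.0
    for present in product([False, True], repeat=len(slots)):
        world_prob = 1.0
        world = [set() for _ in db]
        for (i, item, p), keep in zip(slots, present):
            world_prob *= p if keep else 1.0 - p
            if keep:
                world[i].add(item)
        support = sum(1 for record in world if set(itemset) <= record)
        total += world_prob * support
    return total
```

The enumeration grows exponentially with the number of uncertain items, so it is only useful as a check on small inputs; the direct sum is linear in the number of records.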

Differential Privacy.
Differential privacy ensures that the output of an analysis mechanism is insensitive to changes in input records. If an analysis mechanism ensures differential privacy, its output will be insensitive to the addition or removal of a record from the input database. As a result, the output cannot be used by adversaries to gain access to a patient's record using their background information [35]. Many studies on privacy protection are based on two assumptions. The first assumption is that the background information of adversaries is already known to the security manager.
The second assumption is that the security manager knows which information should be kept private for users. Differential privacy can protect the sensitive information of users without relying on either assumption [34]. Two databases, $D_1$ and $D_2$, are a pair of neighboring databases if and only if they differ by no more than one record.
Definition 1 ($\varepsilon$-differential privacy [34]). Let $\mathrm{Range}(A)$ be the set of possible outputs of a randomized algorithm $A$, and let $D$ and $D'$ be any pair of neighboring datasets. If (3) is satisfied, then algorithm $A$ guarantees $\varepsilon$-differential privacy:
$$\Pr[A(D) = O] \le e^{\varepsilon} \times \Pr[A(D') = O] \tag{3}$$
where $\varepsilon$ is the privacy budget of differential privacy and $O \in \mathrm{Range}(A)$. The sensitivity is the maximal possible difference between the outputs for any pair of neighboring datasets.
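Definition 2 (sensitivity [34]). Given a function $f$ that maps a dataset to a vector of real numbers, the sensitivity of $f$, denoted as $\Delta f$, can be obtained by
$$\Delta f = \max_{D, D'} \lVert f(D) - f(D') \rVert_1 \tag{4}$$
where $D$ and $D'$ are any pair of neighboring datasets.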
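To make the role of the sensitivity concrete, the following is a minimal sketch of the Laplace mechanism, which perturbs a real-valued query answer with noise of scale $\Delta f / \varepsilon$. The function names are assumptions made for the illustration. The sampler relies on the fact that the difference of two independent exponential variables with the same rate follows a Laplace distribution with mean 0.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from Laplace(0, scale) as a difference of two exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Return the query answer perturbed with Laplace noise of scale sensitivity / epsilon."""
    return true_value + laplace_noise(sensitivity / epsilon, rng)
```

A smaller privacy budget gives a larger scale and therefore a noisier answer; for example, a query with sensitivity 1 answered under a budget of 0.5 receives noise of scale 2.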

U-PrivMining Algorithm
This section introduces the U-PrivMining algorithm, which determines the top $k$ most frequent itemsets from uncertain data, in which each item corresponds to a symptom of patients, in a differentially private way. The process of U-PrivMining consists of two phases. In the first phase, the assigned privacy budget is equal to $\varepsilon_1 = \alpha \cdot \varepsilon$. In the second phase, the assigned privacy budget is equal to $\varepsilon_2 = (1 - \alpha) \cdot \varepsilon$. The parameter $\alpha \in (0, 1)$ controls how the privacy budget is divided between the two phases. In this study, we chose $\alpha = 1/3$ for all uncertain data. However, this choice may not be optimal. It appears that the optimal allocation depends on the characteristics of the uncertain medical data and the value of $k$ [36].

Description of U-PrivMining.
The whole process of U-PrivMining is introduced in this section. U-PrivMining is composed of two phases. In the first phase, we obtain a threshold such that the expected supports of the top $k$ most frequent itemsets are greater than or equal to it. The privacy budget allocated to this phase is equal to $\varepsilon_1 = (1/3) \cdot \varepsilon$. On the basis of traditional algorithms for mining frequent itemsets from uncertain data, we apply the sparse vector algorithm [15] and the Laplace mechanism to ensure $(\varepsilon/3)$-differential privacy for this phase. The steps in the first phase of U-PrivMining are as follows.
Step 1. The expected support of the $k$th most frequent itemset, denoted as $\theta_k$, is obtained by using traditional algorithms for mining frequent itemsets based on expected support from uncertain data.
Step 2. The noisy threshold, denoted as $\hat{\theta}_k$, can be obtained by
$$\hat{\theta}_k = \theta_k + \mathrm{Lap}(12/\varepsilon) \tag{6}$$
where $\mathrm{Lap}(12/\varepsilon)$ is the noisy data generated by the Laplace distribution, whose mean and scale are 0 and $12/\varepsilon$, respectively.
Step 3. On the basis of traditional algorithms for mining frequent itemsets from uncertain data, the sparse vector algorithm is applied to obtain all the frequent itemsets whose assessment expected supports are greater than or equal to the noisy threshold $\hat{\theta}_k$. The assessment expected support of an itemset $X$, denoted as $\widetilde{\mathit{esup}}(X)$, can be obtained by
$$\widetilde{\mathit{esup}}(X) = \mathit{esup}(X) + \mathrm{Lap}(4/\varepsilon) \tag{7}$$
where $\mathit{esup}(X)$ is the expected support of itemset $X$ and $\mathrm{Lap}(4/\varepsilon)$ is the noisy data generated by the Laplace distribution, whose mean and scale are 0 and $4/\varepsilon$, respectively.
Step 4. All the frequent itemsets obtained in Step 3 and the expected supports of these itemsets are taken as the output of this phase.
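The four steps of the first phase can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes the expected supports of all candidate itemsets have already been computed and are passed in as a dictionary, whereas the algorithm described above interleaves the threshold test with candidate generation. The function names `laplace` and `phase_one` are hypothetical.

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def phase_one(expected_supports, k, epsilon, rng):
    """Keep itemsets whose noisy expected support reaches the noisy threshold.

    expected_supports: dict mapping itemset (e.g. a tuple) -> expected support.
    Assumption: candidates are precomputed rather than generated level by level.
    """
    if not expected_supports:
        return {}
    ranked = sorted(expected_supports.values(), reverse=True)
    theta_k = ranked[min(k, len(ranked)) - 1]                  # Step 1
    noisy_threshold = theta_k + laplace(12.0 / epsilon, rng)   # Step 2
    selected = {}
    for itemset, esup in expected_supports.items():            # Step 3
        if esup + laplace(4.0 / epsilon, rng) >= noisy_threshold:
            selected[itemset] = esup
    return selected                                            # Step 4
```

Because both the threshold and each comparison are noisy, itemsets whose expected support is close to the threshold may be included or excluded from run to run.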
In the second phase, according to the output of the first phase, U-PrivMining obtains the top $k$ most frequent itemsets for uncertain data and the noisy expected supports of these frequent itemsets. Let $H$ denote the set of frequent itemsets output by the first phase. The privacy budget allocated to the second phase is equal to $\varepsilon_2 = (2/3) \cdot \varepsilon$. The privacy budgets allocated to ensure differential privacy for the top $k$ most frequent itemsets for uncertain data and for the expected supports of these itemsets are equal to $\varepsilon_{2,1} = \beta \cdot \varepsilon_2$ and $\varepsilon_{2,2} = (1 - \beta) \cdot \varepsilon_2$, respectively. The second phase of U-PrivMining is described below.
Step 1 (if $|H|$ is less than or equal to $k$, $\beta$ is equal to 0). All the itemsets in $H$ belong to the top $k$ most frequent itemsets for uncertain data. Step 3 is then executed directly.
Step 2 (if $|H|$ is greater than $k$, $\beta$ is equal to 0.5). The perturbation expected supports of all the itemsets in $H$ are obtained. The perturbation expected support of itemset $h_i$ ($1 \le i \le |H|$), denoted as $p(h_i)$, can be obtained by
$$p(h_i) = \mathit{esup}(h_i) + y_i \tag{8}$$
where $y_1, y_2, \ldots, y_{|H|}$ are mutually independent and drawn from the Laplace distribution, whose mean and scale are 0 and $|H|/\varepsilon_{2,1}$, respectively. The $k$ itemsets in $H$ with the highest perturbation expected supports are taken as the top $k$ most frequent itemsets for uncertain data.
Step 3. Let $F = \{f_1, f_2, \ldots, f_k\}$ be the set of the top $k$ most frequent itemsets for uncertain data obtained in the above steps. The noisy expected supports of all the itemsets in $F$ are obtained. The noisy expected support of itemset $f_i$ ($1 \le i \le k$), denoted as $\widehat{\mathit{esup}}(f_i)$, can be obtained by
$$\widehat{\mathit{esup}}(f_i) = \mathit{esup}(f_i) + z_i \tag{9}$$
where $z_1, z_2, \ldots, z_k$ are mutually independent and drawn from the Laplace distribution whose mean and scale are equal to 0 and $k/\varepsilon_{2,2}$, respectively.
Step 4. The top $k$ most frequent itemsets for uncertain data and the noisy expected supports of these itemsets are taken as the output of U-PrivMining.
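The second phase can be sketched in the same style. The code below is an illustrative reading of the steps above rather than a reference implementation: `phase_two` and `laplace` are hypothetical names, the input is assumed to be the dictionary of itemsets and expected supports produced by the first phase, and the budget split follows the values stated in the text (two thirds of the total budget for this phase, divided evenly when a selection step is needed).

```python
import random


def laplace(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def phase_two(candidates, k, epsilon, rng):
    """Select the top k itemsets and release their noisy expected supports.

    candidates: dict mapping itemset -> expected support (output of phase one).
    """
    eps2 = 2.0 * epsilon / 3.0
    if len(candidates) <= k:
        beta = 0.0                      # Step 1: no selection needed
        top = list(candidates)
    else:
        beta = 0.5                      # Step 2: noisy selection
        select_scale = len(candidates) / (beta * eps2)
        perturbed = {x: s + laplace(select_scale, rng) for x, s in candidates.items()}
        top = sorted(perturbed, key=perturbed.get, reverse=True)[:k]
    release_scale = k / ((1.0 - beta) * eps2)   # Step 3
    return {x: candidates[x] + laplace(release_scale, rng) for x in top}  # Step 4
```

Note that the selection noise and the release noise are drawn independently, so the order of the released noisy supports need not match the order used for selection.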

Privacy Analysis for U-PrivMining.
In this section, we prove that U-PrivMining is $\varepsilon$-differentially private. In order to prove that U-PrivMining guarantees differential privacy, we introduce the notions of count query set and threshold query set.
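According to the definition of count query set, the sensitivity of the count query and count query set can be obtained as follows.
Lemma 7. The sensitivity of obtaining the expected support of an itemset is equal to 1.
Proof. According to (2), the expected support of an itemset $X$, denoted as $\mathit{esup}(X)$, can equivalently be computed as follows [23]:
$$\mathit{esup}(X) = \sum_{i=1}^{n} \prod_{x \in X} P(x \in t_i) \tag{10}$$
where $n$ is the number of records in the uncertain data $D$ and $t_i$ ($1 \le i \le n$) is a record in $D$. Let $D$ and $D'$ be a pair of neighboring databases. Let $C = \{t \mid t \in D \wedge t \in D'\}$ be the intersection of $D$ and $D'$. Let $|D|$, $|D'|$, and $|C|$ be the sizes of $D$, $D'$, and $C$, respectively. Let $\mathit{esup}_D(X)$ and $\mathit{esup}_{D'}(X)$ be the expected supports of itemset $X$ for $D$ and $D'$, respectively. According to (10), the values of $\mathit{esup}_D(X)$ and $\mathit{esup}_{D'}(X)$ can be computed as follows:
$$\mathit{esup}_D(X) = \sum_{t \in C} \prod_{x \in X} P(x \in t) + \sum_{t \in D \setminus C} \prod_{x \in X} P(x \in t), \qquad \mathit{esup}_{D'}(X) = \sum_{t \in C} \prod_{x \in X} P(x \in t) + \sum_{t \in D' \setminus C} \prod_{x \in X} P(x \in t) \tag{11}$$
Because $D$ and $D'$ differ by no more than one record and every product of existential probabilities lies in $[0, 1]$, it follows that $|\mathit{esup}_D(X) - \mathit{esup}_{D'}(X)| \le 1$.
Here, the noisy thresholds are those obtained for a pair of neighboring databases $D_1$ and $D_2$, respectively. According to Definition 1, (12) is satisfied.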
Proof. According to Lemma 7, the sensitivity of obtaining the expected support of an itemset is equal to 1. Therefore, the sensitivity of obtaining the expected supports of all the frequent itemsets in $H$, the output of the first phase, is equal to $|H|$. When $|H|$ is greater than $k$, according to the Laplace mechanism, the scale of the Laplace distribution used to ensure differential privacy for the top $k$ most frequent itemsets is equal to $|H|/\varepsilon_{2,1}$. Hence, obtaining the top $k$ most frequent itemsets ensures $\varepsilon_{2,1}$-differential privacy. The sensitivity of obtaining the expected supports of the top $k$ most frequent itemsets is equal to $k$. According to the Laplace mechanism, the noisy data used to obtain the noisy expected supports of the top $k$ most frequent itemsets obeys the Laplace distribution whose scale is equal to $k/\varepsilon_{2,2}$. Hence, obtaining the noisy expected supports of the top $k$ most frequent itemsets for uncertain data ensures $\varepsilon_{2,2}$-differential privacy. As a consequence, according to Lemma 4, the second phase of U-PrivMining guarantees $(2\varepsilon/3)$-differential privacy.
According to the analysis of the two phases of U-PrivMining, we can conclude that the first and second phases are $(\varepsilon/3)$-differentially private and $(2\varepsilon/3)$-differentially private, respectively. According to Lemma 4, U-PrivMining is $\varepsilon$-differentially private.

Experiments
In our experiments, four real-world scenario datasets and two synthetic datasets, which can be downloaded from [39], were used to verify the efficiency of U-PrivMining. The parameters of these public datasets are shown in Table 3, where the number of items in a dataset is denoted as $|I|$ and the number of transactions in a dataset is denoted as $|D|$. The maximal length of transactions in a dataset is denoted as $\max|t|$. The average length of transactions in a dataset is denoted as $\mathrm{avg}|t|$. In order to add uncertainty to these datasets, a random existential probability in the range [0, 1] is assigned to each item in each transaction.
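The way uncertainty is added to a certain dataset can be sketched in a few lines. This is only an illustration of the procedure described above: the function name `add_uncertainty` is hypothetical, a uniform distribution is assumed because the text does not state which distribution was used, and the seed handling is left to the caller.

```python
import random


def add_uncertainty(transactions, rng):
    """Attach a random existential probability in [0, 1] to every item of every transaction.

    Assumption: probabilities are drawn uniformly; the source does not specify the distribution.
    """
    return [{item: rng.uniform(0.0, 1.0) for item in transaction}
            for transaction in transactions]
```

For example, `add_uncertainty([["a", "b"], ["c"]], random.Random(42))` returns two records, each mapping its items to a probability.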

Evaluation Metrics.
U-PrivMining applies the Laplace mechanism and the sparse vector algorithm to ensure differential privacy for the top $k$ most frequent itemsets for uncertain data and the expected supports of these frequent itemsets. The Laplace mechanism protects the privacy of U-PrivMining's output by adding noisy data. As described in Definition 10, the precision is the proportion of the itemsets mined by U-PrivMining that belong to the correct top $k$ most frequent itemsets for uncertain data. The recall is the proportion of the correct top $k$ most frequent itemsets for uncertain data that are mined by U-PrivMining. The F-score is the harmonic mean of precision and recall. When the number of frequent itemsets obtained from the first phase of U-PrivMining is greater than or equal to $k$, both the set of itemsets mined by U-PrivMining and the set of correct top $k$ most frequent itemsets contain $k$ itemsets. As a result, the F-score and the recall are equal to the precision.
As described in Definition 11, the value of RE is used to evaluate the influence of the noisy data on the noisy expected supports of the top $k$ most frequent itemsets for uncertain data. There may be extremely large or small values in the experimental results. Because the median is not skewed by such extreme values, the median was applied to evaluate the relative error.
In order to evaluate the influence of the privacy budget on the F-score and RE, we conducted four groups of experiments. The $k$ values were set as 50, 100, 150, and 200, respectively. Figure 1 shows the F-score obtained by U-PrivMining running on the six public datasets under different privacy budget values. As can be seen from the figure, when the $k$ value is fixed, the F-score fluctuates and approaches 1 with increasing privacy budget. In the first phase of U-PrivMining, the algorithm generates noisy data to obtain the noisy threshold and the assessment expected supports of itemsets. According to (6) and (7), the greater the value of the privacy budget, the smaller the scale of the Laplace distribution used to generate the noisy data in this step. In the second phase of U-PrivMining, the algorithm obtains the top $k$ most frequent itemsets by adding noisy data to the expected supports. The noisy data is drawn from the Laplace distribution whose mean and scale are equal to 0 and $|H|/\varepsilon_{2,1}$, respectively, where $|H|$ is the number of frequent itemsets output by the first phase. As a result, the F-score improves and approaches 1 with increasing privacy budget. From Figure 1, we can conclude that the lower the expected supports of the top $k$ most frequent itemsets for the uncertain data, the lower the convergence speed of the F-score. For the T10I4D100K dataset, the expected supports of the top $k$ most frequent itemsets are lower than those of the other datasets. Therefore, the convergence speed of U-PrivMining running on the T10I4D100K dataset is lower than that of U-PrivMining running on the other datasets. U-PrivMining applies the Laplace mechanism to ensure data privacy. Hence, if the noisy data is large relative to the expected supports of the top $k$ most frequent itemsets, then the F-score of U-PrivMining is relatively lower.
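The two evaluation metrics discussed in this section can be computed as in the sketch below. It assumes the usual definitions: the F-score as the harmonic mean of precision and recall over the mined and correct itemset collections, and the relative error as the median of the per-itemset relative deviations. The function names and the dictionary-based inputs are assumptions made for the illustration.

```python
from statistics import median


def f_score(mined, truth):
    """Harmonic mean of precision and recall for two collections of itemsets."""
    mined, truth = set(mined), set(truth)
    hits = len(mined & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(mined)
    recall = hits / len(truth)
    return 2.0 * precision * recall / (precision + recall)


def median_relative_error(noisy_supports, true_supports):
    """Median over released itemsets of |noisy - true| / true (true supports must be nonzero)."""
    return median(abs(noisy_supports[x] - true_supports[x]) / true_supports[x]
                  for x in noisy_supports)
```

When both collections contain the same number of itemsets, precision and recall coincide, so the F-score reduces to either of them.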

Analysis of Experimental Results.
Figure 2 shows the RE results obtained by U-PrivMining running on the six public datasets under different privacy budget values. When $k$ is fixed, the value of RE fluctuates and approaches 0 with increasing privacy budget. The noisy expected support of an itemset is obtained by adding noisy data drawn from the Laplace distribution to the expected support of the itemset. As a consequence, when $k$ and the privacy budget are fixed, the lower the expected supports of the top $k$ most frequent itemsets of a dataset, the higher the RE of U-PrivMining. For the same dataset and privacy budget, RE values increase with increasing $k$, because the noise added to each expected support is drawn from a Laplace distribution whose scale, $k/\varepsilon_{2,2}$, grows with $k$.

Discussion.
In the field of medicine, because patients differ in physical condition, the same physiological index corresponds to a different symptom association probability for each patient. Many medical technologies are available for obtaining the symptom association probability for patients. As a result, there is uncertainty in patient data. However, existing algorithms for mining frequent itemsets from medical data in differentially private ways are all based on certain data and cannot be directly used for uncertain medical data. Therefore, in this paper, we proposed the U-PrivMining algorithm, which can mine the top $k$ most frequent itemsets from uncertain medical data and ensure differential privacy. The experimental results verified the effectiveness of U-PrivMining.

Conclusion
In this paper, we proposed a new algorithm to mine the top $k$ most frequent itemsets from uncertain medical data, where each item corresponds to a patient symptom, while protecting data privacy. These frequent itemsets can assist physicians in making diagnoses. Through theoretical and experimental analyses, we can conclude that U-PrivMining ensures differential privacy and that, with increasing privacy budget, the top $k$ most frequent itemsets obtained by U-PrivMining and the noisy expected supports of these frequent itemsets approach the true top $k$ most frequent itemsets and their expected supports for uncertain data, respectively. However, the privacy budget allocation may not be optimal. The optimization of privacy budget allocation will be the focus of future research.

Figure 1: F-score by varying privacy budget.

Figure 2: RE by varying privacy budget.
Definition 5 (count query set [15]). Let $H = \{h_1, h_2, \ldots, h_{|H|}\}$ be a set of itemsets with $|H|$ itemsets. A count query set is composed of a number of queries. Let $Q = (q_1, q_2, \ldots, q_{|H|})$ be the count query set, where each query $q_i$ ($1 \le i \le |H|$) asks for the expected support of the $i$th itemset in $H$.
Definition 6 (threshold query set [15]).
Based on the sensitivity of the count query and count query set for uncertain data, we can conclude that U-PrivMining guarantees $\varepsilon$-differential privacy. The proof procedure is outlined below.

Table 3 :
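Definition 11 (relative error [15]). Let $\mathit{esup}(X)$ and $\widehat{\mathit{esup}}(X)$ be the expected support of itemset $X$ for uncertain data and the noisy expected support of itemset $X$ obtained in the second phase of U-PrivMining, respectively. The relative error is the median, over the itemsets output by U-PrivMining, of $|\widehat{\mathit{esup}}(X) - \mathit{esup}(X)| / \mathit{esup}(X)$.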
U-PrivMining can identify the top $k$ most frequent itemsets from uncertain data in a differentially private way. In traditional algorithms for mining the top $k$ most frequent itemsets from uncertain data and certain data, the $k$ values were predetermined by users or domain experts [7].