A Privacy-Preserving Intelligent Medical Diagnosis System Based on Oblivious Keyword Search

1 Network and Information Center, Institute of Network Technology, Beijing University of Posts and Communications, Beijing 100876, China
2 Science and Technology on Information Transmission and Dissemination in Communication Networks Laboratory, Beijing 100876, China
3 National Engineering Laboratory for Mobile Network Security (No. [2013] 2685), Beijing 100876, China
4 Network and Information Center, Institute of Network Technology and Institute of Sensing Technology and Business, Beijing University of Posts and Communications, Beijing 100876, China
5 School of Computer Science and Technology, Nanjing Normal University, Nanjing, Jiangsu 210023, China
6 Department of Informatics, University of Leicester, Leicester LE1 7RH, UK


Introduction
With growing health consciousness, the intelligent medical diagnosis system (IMDS) has become popular around the world. Nowadays, patients can get personalized online medical services based on the health data they submit themselves, without consulting a doctor in person. That is, patients can obtain a reasonable diagnosis anytime and anywhere, as long as they have health data and are willing to transfer them to the server. People are becoming more and more accustomed to the convenience that IMDS makes possible. Moreover, this kind of intelligent medical service is inexpensive, so popularizing IMDS can reduce the expenditure on public medical services. All of the above suggests that IMDS has a bright future and will be an essential part of daily life [1][2][3].
However, privacy disclosure has become a major obstacle to the development of IMDS. Patients are afraid that their sensitive personal information may fall into the wrong hands. In most cases, poor security mechanisms and weak safety awareness are the main reasons for information leakage. For example, in 2009, AvMed Health Plans, a large nonprofit US health plan organization, exposed the personal information of 200,000 subscribers and their dependents as a result of the theft of two company laptops that contained sensitive information [4]. The divulged personal information included names, addresses, phone numbers, social security numbers, and protected health information. For this reason, designing a privacy-preserving IMDS, which can ensure the privacy of patients, is an urgent task. Besides, the security of the server cannot be ignored either. In this respect, various schemes have been proposed [5][6][7]. One example is [8], which designs a privacy-preserving recommendation system using homomorphic encryption. This system allows patients to rate physicians based on their satisfaction, so that other patients can choose a popular physician via these ratings. Another physician recommendation system, based on hybrid matrix factorization, is described in [9]. That paper applies text sentiment analysis to patient comments to increase the accuracy of physician grading.
In this paper, by combining an additive homomorphic cryptosystem with oblivious keyword search, we propose a privacy-preserving IMDS in which the security of the data stored on the server is also ensured. After uploading their health examination data to the server, patients can get an effective and reasonable diagnosis from our system. The diagnosis tells the patient which parameters are outside the normal range and which diseases he is likely to have. Moreover, our system contains a preprocessing phase that reduces the amount of computation needed for the search.
The rest of the paper is organized as follows. Section 2 introduces the preliminaries. In Section 3, we present the framework and notation. Section 4 provides a detailed description of our protocol. Section 5 analyzes the security of the system. Finally, Section 6 discusses the advantages of the system and concludes the paper.

Preliminaries
2.1. Paillier Cryptosystem. In 1999, Paillier proposed a homomorphic public key cryptosystem, known as the Paillier cryptosystem, which is a common encryption scheme used for data protection [10]. The homomorphic property of this cryptosystem means that the sum of two plaintexts can be obtained by decrypting the product of their corresponding ciphertexts. Besides, the Paillier cryptosystem is semantically secure, which means that no information about the plaintext can be obtained from the corresponding ciphertext. Therefore, calculations over ciphertexts in the Paillier cryptosystem do not reveal any extra information. The Paillier cryptosystem is briefly described as follows.
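Let $m$ be a message, let $p$ and $q$ be two large prime numbers, $n = pq$, let $g$ be a random integer in $\mathbb{Z}_{n^2}^{*}$, and let $r$ be a random number in $\mathbb{Z}_{n}^{*}$. The encryption of $m$ is defined as

$$c = E(m) = g^{m} r^{n} \bmod n^{2}. \tag{1}$$

The decryption of $c$ is defined as

$$m = D(c) = L(c^{\lambda} \bmod n^{2}) \cdot \mu \bmod n, \tag{2}$$

where $L(u) = (u - 1)/n$, $\mu = (L(g^{\lambda} \bmod n^{2}))^{-1} \bmod n$, and $\lambda = \mathrm{lcm}(p - 1, q - 1)$.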
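As a minimal sketch of the scheme above, the following Python code implements key generation, encryption, and decryption with toy primes and checks the additive homomorphism. The function names (`keygen`, `encrypt`, `decrypt`), the toy primes, and the choice $g = n + 1$ are assumptions made for illustration; they are not taken from [10], and the key size is far too small for real use.

```python
import math
import secrets


def L(u: int, n: int) -> int:
    """The function L(u) = (u - 1) / n used in decryption."""
    return (u - 1) // n


def keygen(p: int, q: int):
    """Return ((n, g), (lam, mu)) for primes p and q (toy sizes only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1  # assumption: a common valid choice instead of a random g
    mu = pow(L(pow(g, lam, n * n), n), -1, n)
    return (n, g), (lam, mu)


def encrypt(pk, m: int) -> int:
    """c = g^m * r^n mod n^2 with a fresh random r coprime to n."""
    n, g = pk
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return pow(g, m, n * n) * pow(r, n, n * n) % (n * n)


def decrypt(pk, sk, c: int) -> int:
    """m = L(c^lam mod n^2) * mu mod n."""
    n, _ = pk
    lam, mu = sk
    return L(pow(c, lam, n * n), n) * mu % n


pk, sk = keygen(1009, 1013)  # hypothetical toy primes
c1, c2 = encrypt(pk, 120), encrypt(pk, 35)
# Multiplying the two ciphertexts adds the two plaintexts.
total = decrypt(pk, sk, c1 * c2 % (pk[0] ** 2))
print(total)  # 155
```

The party that multiplies `c1` and `c2` never needs the private key, which is the property the diagnosis system relies on when the server computes over uploaded ciphertexts.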

2.2. Oblivious Keyword Search (OKS).
The oblivious keyword search (OKS) protocol was proposed by Wakaha Ogata and Kaoru Kurosawa in 2002; it is a secure keyword search scheme between two parties [10]. In an OKS protocol, there is a server $S$ that maintains some secret data and a user $U$ that is allowed to search for the data containing a keyword chosen by the user; the chosen keyword is kept secret from $S$. Next, we introduce an efficient $k$-out-of-$n$ OKS protocol based on the RSA blind signature (the $\mathrm{OKS}_{k}^{n}$ protocol). In the $\mathrm{OKS}_{k}^{n}$ protocol, $S$ stores data $B_1, \ldots, B_n$. Define

$$B_i = (w_i, c_i), \tag{3}$$

where $\Delta = \{w_1, \ldots, w_n\}$ is the set of keywords, $w_i \in \Delta$, and $c_i$ is the corresponding content.
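2.2.1. Submit Phase. $S$ generates a public key $(N, e)$ and a secret key $d$ of RSA and then, for $i = 1, \ldots, n$, computes

$$K_i = (H(w_i))^{d} \bmod N, \qquad E_i = G(w_i \,\|\, K_i \,\|\, i) \oplus (0^{l} \,\|\, c_i), \tag{4}$$

where $H$ is a hash function and $G$ is a pseudorandom generator. Then $S$ sends $E_1, \ldots, E_n$ to $U$.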

2.2.2. Transfer Phase.
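At each transfer round $j$, $U$ first chooses a keyword $w_j^{*}$ and a random element $r$ to compute

$$Y = r^{e} H(w_j^{*}) \bmod N \tag{5}$$

and then sends $Y$ to $S$. Secondly, $S$ computes $K_j' = Y^{d} \bmod N$ and sends it to $U$. $U$ computes $K_j = K_j'/r = (H(w_j^{*}))^{d} \bmod N$. Finally, let $J = \emptyset$ be the set that stores the search results. For $i = 1, \ldots, n$, $U$ computes

$$(a_i \,\|\, b_i) = E_i \oplus G(w_j^{*} \,\|\, K_j \,\|\, i). \tag{6}$$

If $a_i = 0^{l}$, then $U$ adds $(i, b_i)$ to $J$.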
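The submit and transfer phases can be sketched in Python as follows. This is an illustration under stated assumptions, not the reference implementation of the protocol: the RSA key is built from two known Mersenne primes and is far too small to be secure, SHA-256 and SHAKE-256 stand in for the hash function and the pseudorandom generator, the tag $0^{l}$ is 16 zero bytes, and all function names and the sample records are hypothetical. Both parties run in one process, so `sign` plays the role of the server.

```python
import hashlib
import math
import secrets

# Toy RSA key built from two Mersenne primes (assumption; not a secure size).
P, Q = 2**31 - 1, 2**61 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))  # the server's secret exponent d
ZEROS = bytes(16)                  # the tag 0^l, here l = 128 bits


def H(w: str) -> int:
    """Hash a keyword into Z_N (SHA-256 is an illustrative choice)."""
    return int.from_bytes(hashlib.sha256(w.encode()).digest(), "big") % N


def G(w: str, k: int, i: int, length: int) -> bytes:
    """Pseudorandom pad derived from (w || K || i) via SHAKE-256."""
    return hashlib.shake_256(f"{w}|{k}|{i}".encode()).digest(length)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def commit(data):
    """Server: E_i = G(w_i || K_i || i) XOR (0^l || c_i), K_i = H(w_i)^d mod N."""
    out = []
    for i, (w, c) in enumerate(data, start=1):
        body = ZEROS + c.encode()
        out.append(xor(G(w, pow(H(w), D, N), i, len(body)), body))
    return out


def blind(w: str):
    """User: Y = r^e * H(w) mod N for a random r coprime to N."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return r, pow(r, E, N) * H(w) % N


def sign(y: int) -> int:
    """Server: K' = Y^d mod N, computed without seeing the keyword."""
    return pow(y, D, N)


def search(w: str, commitments):
    """User: unblind K = K'/r, then test every E_i for the 0^l tag."""
    r, y = blind(w)
    k = sign(y) * pow(r, -1, N) % N
    hits = []
    for i, e_i in enumerate(commitments, start=1):
        plain = xor(e_i, G(w, k, i, len(e_i)))
        if plain[:len(ZEROS)] == ZEROS:
            hits.append((i, plain[len(ZEROS):].decode()))
    return hits


records = [("1||1", "d1"), ("2||2", "d2"), ("1||1", "d3")]  # made-up data
commitments = commit(records)
print(search("1||1", commitments))  # [(1, 'd1'), (3, 'd3')]
```

In this sketch the length of each commitment reveals the length of its content; padding the contents to a fixed length would avoid that.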

2.3. Useful Tools.
We introduce here two generic subprotocols from the literature [8, 11-13], which will be employed in our protocol. Both can be implemented using the Scalar Product Protocol, a standard protocol described in [13]. Moreover, let $[x]$ denote an encrypted value of $x$. The subprotocols are described as follows: (1) Bits($k$, $[x]$): return $[x]_B$, which is an encryption of the $k$ least significant bits of the plaintext of $[x]$.

The Framework
In this section, we develop the framework of our privacy-preserving IMDS. To begin, we describe the two entities that interact with each other: the user $U$ and the server $S$.
On the side of $U$, each user measures $n$ private health examination parameters of his own body in advance, such as blood glucose, vital capacity, and vitamin content. $U$ uploads these health data to $S$ to obtain the health service. We denote the $n$ parameters that $U$ measured as $p_1, \ldots, p_n$ and the corresponding measured values as $x_1, \ldots, x_n$. On the side of $S$, the server stores a wide range of diseases $d_1, \ldots, d_h$ and generally more health examination parameters $p_1, \ldots, p_m$ than $U$, that is, $n \le m$. Besides, we define $V_i$ as the parameter vector of $d_i \in \{d_1, \ldots, d_h\}$, which represents the relationship between the disease $d_i$ and all the parameters $p_1, \ldots, p_m$ stored in $S$. If the $j$th entry of $V_i$ is 1, then the measured value of parameter $p_j$ will be smaller than the normal range when disease $d_i$ occurs. Similarly, 2 means that $p_j$ will be larger than the normal range, and 0 means that $p_j$ will be in the normal range. Obviously, $V_i$ has $m$ entries. Each $p_j \in \{p_1, \ldots, p_m\}$ has an upper limit $u_j$ and a lower limit $l_j$ that delimit the normal range. In $S$, the above-mentioned $d_i$ and $V_i$ are represented by $B_1, \ldots, B_h$. Define

$$B_i = (d_i, V_i).$$

Next we introduce our framework. It contains three phases: the submit phase, the preprocessing phase, and the search phase. The submit phase builds on the Paillier cryptosystem. $U$ first uploads his health examination data $[x_1], \ldots, [x_n]$ to $S$ in encrypted form, under a Paillier key pair that $U$ generates; encryption uses the public key, and only $U$ holds the private key. At the same time, $U$ directly tells $S$ which parameters he has uploaded, namely, $p_1, \ldots, p_n$. Then, $S$ compares all $\{[x_i]\}_{i \in [1,n]}$ with $[u_i]$ and $[l_i]$ to get the parameter vector $[V_U]$, which indicates whether each health examination parameter of $U$ is in the normal range. The generation procedure of $V_U$ is depicted in Figure 1. The comparison is carried out by the functions Bits and Min mentioned in Section 2. In this comparison, $S$ learns nothing about the uploaded data but retains the ability to compute on them. Finally, $S$ returns $[V_U]$ to $U$ in encrypted form.
To facilitate the subsequent search operation and make it faster, we propose a transform method in the preprocessing phase. $S$ uses this method to split and reshape all $V_i$ into the keyword set $WP$ and reorganizes $B_1, \ldots, B_h$ into the new data structures $B_1^{*}, \ldots, B_{h^{*}}^{*}$, which are pairs of a keyword and a disease. After that, to prepare the oblivious keyword search (OKS) used in the search phase, $S$ calculates $E_1, \ldots, E_{h^{*}}$ and sends them to $U$. $E_1, \ldots, E_{h^{*}}$ can be understood as the encrypted form of $B_1^{*}, \ldots, B_{h^{*}}^{*}$.

In the search phase, we utilize OKS to realize privacy-preserving search. Firstly, $U$ constructs a set of keywords on the basis of $V_U$, successively calculates $Y$ for each keyword, and sends it to $S$. $Y$ can be understood as the encrypted (blinded) form of a keyword. Then, $S$ computes $K'$ using $Y$ and returns it to $U$. Next, $U$ retrieves the keywords constructed from $V_U$ in $E_1, \ldots, E_{h^{*}}$ via (6) and thus gets their corresponding diseases. Finally, according to the search results, $U$ takes the disease with the highest frequency as the diagnosis. The whole structure of our IMDS is depicted in Figure 2.

Our Protocol
In this section, we detail the implementation of our protocol. The comparison results are transferred to $U$ after being encrypted. According to these results, $U$ constructs the parameter vector $V_U$. We define the parameter vector $V_U$ of the user $U$ as follows:

$$V_U[i] = \begin{cases} 1, & x_i < l_i, \\ 2, & x_i > u_i, \\ 0, & \text{otherwise}, \end{cases} \qquad i \in [1, n].$$

The algorithm is shown in Algorithm 1.
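To make the encoding of the user's parameter vector concrete, the following Python sketch computes it on plaintext values. It only illustrates the 0/1/2 encoding described in Section 3. In the protocol itself the comparison is carried out by the server over ciphertexts, which this sketch does not reproduce. The function name `parameter_vector` and the sample numbers are hypothetical.

```python
def parameter_vector(values, lower, upper):
    """Encode each measured value against its normal range.

    1 = below the lower limit, 2 = above the upper limit, 0 = in range.
    """
    vector = []
    for x, lo, hi in zip(values, lower, upper):
        if x < lo:
            vector.append(1)
        elif x > hi:
            vector.append(2)
        else:
            vector.append(0)
    return vector


# Made-up measurements and limits for three parameters.
measured = [3.2, 6.1, 9.0]
lower_limits = [3.9, 4.0, 5.0]
upper_limits = [6.1, 6.1, 8.0]
print(parameter_vector(measured, lower_limits, upper_limits))  # [1, 0, 2]
```

A value that equals a limit counts as normal in this sketch; the text does not say how boundary values are treated, so that choice is an assumption.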

Preprocessing.
To make the following search process faster, the data $B_1, \ldots, B_h$ stored in $S$ need to be reorganized into a new structure. Let $WP = \{w_{i1}, w_{i2}\}_{i \in [1,m]}$ be the set of keywords. Define

$$w_{i1} = i \,\|\, 1, \qquad w_{i2} = i \,\|\, 2,$$

where $i$ represents the mark number of $p_i$. That is, a keyword is obtained by concatenating the mark number of a parameter with its value in a parameter vector. Note that only abnormal parameters are selected to create keywords. Now we define the new data structure as

$$B_j^{*} = (w_j, d_j^{*}),$$

where $w_j \in WP$ and $d_j^{*}$ is one of the diseases whose parameter vector $V$ contains $w_j$. The specific transformation method is described in Algorithm 2.
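The transformation can be sketched in Python as follows. This is one possible reading of the description above, not a transcription of Algorithm 2: it emits one (keyword, disease) pair for every abnormal entry of every disease vector. The function name `build_keyword_pairs`, the `"i||v"` string encoding of a keyword, and the sample vectors are assumptions made for illustration.

```python
def build_keyword_pairs(disease_vectors):
    """Turn {disease: parameter vector} into a list of (keyword, disease) pairs.

    A keyword is the 1-based mark number of a parameter concatenated with
    its value (1 = too low, 2 = too high). Normal entries (0) are skipped.
    """
    pairs = []
    for disease, vector in disease_vectors.items():
        for index, value in enumerate(vector, start=1):
            if value != 0:
                pairs.append((f"{index}||{value}", disease))
    return pairs


# Made-up parameter vectors for two diseases over three parameters.
vectors = {"d1": [1, 0, 2], "d2": [0, 0, 2]}
print(build_keyword_pairs(vectors))
# [('1||1', 'd1'), ('3||2', 'd1'), ('3||2', 'd2')]
```

Because the same keyword can appear with several diseases, a single keyword search may return more than one suspected disease.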
Next, $S$ generates a public key $(N, e)$ and a secret key $d$ of RSA and publishes only $(N, e)$. With the secret key $d$ and the hash value of each $w_j \in WP$, $S$ computes $K_j$ and $E_j$. Finally, $S$ outputs $\{E_j\}_{j \in [1, h^{*}]}$ to $U$, where $h^{*} = 2m$. In addition, let $l$ be a security parameter, $G$ be a pseudorandom generator, and $H$ be a hash function. The computational process is also described in Algorithm 2.

Transfer Phase.
In the search phase, $U$ successively finds the abnormal parameters by checking whether their values in $V_U$ are nonzero and constructs keywords for them. For example, if the $i$th entry of $V_U$ equals 0, then $p_i$ is a normal parameter for $U$; otherwise, $p_i$ is an abnormal parameter. At each transfer round $j$, after choosing an abnormal parameter, $U$ creates a keyword $w_j^{*}$ by concatenating the mark number of the chosen parameter with its value in $V_U$. Then, $U$ calculates $Y$, the blinded form of $w_j^{*}$, and sends it to $S$. $S$ returns the search result $K_j'$. After unblinding $K_j'$, $U$ retrieves the keywords constructed from $V_U$ in $E_1, \ldots, E_{h^{*}}$ via (6) and thus gets their corresponding diseases. We keep these diseases in a list of suspected diseases across the transfer rounds and denote this list as $D$. When all abnormal parameters have been traversed, we compute the frequency of occurrence of the diseases in $D$ and output a disease list $D_{\max}$ that consists of the diseases with the highest frequency as the final search result. The algorithm is described in Algorithm 3.
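The final aggregation step can be sketched in a few lines of Python. The function name `most_frequent_diseases` and the sample list are hypothetical. Ties are kept, so the result can contain several diseases, in line with the description of the output as a list.

```python
from collections import Counter


def most_frequent_diseases(suspected):
    """Return all diseases that occur most often in the suspected list."""
    if not suspected:
        return []
    counts = Counter(suspected)
    highest = max(counts.values())
    return sorted(name for name, count in counts.items() if count == highest)


# Made-up search results collected over several transfer rounds.
suspected = ["d1", "d2", "d1", "d3", "d2"]
print(most_frequent_diseases(suspected))  # ['d1', 'd2']
```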

Safety Analysis
We present here the analysis of the security of our system.

Lemma 1. The problem of computing $n$th residue classes is believed to be computationally difficult.

Lemma 2. The RSA known target inversion problem (RSA-KTI) is hard.
Firstly, we discuss the security of $U$. In the submit phase, $\{x_i\}_{i \in [1,n]}$, the health examination data of the patient $U$, are the data that the attacker $A$ wants to steal. However, the only message that $A$ may get is the ciphertext $[x_i]$, which is encrypted under the Paillier public key of $U$ and can be decrypted only with the private key $sk$ of $U$. The Paillier cryptosystem produces this key pair, and it is the basis of this phase. If $A$ wants to get the plaintext $x_i$ without knowing the private key, his task is, given a composite $n$ and an integer $z$, to decide whether $z$ is an $n$-residue modulo $n^{2}$ or not. By Lemma 1, this task is hard. Therefore, $A$ cannot get the plaintext $x_i$. In the search phase, the important private data of $U$ is $w_j^{*}$, but $S$ has no information on $w_j^{*}$ because it is blinded in the RSA blind signature scheme.
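The blinding argument can be made slightly more explicit. The following is a sketch under two assumptions that the text does not state: the random element $r$ is drawn uniformly from $\mathbb{Z}_N^{*}$, and $H(w_j^{*})$ is invertible modulo $N$. Since $ed \equiv 1 \pmod{\varphi(N)}$,

$$Y = r^{e} H(w_j^{*}) \bmod N, \qquad K_j' = Y^{d} \equiv r^{ed} (H(w_j^{*}))^{d} \equiv r\,(H(w_j^{*}))^{d} \pmod N, \qquad K_j = K_j'/r \equiv (H(w_j^{*}))^{d} \pmod N.$$

Because $r \mapsto r^{e} \bmod N$ is a permutation of $\mathbb{Z}_N^{*}$, the factor $r^{e}$ is uniformly distributed, and so is $Y$, whatever keyword was chosen. The value that $S$ sees in a transfer round is therefore distributed independently of $w_j^{*}$.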
Next, we prove the security of $S$, assuming that the attacker $A$ is allowed to make at most $k$ queries to $S$. At first, $A$ behaves as if it were $U$. $S$ generates and sends $(N, e)$ to $A$. After that, $A$ randomly chooses $Y_j$ and sends them to $S$. By Lemma 2, it is hard for $A$ to get the plaintext.

Conclusion
In this paper, we propose a privacy-preserving intelligent medical diagnosis system and also discuss how privacy-preserving protocols can be used to protect sensitive patient data in medical scenarios. The system applies two security tools, the Paillier cryptosystem and oblivious keyword search (OKS), to medical diagnosis, and it can be put into practice. Besides, our system has the following advantages:
(1) The previously mentioned information security requirements are achieved. That is, the privacy of patient data and the security of the server are properly protected. The server is blind to the personal information of patients, and patients learn nothing about the data maintained on the server beyond their search results.
(2) Our system reduces the amount of computation needed for the search by adding the preprocessing phase. This phase links each keyword $w$ with the corresponding disease name $d$. Hence, we can directly get $d$ after $w$ is retrieved instead of looking up $d$ in the database. In the search phase, the system only needs to retrieve the $n$ submitted parameters instead of every parameter in the database, which also makes the search phase more efficient.
(3) The system is able to return multiple possible diseases instead of a single result. When the submitted data are not sufficient to determine a single disease, the system shows patients several results for reference.
Therefore, our privacy-preserving intelligent medical diagnosis system is capable of providing an efficient and reasonable diagnosis for patients. We view this work as the starting point for our follow-up work, and there is still a lot of work to be done. In the future, we will focus on multiuser and multiserver IMDS. It is necessary to develop a system that can simultaneously and securely provide service for multiple users.

Figure 1: The structure of our intelligent medical diagnosis system.

Figure 2: The structure of our privacy-preserving intelligent medical diagnosis system.