Healthcare Big Data Privacy Protection Model Based on Risk-Adaptive Access Control

Edge computing is playing an increasingly important role in the field of health care. Edge computing provides high-quality personalized services to patients based on user and device data. However, edge nodes collect a large amount of sensitive patient information, so patients bear the risk of privacy disclosure while enjoying personalized services. This paper studies how to reduce the risk of privacy disclosure while ensuring that patients enjoy the personalized services brought by edge computing. The workflow and management mode of the Hospital Information System (HIS) are investigated on the spot, and an entropy-based risk-adaptive access control model is established. First, we use the International Classification of Diseases, Tenth Revision (ICD-10) to mark the information resources accessed by users and use information entropy to measure the correlation α between the medical information accessed by users and their work tasks. Finally, we analyze the relationship between correlation α and risk through an example. The results show that users with high correlation α have low risk of access behavior, and users with low risk have high correlation α between the information resources they access and their work goals. This discovery can help managers predict users' access behavior in the Big Data environment, so as to dynamically formulate access control policies according to the actual access situation of users and thereby realize the privacy protection of medical Big Data.


Introduction
Edge computing is used to extend cloud computing to the edge of the Web. Specifically, edge computing is a new distributed computing mode, in which multiple edge nodes located between cloud servers and local users cooperate to complete outsourced storage and computing tasks [1]. Edge computing is increasingly used in various areas of people's life, including smart home, healthcare, industrial production, and media entertainment [2]. Especially in healthcare, medical informatization based on edge computing technology can greatly improve medical efficiency. At present, the research of edge computing combined with medical scenarios mainly focuses on the design of network communication protocol and routing algorithm but neglects the research of information security and privacy protection in medical scenarios. As a result, the development of current medical application scenarios based on edge computing technology is blocked. As medical Big Data contains a wealth of patient-sensitive information, the collection, transmission, use, and sharing of data bring great security risks to people.
Medical Big Data is the core asset of the medical field. However, most hospitals lack special privacy protection measures, and the medical industry has become one of the most serious areas of privacy leakage [3]. In 2016, the Internet Security Threat Report listed the 10 industries with the most serious data leakage. The medical industry ranked first, with 116 data leakage incidents and an incident rate of 37.2%, a proportion much higher than that of the second-ranked retail industry. In 2017, 47 GB of medical information was leaked from an Amazon server and the private information of 150,000 patients was exposed, which had a great negative impact on both society and individuals. Therefore, it is very necessary to study the security and privacy issues in the medical industry.
At present, research on data security and privacy protection technology mainly includes access control, data anonymization, data encryption, differential privacy protection, digital watermarking, and so on. Among them, access control technology has become a hotspot of current research, but it mainly targets the field of operating systems, and there are few studies in the information field, especially on the security and privacy protection of medical Big Data. Inspired by previous studies [4][5][6], this paper proposes a medical Big Data privacy protection model based on risk-adaptive access control. The main contributions are as follows: (i) We build a set of diagnostic codes that users can access under specific work goals. Based on the ICD-10 diagnostic codes, we cluster the set of diagnostic codes that users are allowed to access under the target disease. (ii) We design a method to evaluate user access behavior. According to the user's historical access data, we use information entropy to evaluate the risk of the user's current access request. (iii) We propose a new method for calculating risk benchmarks. Based on the users' information entropy, we use K-means clustering to calculate the baseline value of risk assessment. The rest of the paper is organized as follows. Section 2 introduces the related work on medical data privacy protection. Section 3 introduces the construction of the medical Big Data privacy protection model. Section 4 is the experimental simulation. Section 5 summarizes the paper and discusses directions of future work.

Related Work
Although edge computing improves medical efficiency, medical data may leak during transmission and access and cause serious damage to patients. Therefore, the security and privacy protection of mobile medical information are particularly important. It is urgent to establish privacy security guarantee and monitoring mechanisms to ensure the safe reading and access of medical data. The current research on Big Data security and privacy protection mainly adopts technologies based on cryptography, differential privacy, anonymization, identity authentication, and so on. Gao et al. [7] proposed the Sensitive Data Timed I/O Automata (SDTIOA) model, an automatic transformation method for modeling the timed privacy requirements of IoT service compositions. They used the SDTIOA model to verify whether a service composition meets user privacy requirements, which can effectively prevent the leakage of user privacy information. In the literature [8][9][10][11][12][13][14][15][16][17][18][19], scholars used differential privacy technology to establish privacy protection models for medical Big Data security issues. Zhang et al. [20] used encryption technology to study the security of medical data according to the sequence of events; He et al. [21] protected medical information through anonymization and identity authentication. In order to protect the privacy of medical data, Li and Zhang [22] chose to desensitize and anonymize their EMR data before authorizing a third party to use it. Chen et al. [23] proposed an electronic medical record system based on blockchain combined with proxy re-encryption, which ensured the security of medical data access and realized fine-grained access to data through attribute-based access control.
Xanthidis and Xanthidou [24] designed an error-correction code hash function and constructed a privacy-preserving anonymization algorithm, which to some extent controlled users' access rights and ensured the safe sharing of data between doctors and patients. Some scholars used VPN, SSL, and other technologies to control access to medical data, so as to protect the security of medical data. Malasri and Wang [25] proposed the SNAP (sensor network for assessment of patients) scheme to solve the safety problems of wireless sensor monitoring networks. In Sun et al.'s study [26], patients' physiological signals (such as blood pressure and heart rate) were used to generate symmetric keys with patient characteristics, so as to protect the data security of patients. However, these schemes are time-consuming and not practical for medical scenarios with strict delay requirements.
In recent years, scholars at home and abroad have also carried out research on risk-based access control technology. Jason et al. [27] first put forward the concept of risk and gave basic principles and suggestions related to risk quantification. Dankar and Badji [28] proposed a risk-aware information disclosure model for biomedical data. The model evaluated the risk posed by a data request using all contextual information surrounding the request and fed it into an access control decision module. Ni et al. [29] and Cheng et al. [30] presented specific risk quantification methods according to the safety marks and sensitivity of the subject and object. In addition, a role-based access control (RBAC) model based on risk perception was proposed, which mainly evaluated the trust of users, the relationship between users and roles, and the relationship between roles and permissions. Diep et al. [31] and Sharma et al. [32] mainly conducted risk assessment on the user's access behavior, and the evaluation basis was whether the user's access behavior would cause loss of the integrity, availability, and confidentiality of the information. Ding et al. [33] proposed a privacy-preserving multiparticipant risk-adaptive access control model. This model proposed a privacy quantification method for dynamically accessing data. Furthermore, a multiparticipant access control model was constructed based on evolutionary game theory. Yang et al. [34] designed a flexible access control mechanism based on keyword matching, which enabled data to be shared in a fine-grained access control mode, preventing privacy disclosure while not affecting data usage. Line et al. [35] summarized a variety of access control strategies, analyzed risk assessment criteria according to the actual situation of hospitals, and suggested that future work should focus on expanding access control strategies based on location and situation. Wang and Jin [36] evaluated users' access behaviors based on their historical access information. The more chaotic the distribution of users' access behaviors, the greater the risks that users may cause. Shaikh et al. [37] proposed a risk-based decision access control system, which took into account not only the historical access behavior of users but also their recent access behavior, and the system was suitable for dynamic and complex environments like the medical industry. Choi et al. [38] constructed a context-aware medical information risk access control framework, which mainly judged whether to grant users access permissions based on permission files, user access logs, context information, etc. Hui et al. [39] improved on the work in [36] to prevent doctors from stealing patients' private information by forging work goals. Zhang [40] and Jiang et al. [41] mainly studied the privacy disclosure of medical Big Data in the cloud environment but focused on the analysis of the risk indicator system that may affect privacy disclosure, without involving a specific risk quantification model. A few previous studies [42][43][44] established a risk assessment model for medical Big Data with the help of fuzzy theory.
A comprehensive analysis of relevant research shows that some scholars are currently doing research on the intersection of information and medicine and have made good achievements. The methods mainly focus on cryptography, anonymity, differential privacy, and so on, and some are analyzed from the perspective of management. Although there are also studies from the perspective of risk and access control, they are still in the initial stage of exploration and have not formed a relatively mature system model framework. In particular, research on the privacy protection of medical Big Data based on risk access control is extremely scarce.

Medical Big Data Privacy Protection Model
Workflow refers to the automation of part or the whole of a business process in the computer application environment. It is an abstract and general description of the business rules between the workflow and each operation step [45]. Before studying the privacy protection of medical Big Data, we should be clear about the user's workflow, the authorization management mode, and the information use process in HIS; otherwise, discussing privacy protection apart from the actual situation is meaningless. Therefore, Section 3.1 first investigates and sorts out the workflow and authorization management mode in HIS, and the access control model is then established according to the actual situation.

Workflow and Authority Management Mode in HIS.
Through the field investigation of HIS in some hospitals in Kunming, we found that the system generally includes four main modules: outpatient workflow, inpatient workflow, permission allocation, and drug storehouse. The first three modules are the ones mainly involved in the study of medical Big Data privacy issues. In the outpatient workflow, the outpatient cashier is responsible for logging in to the system to fill in the patient's registration information, outpatient fees, and refund processing; the outpatient doctor selects the department responsible for issuing medical advice. In the inpatient workflow, the inpatient nurse is responsible for the patient's admission registration, prepayment entry, patient admission, and filling in the admission diagnosis information and the patient's basic health information; the resident is responsible for prescribing short-term or long-term medical advice to the corresponding patient, which is reviewed and implemented by the resident nurse.
In terms of authority allocation, most hospitals adopt a role-based authorization management mode. The system administrator first adds the staff's basic information in the basic information bar of the personnel management module. The hospital mainly includes the outpatient department, inpatient department, drug system, clinical department, medical technology department, hospital leadership, and so on. The departments are divided into different offices and wards. When the staff's basic information is added, the system automatically generates the doctor's code or the employee number, and this code is needed in the permission allocation. Then the administrator fills in the employee's department, section, position, title, and other relevant information in the personnel management column and assigns a system account number and password to the employee. The account types include financial personnel, outpatient financial account, outpatient doctor, system user, medical technology department, hospital office, inpatient care, resident doctor, and so on. Users can view the login records of each account type. Finally, the administrator assigns roles to each user. The role management interface includes Administrators, Guests, Public, office staff, financial staff, decision analysis, developer, outpatient registration, outpatient charge, outpatient pharmacy, outpatient doctor, outpatient doctor station, personnel management, data center, resident nurse, resident doctor, medical technology department, and so on. The administrator grants different access modules to different roles. Therefore, the whole process realizes the user-role-permission assignment relationship in HIS.
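The user-role-permission assignment described above can be sketched as two many-to-many mappings. This is a minimal illustration under stated assumptions, not the HIS implementation: the account identifiers and module names are hypothetical placeholders, and only a few of the role names listed above are reused.

```python
# Hypothetical sketch of user-role-permission assignment; the identifiers
# below are illustrative placeholders, not the real HIS schema.
USER_ROLES = {
    "doctor_001": {"resident doctor"},
    "nurse_001": {"resident nurse"},
    "admin_001": {"Administrators", "personnel management"},
}

ROLE_MODULES = {
    "resident doctor": {"inpatient orders", "inpatient records"},
    "resident nurse": {"admission registration", "inpatient orders"},
    "Administrators": {"permission allocation"},
    "personnel management": {"staff information"},
}


def accessible_modules(user):
    """Return the union of modules granted to every role the user holds."""
    modules = set()
    for role in USER_ROLES.get(user, set()):
        modules |= ROLE_MODULES.get(role, set())
    return modules


print(sorted(accessible_modules("admin_001")))
```

Because permissions attach to roles rather than to accounts, changing what a role may access changes it for every user holding that role.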
From the aforementioned analysis, it can be seen that doctors in different jobs and roles have different work tasks and access authority sets. In general, after a patient is discharged from hospital, the patient's paper medical record and medical information are sealed by the medical record room, and doctors with junior professional titles can only use their own employee number and password to query the patient's recent medical information. However, in order not to affect the normal work of doctors, some highly qualified doctors or experts are granted extremely high authority: they can access patient information not only for the whole group but even for the entire district. In addition, the amount of information that doctors need to complete their respective tasks varies depending on the patient's medical history, the number of patients, and other factors. For example, when a patient is diagnosed with cancer, the doctor has access to a lot of sensitive cancer-related information. Evaluating a doctor's access risk by the amount of information the doctor accesses or by the sensitivity of the accessed data is therefore poorly suited to the actual medical environment. This paper is interested in whether the risks caused by users' access behavior are worth it; that is, whether to grant access to a user is determined by weighing the risks against the benefits.

Entropy-Based Risk-Adaptive Access Control Model.
Entropy characterizes the degree of chaos of a random variable's distribution. The more chaotic the distribution is, the greater the entropy is. The essence of entropy is to measure the amount of information. Information entropy, also known as Shannon entropy, is usually used to describe the average amount of information brought by the whole random distribution, and has more statistical characteristics. Since entropy is calculated based on sample data, it is also called empirical entropy. The relevant formula is defined as follows.

Definition 1.
Assuming that X is a random variable with distribution P(X), the self-information I(x) of a value x (x ∈ X) and the information entropy H(X) of X are

$$I(x) = -\log P(x), \qquad H(X) = -\sum_{x \in X} P(x) \log P(x).$$

According to Definition 1, entropy represents the degree of chaos of a random variable's distribution, which can be used to formally represent the instability of user access behavior. Therefore, the entropy-based risk-adaptive access control model aims to quantify users' access behaviors by means of entropy. In HIS, each user has a corresponding work task and accesses the patient's information resources according to that task. We believe that if the information resources accessed by the user are not relevant to the work task, or the correlation is very low, there is a risk that the user is snooping on patient information. A user's access behavior is set to a six-tuple:

$$\langle U, R, T, P, M, \alpha \rangle,$$

where U is the set of all users, R represents the set of roles, T represents the set of tasks, P represents the set of permissions, M is the set of medical records, and α is the correlation between medical records and work tasks. The relationship between these factors is as follows. The same user can be granted different roles, and the same role can also be assigned to different users, so users and roles are in a many-to-many relationship. In Figure 1, User1 can be granted Role1, Role2, and Role4; Role1 can be assigned to both User1 and User2. Tasks are assigned to users according to roles, and roles map users to corresponding tasks. A role can be assigned multiple tasks, and a task can also be assigned to multiple roles, so tasks and roles are also in a many-to-many relationship. In Figure 1, Role3 can access the medical information resources required by Task2 and Task4, and the medical information resources provided for Task2 can be accessed by Role2 and Role3. Tasks are assigned to users through roles, and users exercise their permissions through these tasks. The specific structure and relationship are shown in Figure 1.
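As a brief worked illustration of Definition 1, consider a hypothetical user whose accesses fall on three record sets with probabilities 1/2, 1/4, and 1/4. The base-2 logarithm is an assumption made here for concreteness; the text does not state which base is used.

```latex
\begin{aligned}
H(X) &= -\left( \tfrac{1}{2}\log_2 \tfrac{1}{2}
              + \tfrac{1}{4}\log_2 \tfrac{1}{4}
              + \tfrac{1}{4}\log_2 \tfrac{1}{4} \right) \\
     &= \tfrac{1}{2}\cdot 1 + \tfrac{1}{4}\cdot 2 + \tfrac{1}{4}\cdot 2 = 1.5 .
\end{aligned}
```

By contrast, a user whose accesses all fall on a single record set has P = 1 for that set and H(X) = 0, so more dispersed access behavior yields higher entropy.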
The following analyzes in detail how to evaluate the user's access behavior. As can be seen from the previous analysis, the correlation between the user's work task and the medical records accessed by the user is an important basis for evaluating the user's access behavior. Therefore, we should first clarify two questions before evaluating users' access behavior: (a) how to mark the medical information accessed by users and (b) how to quantify the risk value of known users' historical access behavior. The rest of Section 3.2 focuses on these two aspects to analyze the user's access behavior.

Marking Medical Information.
Most hospitals use the ICD-10 code to classify and label the diagnosis results of patients. ICD-10 is the 10th revision of the International Statistical Classification of Diseases and Related Health Problems. Each disease name corresponds to a single code, and in turn, the disease name can be found through the disease classification code, so that medical staff can extract the required medical record data. Generally, it is "one disease, one code," which provides a unified classification standard for medical record management, health statistics, medical information utilization, and scientific research.
Although there is a one-to-one relationship between disease name and code, in practice it is normal for users to access some medical record information related to the disease in order to fully understand it. Therefore, according to the characteristics of the ICD-10 diagnostic code, the paper groups the medical records of similar diseases through clustering. China's medical system uses a 6-digit ICD-10 code. Taking B18.202 as an example, the first three characters are a mixed code of a letter and numbers, which is the category code with practical significance. The fourth digit or letter is a subclassification of the first three characters, and the more similar two diseases are, the more similar their ICD-10 codes are. Therefore, codes whose first letter is the same and whose second and third digits are the same are grouped into one category.
Assume that the diagnostic code of a target disease is B18.902, and the set of diagnostic codes after clustering is as follows: S = {B18.000, B18.001, B18.002, B18.003, B18.103+N08.0*, B18.202, B18.204, B18.902, B18.904+N08.0*}. When the medical information records accessed by a user under a specific work task belong to set S, the user's access behavior is within the scope of the work target. However, when the medical information records accessed by a user under a specific work task do not belong to set S, we need to evaluate the instability of the access behavior and specifically quantify the correlation α. Therefore, it is theoretically feasible to mark medical information accessed by users through diagnostic codes.
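The grouping rule described above can be sketched in a few lines of Python. This is a minimal sketch, assuming that "same first letter and same second and third digits" means grouping by the three-character category prefix; the function names are hypothetical, and combined codes such as B18.103+N08.0* are grouped by their first code only.

```python
from collections import defaultdict


def category(code):
    """Return the three-character ICD-10 category, e.g. 'B18' for 'B18.902'."""
    return code.strip().upper()[:3]


def cluster_codes(codes):
    """Group diagnostic codes that share the same three-character category."""
    groups = defaultdict(set)
    for code in codes:
        groups[category(code)].add(code)
    return dict(groups)


def in_work_scope(accessed_code, target_code):
    """True if the accessed code falls in the same category as the target disease."""
    return category(accessed_code) == category(target_code)


codes = ["B18.000", "B18.103+N08.0*", "B18.902", "A19.900"]
print(cluster_codes(codes))
print(in_work_scope("B18.204", "B18.902"))
```

An access outside the target category is not rejected by this check alone; it only marks the access as one whose deviation needs to be quantified.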

Risk Quantification.
We define the degree of deviation between users' access to medical information and their work tasks as risk. The greater the deviation is, the greater the possibility of privacy snooping. When the risk reaches a certain level, users' access behavior should be controlled. We use entropy to quantify the user's access behavior. Before calculating the entropy of the user's access behavior, we need to define the probability P of the user accessing the medical information record set $m_k$.
Let $D = \{d_1, d_2, \ldots, d_{I_d}\}$ be the set of patients received by user $u_i$ within a period of time, and let $M = \{m_1, m_2, \ldots, m_{I_m}\}$ be the medical record sets accessed by user $u_i$ for patient $d_j$. Then the probability of the user accessing medical record set $m_k$ is as follows:

$$P(m_k) = \frac{\| f(m_k) \|}{\sum_{l=1}^{I_m} \| f(m_l) \|}, \tag{3}$$

where $I_d$ represents the number of patients, $I_m$ represents the number of elements in the set M, $d_j \in D$, $m_k \subseteq M$ is a set of diagnostic codes for a certain disease in the medical system, each element in M represents a set of diagnostic codes, and $\| f(m_k) \|$ represents the number of times the user accesses the set $m_k$. Therefore, the information entropy of the medical information accessed by user $u_i$ for patient $d_j$ under a specific task within a period of time is

$$H(u_i, d_j) = -\sum_{k=1}^{I_m} P(m_k) \log P(m_k). \tag{4}$$

Then $H(u_i, d_1), H(u_i, d_2), \ldots, H(u_i, d_{I_d})$ are the entropies of user $u_i$'s access to medical information resources when treating each patient for a specific task within a certain period of time. By calculating their mean value, we get the entropy $H(u_i)$ of user $u_i$ accessing medical information resources for specific work tasks in a certain period of time:

$$H(u_i) = \frac{1}{I_d} \sum_{j=1}^{I_d} H(u_i, d_j). \tag{5}$$
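The per-patient and per-user entropies can be computed from an access log as sketched below. This is an illustrative sketch, not the authors' code: it assumes that each logged access is already labeled with the record set it belongs to, that probabilities are relative access frequencies, and that the logarithm is base 2. The function names and sample labels are hypothetical.

```python
import math
from collections import Counter


def access_probabilities(accesses):
    """Relative frequency of each record-set label in a list of accesses."""
    counts = Counter(accesses)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def patient_entropy(accesses, base=2.0):
    """Entropy of one user's accesses for one patient (0.0 if no accesses)."""
    if not accesses:
        return 0.0
    probs = access_probabilities(accesses)
    return -sum(p * math.log(p, base) for p in probs.values())


def user_entropy(accesses_by_patient, base=2.0):
    """Mean of the per-patient entropies for one user."""
    values = [patient_entropy(a, base) for a in accesses_by_patient.values()]
    return sum(values) / len(values) if values else 0.0


# Hypothetical log: patient id -> record-set labels accessed for that patient.
log = {
    "patient_a": ["B18", "B18", "B18", "B18"],
    "patient_b": ["B18", "B18", "A19", "A19"],
}
print(user_entropy(log))
```

In this toy log the first patient's accesses stay within one record set (entropy 0) while the second patient's are split evenly over two (entropy 1 in base 2), so the user-level value is their mean.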
The higher the entropy $H(u_i)$ of the user's access to medical information, the more unstable the user's access behavior, the greater the deviation between the medical records accessed by the user and their work tasks, the smaller the correlation, and the greater the possibility of disclosing private information. Therefore, the entropy $H(u_i)$ of user access to medical information is inversely proportional to the correlation parameter α, which can be expressed as follows:

$$\alpha \propto \frac{1}{H(u_i)}. \tag{6}$$

Security and Communication Networks
To sum up, we can get the entropies $H(u_1), H(u_2), \ldots, H(u_{I_u})$ of all users accessing medical information in the HIS within the same time period, where $I_u$ represents the number of users in the HIS. Then, we take the entropy of all users' access to medical information as the input of K-means clustering and obtain two clustering centers $(x_1, y_1)$ and $(x_2, y_2)$ [46]. We average the ordinates of the two clustering centers and use the result as the benchmark for risk assessment, $\pi = (y_1 + y_2)/2$. The access behavior risk of each user $u_i$ is then defined relative to this benchmark π by equation (7).
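The benchmark computation can be sketched as follows. Several points here are assumptions rather than the paper's method: the sketch clusters only the scalar entropy values (the paper reports two-dimensional centers), it uses a plain Lloyd iteration with k = 2 instead of a library K-means, the risk function simply takes the excess of the entropy over the benchmark, and the input values are made up for illustration.

```python
def two_means_1d(values, iterations=100):
    """Lloyd's algorithm with k=2 on scalars; returns (low_center, high_center)."""
    if len(set(values)) < 2:
        raise ValueError("need at least two distinct values")
    low, high = min(values), max(values)
    for _ in range(iterations):
        # Assign each value to the nearer center (ties go to the low cluster).
        low_group = [v for v in values if abs(v - low) <= abs(v - high)]
        high_group = [v for v in values if abs(v - low) > abs(v - high)]
        new_low = sum(low_group) / len(low_group)
        new_high = sum(high_group) / len(high_group)
        if new_low == low and new_high == high:
            break
        low, high = new_low, new_high
    return low, high


def risk_benchmark(entropies):
    """Average of the two cluster centers, used as the benchmark pi."""
    low, high = two_means_1d(entropies)
    return (low + high) / 2


def access_risk(entropy, benchmark):
    """ASSUMED risk form: zero at or below the benchmark, the excess above it."""
    return max(0.0, entropy - benchmark)


# Made-up entropies for six users (illustrative only).
entropies = [0.40, 0.45, 0.50, 0.55, 0.60, 0.65]
pi = risk_benchmark(entropies)
print(pi, [round(access_risk(h, pi), 3) for h in entropies])
```

Deriving the benchmark from the data rather than fixing it in advance lets the threshold move with the access patterns actually observed in a given hospital.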

Case Analysis
We take the inpatient department as an example to analyze the effectiveness of the model in practical applications, that is, whether the model can evaluate the user's access behavior risk based on the actual situation of the hospital and the existing data resources. According to the previous analysis, when the information resources accessed by users are not related to the work objectives or the correlation α is low, there is a risk that the user will disclose the patient's private information. Therefore, the validity of the model can be tested by studying correlation α and risk. If users with high correlation α have low risk of access behavior, and users with low risk have high correlation α between the accessed information resources and work objectives, the model in the paper can be considered effective.
According to the requirements of this paper, some medical data have been obtained from the Oracle database of a third-class hospital in Kunming, including the doctor code table Dmb_ysdm, hospital department code table Dmb_ksdm, patient basic information table Zy_hzjbxx, doctor's order table Zy_yz, and patient inpatient information table Zy_hzzyxx. It is worth noting that a table in the hospital database can have more than 200 fields. The tables shown in this paper are regenerated after extracting fields from multiple tables according to the needs of the model. When a doctor accesses electronic medical records, an access log is generated, including the doctor code, patient's identification number, medical record no., and access time. When users view the log content, they need to retrieve Dmb_ysdm and Dmb_ksdm. Under a specific work task, the doctor needs to obtain the basic information of the patient (patient's condition, clinic diagnosis, admission diagnosis, ICD-10 code, medical record no., etc.) according to the patient's medical record number, which requires searching Zy_hzjbxx, Zy_yz, and Zy_hzzyxx. The relevant information is shown in Tables 1 to 5.
As medical institutions prohibit public access to data, this paper presents only a part of patients' data in the basic patient information table Zy_hzjbxx, as shown in Figure 2.
According to Section 3.1, each user in HIS has a corresponding role and access module, and the user's access record can be viewed under the corresponding role, including doctor code, access content, access time, and access frequency. In addition, by retrieving the doctor code, one can also query the user's role, department, job title, and other information. We combine Dmb_ysdm, Dmb_ksdm, Zy_hzjbxx, Zy_yz, and Zy_hzzyxx to get a user's work goals and accessed information resources within a period of time. In the experiment, we randomly selected 30 users from the inpatient department of HIS and tracked and marked their access within half a year. The marked content mainly includes the patient information received by the user in the past half a year, especially the patient's ICD-10 diagnostic code and the medical information record set accessed by the user for the patient. Table 6 shows the medical information accessed by the resident $u_1$ when treating the patient with ID 4151097. Combining equations (3) and (4), we can calculate that the information entropy of the user accessing medical information for the patient's condition within half a year is 0.86. In the same way, according to equation (5), we can get the entropy $H(u_1)$ of the user accessing medical information for a specific task (the patient's diagnostic code is A19.900) within half a year, and the entropies $H(u_2), H(u_3), \ldots, H(u_{30})$ of all users accessing medical information within the same period. The specific entropy values of user access to medical information resources are shown in Figure 3. The entropy in Figure 3 is the basis for calculating the risk benchmark and assessing the risk of user access behavior. We take the entropy of all users accessing medical information resources as the input data set for K-means clustering. Finally, we get two clustering centers $(x_1, y_1)$ and $(x_2, y_2)$, and the results are shown in Figure 4.
As can be seen from Figure 3, in the same time period (half a year in this paper), different users have different access behavior risks for specific work tasks. This experiment shows that it is feasible to use the information entropy method to evaluate the risk of user access behavior. It can be seen from Figure 4 that the entropy of the 30 users accessing medical information is divided into two categories. The cluster center $y_1$ of the first category is 0.45, and the cluster center $y_2$ of the second category is 0.60; both are marked with red dots in the figure.
Therefore, the risk benchmark $\pi = (0.45 + 0.60)/2 = 0.525$ is obtained from $y_1$ and $y_2$. The extent to which each user's access behavior deviates from the risk baseline is shown in Figure 5. We set the risk value of users whose access entropy is below the risk baseline to 0 and treat them as legitimate users by default, while the risk value of users whose access entropy is above the risk baseline is calculated according to equation (7), as shown in Figure 6.
From Figure 6, it is easy to determine whether to grant access to a user. Users whose access entropy is lower than the risk baseline are considered risk-free, and the model grants them access rights. If the risk value is not 0, that is, the user's current access request may cause privacy security problems, the user's access request is rejected. Finally, in order to verify the validity of the model in this paper, we analyzed the relationship between correlation α and risk, as shown in Figure 7. The results show that users with high correlation α have low risk of access behavior, and users with low risk have high correlation α between the accessed information resources and work goals. Therefore, managers can predict the risks of users' access behaviors based on the correlation between the information resources accessed by users and the work objectives. In this way, administrators can dynamically formulate access control policies based on users' access conditions.
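The grant-or-reject rule described above reduces to a threshold check, sketched below. The function name and the three sample entropies are hypothetical, and treating an entropy exactly equal to the benchmark as risk-free is an assumption, since the text only covers the cases below and above the baseline.

```python
def decide(access_entropy, benchmark):
    """Grant access when entropy does not exceed the benchmark, else reject."""
    return "grant" if access_entropy <= benchmark else "reject"


# Hypothetical entropies for three users, checked against the benchmark 0.525.
for user, entropy in {"u1": 0.41, "u2": 0.525, "u3": 0.63}.items():
    print(user, decide(entropy, 0.525))
```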

Conclusion
In this paper, we sort out the current hospital workflow and management mode and establish a risk-adaptive access control model. By analyzing the correlation between the information resources accessed by users and their work objectives, we assess the risk of patient information disclosure caused by users' access behavior. The experimental results show that hospital administrators can predict the risk of privacy disclosure caused by users' access behavior, which helps them formulate scientific access control strategies. However, since the HISs of most hospitals currently differ, the privacy-preserving model proposed in this paper cannot be fully adapted to all hospitals. In the future, we will study a more compatible access control privacy protection model, provide new ideas for the hospital's resource management model, and promote the overall progress of the hospital's comprehensive management capabilities.
Data Availability
The original data of this article are covered by a confidentiality agreement signed with the hospital and are temporarily unavailable, but the processed data (data used to support the research in this article) can be partially shared publicly and are submitted with the manuscript.