Quasi-Identifier Recognition Algorithm for Privacy Preservation of Cloud Data Based on Risk Reidentification

Faculty of Engineering, School of Computing, Universiti Teknologi Malaysia (UTM), Johor 81310, Malaysia
Department of Computer Science, Faculty of Computer Science and Information Technology, University of Kassala, Kassala 31111, Sudan
College of Computer Science and Engineering, Taibah University, Medina, Saudi Arabia
Department of Computer Science, College of Computers and Information Technology, Taif University, P.O. Box 11099, Taif 21944, Saudi Arabia


Introduction
In the modern information age, many companies rely on external providers to process and store their data or to obtain services such as data mining. Virtually unlimited computational resources, reduced costs, freedom from maintenance, and no need to acquire specialized skills for particular services have all encouraged this shift. However, security and privacy concerns still hinder the use of the features offered by the cloud [1]. Numerous studies have shown that attackers often obtain information from third-party services or third-party clouds [2]. For example, in a security breach of Dropbox in October 2014, the attackers stole 700 user passwords to obtain the cash value of the associated Bitcoins (BTC). In 2015, the information of more than 4 million users, including names, dates of birth, addresses, e-mail addresses, phone numbers, and other sensitive data, was leaked through the TalkTalk service provider in the UK. In 2016, Time Warner, one of the largest cable television companies in the United States, announced that about 32 million user passwords and e-mail addresses had been stolen by an attacker. In 2017, more than 200 million user data records containing names, phone numbers, e-mail addresses, home addresses, and other data were disclosed through the API of the McDelivery company in India [2,3]. A recent security violation at Google showed that any server administrator who has access to secret information can easily misuse it. Worse still, an honest-but-curious server administrator can violate privacy without being discovered [4].
Three kinds of disclosure can cause privacy leakage: identity disclosure, attribute disclosure, and membership disclosure [5]. In attribute disclosure and identity disclosure, the intruder knows that the tuple of the target individual is in the released dataset and aims to acquire private/sensitive data about that individual from it [6]. The serious issues that lead to identity disclosure are quasi-identifier (QID) value linking and the attacker's background knowledge. QIDs are dataset attributes that do not distinguish an individual when each is considered separately but can identify individuals distinctly when several of them are combined [7]. For example, the attributes date of birth, gender, and ZIP code, taken together, can reidentify individuals, as stated in [8]. Reidentifying individuals by linking their QIDs is called a linking attack. Therefore, the careless publication of QIDs leads to leakage of privacy [9].
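The linking attack described above can be illustrated with a minimal sketch. The two toy tables, the column names, and the helper `link_records` are hypothetical and serve only to show how a join on the QIDs (date of birth, gender, ZIP code) can attach a name to a de-identified record.

```python
from collections import defaultdict

def link_records(released, external, qids):
    """Join a de-identified release to an external table on the QID columns.

    Returns {index of released row: name} for rows matched by exactly one
    external individual. All names here are illustrative assumptions.
    """
    index = defaultdict(list)
    for person in external:
        index[tuple(person[q] for q in qids)].append(person["name"])

    matches = {}
    for i, row in enumerate(released):
        candidates = index.get(tuple(row[q] for q in qids), [])
        if len(candidates) == 1:  # a single candidate means reidentification
            matches[i] = candidates[0]
    return matches

# Toy data (fabricated for illustration only).
released = [
    {"dob": "1970-01-02", "gender": "F", "zip": "02138", "diagnosis": "flu"},
    {"dob": "1985-06-30", "gender": "M", "zip": "02139", "diagnosis": "asthma"},
]
external = [
    {"name": "Alice", "dob": "1970-01-02", "gender": "F", "zip": "02138"},
    {"name": "Bob", "dob": "1985-06-30", "gender": "M", "zip": "02139"},
    {"name": "Carl", "dob": "1985-06-30", "gender": "M", "zip": "02139"},
]

linked = link_records(released, external, ["dob", "gender", "zip"])
print(linked)
```

In this toy case only the first released record is linked, because the second one shares its QID values with two external individuals.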
One popular practice for avoiding privacy leakage is anonymization. Anonymization can be performed through several types of transformation: removing values, changing the structure, replacing values according to a taxonomy, and combining values. Anonymization-based methods use one operation or a combination of operations to achieve an optimum level of concealment [10]. A commonly used privacy criterion for anonymization is k-anonymity, introduced by Sweeney [8]. The k-anonymization model aims to make every record in the released dataset indistinguishable from at least (k − 1) other records [1,11], and it can therefore be used to avoid linking attacks. Determining the real QIDs effectively is the primary issue for privacy-preserving methods that are based on k-anonymity or other anonymization models and seek to prevent QID linking. Most current methods either neglect this issue or determine the QIDs manually, which reduces the validity of the anonymization method and negatively affects the usefulness of the anonymized data [9]. This study aims to overcome the identity disclosure resulting from QID linking and to reduce privacy leakage by proposing a QID recognition (QIR) algorithm based on the reidentification risk rate. The proposed algorithm comprises two main stages: (1) attribute classification (or QID recognition) and (2) QID dimension identification. The algorithm uses the reidentification risk rate of all attributes and the dimension of the QIDs to determine the proper QIDs and their suitable dimension. Figure 1 shows the cause-effect diagram of privacy leakage. The dark boxes in Figure 1 mark the causes of privacy leakage addressed by the proposed QIR algorithm. As shown in Figure 1, properly identifying the QID attributes is essential to overcoming identity disclosure and thereby reducing the privacy leakage that results from QID linking.
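As a minimal sketch of the k-anonymity criterion mentioned above, the following check groups records by their QID values and verifies that every group contains at least k records. The function name `is_k_anonymous`, the column names, and the toy rows are assumptions made for illustration; this is not the anonymization procedure used in the experiments.

```python
from collections import Counter

def is_k_anonymous(rows, qids, k):
    """Return True if every combination of QID values occurs in at least k rows."""
    group_sizes = Counter(tuple(row[q] for q in qids) for row in rows)
    return all(size >= k for size in group_sizes.values())

# Toy generalized table (fabricated for illustration only).
rows = [
    {"age": "30-39", "zip": "021**", "disease": "flu"},
    {"age": "30-39", "zip": "021**", "disease": "asthma"},
    {"age": "40-49", "zip": "021**", "disease": "flu"},
]

print(is_k_anonymous(rows, ["age", "zip"], 2))  # the single 40-49 record breaks k = 2
print(is_k_anonymous(rows, ["zip"], 2))
```

The second call shows why the choice of QIDs matters: the same table satisfies the criterion for one QID set and violates it for another.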
This paper is organized into five sections. Section 2 describes the state of the art of privacy-preserving data mining (PPDM) over the cloud and presents some of the current methods and algorithms that address the issue of identifying QIDs accurately to avoid identity disclosure. Section 3 provides a detailed description of the proposed algorithm. Section 4 presents the experimental evaluation, discussion, and comparison with related work. Section 5 concludes this work.
Huang et al. [36] introduce a method that relies on a hypergraph to find a group of related views and the QID set. This method maps the group of related views into a hypergraph and includes all paths available between every two nodes instead of finding the group of related views. The weakness of this method is that the resulting QID group may include too many attributes. Further, it has high computational complexity because the common graph must be degenerated from the hypergraph.
Omer and Mohamad [37] introduce a method for selecting quasi-identifiers (QIDs) to achieve k-anonymity. Their selective and decompose algorithms nominate multiple attributes as a set S and then generate its power set P(S). Following that, the number of distinct values of each element of P(S) is computed and listed in a table. Finally, the candidate element is the element of the power set with the maximum number of distinct values. The main problem with this method lies in selecting the initial nominated set of attributes, where the accuracy of the selection depends on the user's experience [9]. Furthermore, it is impractical to generate P(S) if the number of attributes is large (e.g., more than 8).
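The power-set procedure summarized above can be sketched as follows. This is only an illustration of the idea as described in this section, not the implementation of [37]: the helper `distinct_counts`, the toy rows, and the tie-breaking rule (prefer the smallest subset among those with the maximum distinct count) are assumptions.

```python
from itertools import combinations

def distinct_counts(rows, attrs):
    """Map every non-empty subset of attrs to its number of distinct value tuples."""
    table = {}
    for size in range(1, len(attrs) + 1):
        for subset in combinations(attrs, size):
            table[subset] = len({tuple(row[a] for a in subset) for row in rows})
    return table

# Toy nominated set S = {age, sex, zip} (fabricated for illustration only).
rows = [
    {"age": 30, "sex": "F", "zip": "A"},
    {"age": 30, "sex": "M", "zip": "A"},
    {"age": 40, "sex": "F", "zip": "B"},
    {"age": 40, "sex": "M", "zip": "B"},
]

table = distinct_counts(rows, ["age", "sex", "zip"])
# Assumed tie-break: maximum distinct count first, then fewest attributes.
best = min(table.items(), key=lambda kv: (-kv[1], len(kv[0])))[0]
print(best, table[best])
```

A set of n nominated attributes yields 2^n − 1 non-empty subsets, which is why the enumeration grows quickly with the number of attributes.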
Y. J. Lee and K. H. Lee [38] examine the factors behind, and the likelihood of, an individual being reidentified from medical information through inferable QIDs. The QIDs were considered as database variables that enable the reidentification of individuals by linking them with available external information or with a specific individual. They selected five factors to form QID attributes in order to prevent patient privacy violations. The factors were selected based on their influence on the likelihood of reidentification and the possibility of inferring them from background knowledge. One disadvantage of this study is that the QIDs that can be extracted to reidentify patients' records may exceed five. Besides, the paper focused only on the reidentification of patients' records and on avoiding privacy leakage in medical records; it lacks a general method that could be used for general data publishing.
Wireless Communications and Mobile Computing
Bampoulidis et al. [7] assume that some QIDs are more important than others (e.g., in data mining/analysis) and, therefore, should be distorted as little as possible in the anonymization process. They present a tool that addresses the issue of QIDs by utilizing a local recoding algorithm for k-anonymity. The tool outperforms ARX (a data anonymization tool) in terms of dataset quality. The major problem with this method is that it depends on the user to define the QID attributes and to give a priority to each attribute; the user relies on personal experience in determining the QID attributes, which is usually not accurate.
Kaur and Agrawal [10] study the impact of QIDs on the anonymization process and suggest new aspects to consider before choosing the quasi-identifiers. The reidentification risks were examined using different QIDs, diverse parameters, and different data sample sizes. Their results showed that, when the QIDs selected for the anonymization operation are varied, the risk of reidentification increases as the number of QIDs increases and decreases when QIDs with fewer categories are used. Although it is good to take these observations into account before starting the anonymization process, they are not fixed and may change from one dataset to another.
Wong et al. [39] do not reveal the complete set of quasi-identifiers (QIDs) to the data collector before or after the data anonymization process. They consider that QIDs can be both sensitive values and identifying values, and they allow the respondents/data owners to hide sensitive QID attributes from other parties. The first issue with this method is that the QID attributes that respondents consider sensitive may contain data that are very useful in mining, so hiding them may adversely affect mining outcomes. The second issue is that, if respondents submit inaccurate data, there is no guarantee of the usefulness of the results obtained from data analysis.
Sei et al. [40] consider that some QIDs are sensitive QIDs, and they propose novel privacy models, namely, $(l_1, \ldots, l_q)$-diversity and $(t_1, \ldots, t_q)$-closeness, together with a method that can treat sensitive QIDs. Their proposed method comprises two algorithms, an anonymization algorithm and a reconstruction algorithm. Although this method can perform anonymization while preserving the quality of the data, it suffers from the same problem as the method of Wong et al. [39], because there is no effective way to determine accurately which of the QID attributes should be considered sensitive QIDs.
Victor and Lopez [41] offer a (k, n, m)-anonymity method for sensitive/private data based on k-anonymity. Graph algorithms are used to derive the QIDs, which are further improved by selecting similar QIDs based on composite and derived attributes. The set of QIDs obtained from the methods in [36,41] may include too many attributes, which increases the information loss in models based on generalization, such as k-anonymity [9].

The Proposed QID Recognition Algorithm
The QID recognition (QIR) algorithm involves two main stages to prevent privacy leakage of outsourced data. The first stage classifies the dataset attributes into quasi-identifiers (QIDs), sensitive attributes (SAs), and nonsensitive attributes (NSs); that is, each attribute in the dataset is classified into one of these groups. In the attribute classification (QID recognition) stage, the identifier attributes (IDs) are usually removed from the dataset by the data owner. The QIDs are the attributes that, when linked together, identify the individual, for example, age, gender, and ZIP code. The SAs are the attributes that reveal sensitive/private information about an individual, such as medical information, financial records, and location. The NSs are the remaining attributes in the dataset that do not fall under the previous categories, as they do not help reidentify the individual, for example, state and religion attributes. In the basic privacy models (such as k-anonymity [7-9, 11-13, 18, 28], l-diversity [40,42], and t-closeness [34,43]), the attributes of a dataset were categorized into two groups: sensitive and nonsensitive. Most recent studies, such as [9, 44-47], instead divide the dataset attributes directly into three types: QID, SA, and NS (not including identifiers). Accordingly, this study classifies the dataset attributes into the three types QID, SA, and NS (not including identifiers), using the same definition of each category as in the previous work [9, 44-47].
The second stage determines the actual dimension of QIDs that should be used in the anonymization operation to achieve the optimum case. If the set of QIDs contains too many attributes, the loss of information caused by generalization is exacerbated. Nonetheless, the minimal set of QIDs does not always imply the most appropriate privacy protection setting, because it does not consider which attributes the adversary could potentially have [37]. Therefore, a mechanism is needed that determines the appropriate dimension of the QIDs to avoid these problems; the QID dimension identification stage of the proposed algorithm performs this task. Figure 2 illustrates the general procedure of the two main stages of the QIR algorithm. The following subsections explain these two stages in more detail.

QID Recognition Stage.
In this stage, the algorithm classifies the attributes depending on the reidentification risk rate for each attribute in the dataset, and then, the risk rate of the attribute is compared to the threshold values of the classification. As shown in Figure 2, the attribute classification stage comprises four main activities. These activities include (1) dataset preprocessing, (2) computing risk rate for all attributes, (3) selecting the classification thresholds, and (4) classifying the attributes according to the selected thresholds.
In the first activity, the dataset is preprocessed, which includes filling in missing values, fixing inconsistencies in the dataset, and normalizing the data. In the second activity, the reidentification risk rate is computed using the g-distinct method [48], which is described in detail in the next subsection. In the third activity, the classification thresholds are selected based on the maximum and minimum risk of reidentification. These thresholds are denoted by α and β in this study: α represents the maximum risk of reidentification of the individual, while β represents the minimum risk of reidentification. The threshold values can be determined by the user or the data owner after the reidentification risk has been calculated for all attributes. Based on percentages of the highest and lowest attribute risk, one can choose the α value to be less than the highest risk value and the β value to be less than the lowest risk value. The nature of the data and the degree of importance of each attribute affect the selection of the threshold values, so these thresholds are adjustable and differ from one dataset to another. For instance, let the dataset $D$ contain the attributes $A_1, A_2, \ldots, A_n$, i.e., $D = \{A_1, A_2, \ldots, A_n\}$, and let β = 0.05% and α = 30%. Let $Rrisk_{A_i}$ be the reidentification risk of attribute $A_i$, and suppose $Rrisk_{A_i}$ = 35%. As $Rrisk_{A_i} > α$, $A_i$ is classified as SA. Suppose $Rrisk_{A_3}$ and $Rrisk_{A_5}$ are 23% and 0.01%, respectively; then $A_3$ is classified as QID, while $A_5$ is classified as NS. The reidentification risk rate of attribute $A_i$ measures the degree to which records can be distinguished based on this attribute. Finally, the fourth activity classifies the attributes according to the selected thresholds using rules expressed as if-else statements (see Algorithm 1, lines 27-39).
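The threshold rules in the example above can be written as a short sketch. The function name `classify_attributes` is hypothetical, and the handling of risks that are exactly equal to α or β is an assumption, since the example only covers values strictly above α, between the thresholds, and below β.

```python
def classify_attributes(risk, alpha, beta):
    """Classify attributes by reidentification risk (all values in percent).

    risk > alpha          -> "SA"  (sensitive attribute)
    beta <= risk <= alpha -> "QID" (boundary handling is an assumption)
    risk < beta           -> "NS"  (nonsensitive attribute)
    """
    labels = {}
    for attribute, value in risk.items():
        if value > alpha:
            labels[attribute] = "SA"
        elif value >= beta:
            labels[attribute] = "QID"
        else:
            labels[attribute] = "NS"
    return labels

# Risk values taken from the worked example in the text.
risk = {"A_i": 35.0, "A_3": 23.0, "A_5": 0.01}
labels = classify_attributes(risk, alpha=30.0, beta=0.05)
print(labels)
```

With α = 30% and β = 0.05%, the sketch reproduces the classification given in the example: SA for the 35% attribute, QID for $A_3$, and NS for $A_5$.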
In the following subsection, a detailed description of computing the reidentification risk rate (g-distinct) is presented. More explanation of the QID recognition stage is also presented.

g-Distinct.
The g-distinct method is adopted for computing the reidentification risk rate [48]. A person or record in a dataset is said to be unique if it has a combination of attribute values that no other person or record has. A person or record is g-distinct if its combination of attribute values matches that of g − 1 or fewer other people or records in the dataset [48]. Thus, uniqueness is the base case, 1-distinct. In general, g-distinct is the total number of subgroups with i individuals, which is computed by equation (1), where $f_n(i)$ refers to the expected number of subgroups with i individuals that can be derived from a given aggregated group and g represents the whole number of individuals in a subgroup. That is, g is associated with the g-distinct to represent the number of distinguished individuals in the subgroup. For example, 3-distinct means that three individuals have common QID characteristics out of the total number of people g in the subgroup. The sum of all g-distinct values of the individuals for a specific attribute represents the reidentification risk rate that the attribute can cause. The general risk of the whole dataset can be computed through equation (2), where b is the number of possible subgroups.
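To make the notion of g-distinct records concrete, the following sketch computes an empirical per-attribute risk as the percentage of records whose value is shared by at most g records. This is an assumed, simplified stand-in based on the verbal definition above; it counts observed groups in the data and does not implement the expected-value quantities $f_n(i)$ of equations (1) and (2). The name `g_distinct_risk` and the toy column are hypothetical.

```python
from collections import Counter

def g_distinct_risk(values, g):
    """Percentage of records whose attribute value occurs in at most g records.

    A record is treated as g-distinct when it matches g - 1 or fewer other
    records, i.e. when its value group has size <= g (empirical assumption).
    """
    group_sizes = Counter(values)
    at_risk = sum(size for size in group_sizes.values() if size <= g)
    return 100.0 * at_risk / len(values)

# Toy attribute column (fabricated for illustration only).
column = ["a", "b", "b", "c", "c", "c"]

print(g_distinct_risk(column, 1))  # only the unique record "a" counts
print(g_distinct_risk(column, 2))  # "a" and the two "b" records count
```

Under this reading, an attribute with many rare values, such as an account balance, would receive a high risk, while an attribute with a few common values would receive a low one.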
Finally, the attribute classification stage returns the reidentification risk rate for each attribute in the dataset. Based on the resulting reidentification risk rates and the threshold values α and β, the dataset attributes are classified as sensitive or nonsensitive. The outcomes of this stage are input into the QID dimension identification stage to determine the dimension of QIDs that is suitable for achieving optimal privacy requirements. The practical steps of the classification stage are given in Algorithm 1. Lines 2-16 of Algorithm 1 compute the g-distinct for all dataset attributes, while lines 18-26 calculate the reidentification risk rate based on the attributes' g-distinct. Finally, lines 28-40 classify the attributes using the reidentification risk rate of each attribute to produce three categories of attributes: QIDs, SAs, and NSs.
The importance of this stage of the proposed algorithm represented by Algorithm 1 is that it contributes to reducing the attribute disclosure resulting from linking the QID values due to a weakness/failure in defining the QID characteristics correctly. This contribution helps in minimizing the leakage of information and avoiding privacy violations.

QID Dimension Identification Stage.
This stage of the algorithm aims to determine the best dimension of QIDs, i.e., the one that achieves the optimum case. The optimum case gives high privacy with a high/reasonable percentage of preserved data quality. In other words, it has a high privacy gain (PG) with a high/reasonable nonuniform entropy (NUE). Algorithm 2 describes the implementation steps for this stage. The algorithm takes a sample of the data with the QID that has the highest reidentification risk rate. Following that, the QIR calculates the PG and NUE based on k-anonymity using equations (3) and (4). In the next step, the number of QIDs is increased, and PG and NUE are calculated again, and so on until all QIDs have been processed.
Finally, the algorithm determines the optimum case that gives high privacy with a high/reasonable percentage of preserved data quality. The best QID dimension is the set of QIDs with the optimum case. Algorithm 2 provides the executive steps of this stage; lines 5-12 implement the anonymization by k-anonymity on a sample of the dataset. It begins with the QID that has the highest reidentification risk rate. After that, the algorithm calculates the privacy gain (PG) and nonuniform entropy (NUE) through equations (3) and (4). Then, the number of QIDs is increased, and PG and NUE are calculated again, repeatedly, until all the QIDs have been processed. Lastly, in lines 13-15, the algorithm determines the best QID dimension (QidD) that achieves the optimum case to be involved in the anonymization process.
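The loop described above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the anonymization and the PG/NUE computation are abstracted into a caller-supplied `evaluate` function, and the rule used to pick the optimum case (the largest sum of PG and NUE) is an assumption, because the text only requires high privacy together with high/reasonable data quality. All names and the toy scores are hypothetical.

```python
def best_qid_dimension(qids_by_risk, evaluate):
    """Grow the QID set one attribute at a time and score each dimension.

    qids_by_risk: QID names sorted by descending reidentification risk.
    evaluate:     callable taking a list of QIDs and returning (PG, NUE) in
                  percent after anonymizing a data sample (supplied by caller).
    Returns (best dimension, list of (dimension, PG, NUE)).
    """
    results = []
    for dimension in range(1, len(qids_by_risk) + 1):
        current_qids = list(qids_by_risk[:dimension])
        pg, nue = evaluate(current_qids)
        results.append((dimension, pg, nue))
    # Assumed selection rule: maximize PG + NUE.
    best = max(results, key=lambda item: item[1] + item[2])
    return best[0], results

# Placeholder scores standing in for a real k-anonymity run (fabricated).
toy_scores = {1: (40.0, 30.0), 2: (55.0, 35.0), 3: (60.0, 10.0)}

def toy_evaluate(current_qids):
    return toy_scores[len(current_qids)]

dimension, trace = best_qid_dimension(["q1", "q2", "q3"], toy_evaluate)
print(dimension, trace)
```

A different selection rule, for example a minimum acceptable NUE followed by the highest PG, could be substituted in the `max` call without changing the loop.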
It was observed in [9] that, in most cases, data loss increases when the QID dimension is large. However, when the QID dimension is small, privacy protection is not applied optimally, because one cannot know which QIDs an attacker actually possesses [37]. Therefore, determining an appropriate QID dimension is important to reduce data loss.

Performance Measures.
Two performance evaluation measures were used in this study: the privacy gain (PG) and the nonuniform entropy (NUE). More explanation and the derivation of these measures are presented in the following subsections.

Nonuniform Entropy.
In the context of data deidentification, nonuniform entropy compares the frequencies of attribute values in the transformed dataset with their frequencies in the input dataset; it was originally introduced as a model for measuring the loss of information [51]. When a dataset D is transformed into another dataset D′, nonuniform entropy is defined by equation (4).
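As an illustration of the frequency comparison described above, the sketch below computes a nonuniform entropy loss for a single attribute in a commonly used form: for every record, the negative base-2 logarithm of the probability of its original value given its transformed value, summed over all records. This form is an assumption drawn from the general literature on entropy-based information loss; it is not claimed to match the exact normalization used in this paper, which reports NUE as a percentage. The function name and toy columns are hypothetical.

```python
import math
from collections import Counter

def nonuniform_entropy(original, transformed):
    """Unnormalized nonuniform entropy loss for one attribute.

    For each record, adds -log2 P(original value | transformed value), where
    the conditional probability is estimated from the two aligned columns.
    """
    joint = Counter(zip(original, transformed))
    generalized = Counter(transformed)
    return sum(
        -math.log2(joint[(value, group)] / generalized[group])
        for value, group in zip(original, transformed)
    )

# Toy age column and a generalized version of it (fabricated).
ages = [21, 22, 35, 38]
age_ranges = ["20-29", "20-29", "30-39", "30-39"]

print(nonuniform_entropy(ages, age_ranges))  # each record loses one bit
print(nonuniform_entropy(ages, ages))        # no transformation, no loss
```

For several attributes, the per-attribute losses would be summed; a percentage score would additionally require dividing by the loss of the fully suppressed dataset, which is left out here.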

Experimental Evaluation
In this section, the experimental evaluation of our implementation of the algorithm is presented in terms of PG and NUE. In Dataset Setup, we describe the datasets used for running the experiments and the experimental environment setup. In Experimental Results, we present the first set of experiments and provide the results of our algorithm. In Performance Benchmark and Discussion, we benchmark our algorithm against a closely related recent algorithm introduced by Omer and Mohamad [37] and discuss the results.

Dataset Setup.
Two real-life datasets from the University of California-Irvine were used in this study to demonstrate the performance of the proposed algorithms. The first is the bank direct marketing dataset [52]. The bank dataset consists of 17 attributes and 45,211 tuples and does not include any missing values. The dataset attributes are divided into three groups: (1) data of bank clients: age, job, marital status, education, default, balance, housing, and loan; in this paper, we consider these attributes because they are significant for bank clients and for reidentification purposes; (2) data related to the last contact of the current campaign; and (3) other attributes such as the campaign and days. The second dataset is the adult dataset [53], which is used as a standard for evaluating anonymization algorithms [7] and consists of 48,842 census records and 15 attributes. ARX is open-source data anonymization software introduced and developed by Prasser et al. [54]; we used it to implement the algorithms as explained in the following sections. The experiments were executed on

Input: dataset sample d, QIDs[ ], privacy parameter k.
Output: optimal dimension of QIDs.
1: QidD ⟵ dimension of QIDs
2: QidD ∈ QIDs[ ]
3: Optimal QidD ⟵ optimal dimension of QIDs
4: QidD[ ] = 0
5: For i := 1 to QIDs[ ].length do
6:   QidD[i] = QidD[ ] + QIDs[i];
7:   Anonymized data[i] = k-anonymity(d, QidD[i], k);
8:   PG[i] = Privacy gain(Anonymized data[i]);
9:   NUE[i] = Nonuniform Entropy(Anonymized data[i]);
10:   Difference[i] = PG[i] − EIL[i];
11:   i = i + 1;
12: end
13: If ((PG[ ] == max) && (NUE[ ] == max))
14:   Optimal QidD[ ] = QidD[i];
15: Return (Optimal QidD[ ]).
Algorithm 2: QID dimension identification.

Experimental Results.
The first experiment classifies the dataset attributes according to their risk rate. Figures 3 and 4 illustrate the risk rate for the bank attributes and the adult attributes, respectively.
For the bank dataset, we set α and β as α = 30, β = 0. Table 1 shows the bank attribute classification. For the adult dataset, we set α = 0.2, β = 0.01 to classify the attributes. Table 2 shows the classification of the adult dataset. Because the "balance" attribute has a risk of 52.04%, which is large compared to the other attributes, it is excluded from Figure 3 to highlight the differences between the attributes that have relatively small risk values. After the risk rate of each attribute in the dataset has been calculated, the attribute is classified according to the selected thresholds α and β, as explained in QID Recognition Stage. Tables 1 and 2 show the classification results of the bank dataset and the adult dataset, respectively, according to the classification thresholds α and β selected for each dataset. After the classification stage, the best dimension of QIDs that achieves the optimum case should be determined. In the bank dataset, the QID dimension (QidD) is four (QidD = 4), while in the adult dataset QidD is 10 (QidD = 10). For each dataset, the initial value of the QID dimension is set to one (QidD = 1) and used as input to the proposed QID dimension identification algorithm (Algorithm 2). Identification of the QID dimension begins with the initial value of QidD, which is incremented until the maximum QID dimension is reached. It also begins with a sample size equal to 10% of the dataset and k-anonymity with k = 5, and k is incremented up to k = 25 for each QidD value (the sample size is changeable). Then, the privacy gain (PG) and the nonuniform entropy (NUE) are calculated for each sample and each new QidD until QidD reaches four (QidD = 4) for the bank dataset and QidD = 10 for the adult dataset.
Finally, the proposed algorithm returns the QidD that achieves the optimum case as the best dimension to be used in the anonymization process. Table 3 shows the results of finding the best QidD for the adult dataset.
According to Table 3, we observed that QidD = 2 is the optimum case that increases the privacy gain as well as the NUE. Moreover, we can notice that the privacy level also ... In the adult dataset, the QID attributes selected by the proposed algorithm are work-class and hours-per-week (HPW). These two attributes have the highest reidentification risk; thus, they must be involved in the anonymization process (see Figure 5).
To determine the best QidD in the bank dataset, consider Table 4 and Figures 6(a)-6(c); it is clear that, when QidD = 1, the proposed algorithm achieves the optimum case, as it gives high privacy for several values of k. It can also be observed in Table 4 that NUE drops from 45.28% when k = 5 to 17.27% when k increases above 15. It is also noticeable in the bank dataset that privacy decreases as the value of QidD increases, which is normal given the level of privacy provided.

Performance Benchmark and Discussion.
To evaluate the proposed QIR algorithm, we compare it, based on k-anonymity, against the recent similar SQI algorithm [37]. The comparison was conducted in terms of privacy gain (PG) and nonuniform entropy (NUE), using multiple k values and different sample sizes of the adult dataset. In Figures 7 and 8, the privacy provided by QIR is higher than the privacy achieved by SQI, with an average improvement that exceeds 23%. Although SQI outperforms QIR in data utility, represented by NUE, at k = 26, 29, and 35, it does so with a privacy rate of 9.57%; this is considered a deficiency because QIR provides higher data utility than that, with much higher privacy, at k = 4, 6, 10, 17, and 20.
In Figures 9 and 10, it can be observed that, at 10% of the dataset and k = 10, the privacy achieved by the proposed QIR algorithm is more than double the privacy achieved by the SQI algorithm, with a slight increase in data utility; that is, the proposed QIR algorithm outperforms the SQI algorithm in terms of preserving both privacy and data utility. With a data size of 20% and k = 20, the NUE obtained by SQI and QIR is 30.27% and 31.66%, respectively, while the privacy given by SQI is 20.52% and that given by QIR is 51.82%, which is more than twice the privacy achieved by SQI. Similar results were obtained at k = 20 with data sizes of 30% and 90%. In most cases, when the data size increases, the privacy decreases and, therefore, the data utility increases. Generally, for the whole adult dataset, the results of the experiments at k = 10 and k = 20 show that the average privacy percentage provided by SQI is 10.17% with 48.62% data utility, while the average privacy percentage offered by the proposed QIR is 46.49% with 41.04% data utility. Also, for the whole adult dataset and all k values tested, the average privacy provided by SQI is 7.51% with 54.13% data utility, while the average privacy percentage achieved by QIR is 30.67% with 55.46% data utility; hence, QIR is considered the better choice for identifying the real QIDs.
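The ratios behind the comparisons quoted above follow directly from the reported percentages; the short calculation below only restates those figures and introduces no new results.

```latex
\begin{align*}
\frac{51.82}{20.52} &\approx 2.53 && \text{(privacy, QIR vs. SQI, 20\% sample, } k = 20\text{)}\\
\frac{46.49}{10.17} &\approx 4.57 && \text{(average privacy, whole dataset, } k = 10 \text{ and } k = 20\text{)}\\
\frac{30.67}{7.51} &\approx 4.08 && \text{(average privacy, whole dataset, all } k\text{)}\\
41.04 - 48.62 &= -7.58 && \text{(NUE, QIR minus SQI, } k = 10 \text{ and } k = 20\text{, percentage points)}\\
55.46 - 54.13 &= 1.33 && \text{(NUE, QIR minus SQI, all } k\text{, percentage points)}
\end{align*}
```

Read together, the averages indicate that QIR gives roughly four times the privacy of SQI, at a cost of about 7.6 percentage points of utility for k = 10 and k = 20 and with a small utility gain when all tested k values are averaged.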

Conclusions
Accurate identification of QIDs is important for the success and validity of privacy-preserving methods for outsourced data that seek to avoid privacy leakage caused by QID linking. This paper aims to classify dataset attributes before the anonymization process and to determine the proper QIDs that should be involved in the anonymization operation. A new algorithm is proposed that calculates the reidentification risk of dataset attributes in order to classify them into SAs, QIDs, and NSs based on prespecified thresholds. In addition to attribute classification, the algorithm determines the actual dimension of QIDs required in the anonymization process, depending on the amount of privacy provided versus the loss of data quality. The experimental results indicated that the proposed identification algorithm performs better than other works in terms of the privacy provided against data utility. Although the proposed algorithm is suitable for use with any method or privacy model concerned with QID attributes, in this paper, we have relied on the k-anonymity model.

Data Availability
All data used in this article are available in the machine learning repository at the University of California, Irvine (UCI): https://archive.ics.uci.edu/ml/datasets/.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.