PvCT: A Publicly Verifiable Contact Tracing Algorithm in Cloud Computing

Contact tracing is a critical tool in containing epidemics such as COVID-19, and researchers have carried out a large amount of work on it. However, almost all existing works assume that clients and authorities have large storage space and powerful computation capability and that clients can implement contact tracing on their own mobile devices, such as mobile phones, tablet computers, and wearable computers. With widespread outbreaks, these approaches become less robust as datasets grow larger and clients are resource-constrained. To address this limitation, we propose a publicly verifiable contact tracing algorithm in cloud computing (PvCT), which utilizes cloud services to provide storage and computation capability for contact tracing. To guarantee the integrity and accuracy of contact tracing results, PvCT applies a novel set accumulator-based authenticated data structure whose computation is outsourced, so that the client can check whether the returned results are valid. Furthermore, we provide a rigorous security proof of our algorithm based on the q-Strong Bilinear Diffie-Hellman assumption. A detailed experimental evaluation is also conducted on three real-world datasets. The results show that our algorithm is feasible within milliseconds of client CPU time and can significantly reduce the client's storage overhead from the size of the datasets to a constant 128 bytes.


Introduction
In the public health domain, contact tracing is a critical approach for identifying people who may have come into contact with individuals diagnosed with epidemic diseases such as Ebola virus disease, H1N1 influenza, or coronavirus disease 2019 (COVID-19). By tracing the contacts of infected individuals and treating them appropriately based on their test results, public health departments can contain and mitigate the community transmission of infectious diseases. Historically, contact tracing has commonly been used as an important tool to fight epidemics. For example, during the 2014-2016 outbreak of Ebola in West Africa, the World Health Organization issued guidelines on conducting contact tracing to break transmission chains of the Ebola virus [1].
Nowadays, contact tracing is also playing a critical role in all our efforts to contain the ongoing COVID-19 pandemic.
Researchers have carried out a large body of work on contact tracing [2][3][4][5][6][7][8]. For instance, in [4], the authors put forward a privacy-preserving system for contact tracing based on a secure two-party private set intersection cardinality technique. In [6], the authors develop a Bluetooth-based contact tracing system. In [7], the authors focus mainly on privacy leakage and propose a blockchain-based privacy-preserving contact tracing algorithm. The authors in [8] propose a contact tracing algorithm with access control, which guarantees that only authorized people can execute the contact tracing process. However, almost all of the existing works assume that clients and authorities have large storage space and powerful computation capability and that clients can implement contact tracing on their own mobile devices, such as mobile phones, tablet computers, and wearable computers. In these works, the authorities, such as hospitals or the Centers for Disease Control and Prevention (CDC), are required to store all the travel records of diagnosed people, and the clients are required to store all their own travel records on their own side.
However, with the spread of diseases, the number of travel records of diagnosed people grows rapidly. Meanwhile, besides classical travel records such as accurate locations and relevant timestamps, there is a special category of travel records: transportation data. It contains information such as train/flight numbers and license plates, which is also quite useful in the contact tracing process. Moreover, the more complete these travel records are, the more comprehensive the contact tracing can be. In applications, all the travel records and transportation data are collected through mobile crowdsensing (MCS) technology, which is capable of sensing and collecting data using the various mobile devices belonging to all the clients and authorities. These terminal devices have certain limitations. First, they have limited hardware resources for storage and computation.
These restrictions make those mobile devices unable to bear the large storage and computation burden incurred by the rapid growth of the travel record scale. Second, it is also difficult to synchronize all the data and coordinate all the computation among the clients if there are only terminal devices. If clients are organizations such as colleges, synchronization delay can adversely affect the accuracy and efficiency of disease control over all of their students and staff. That is, the latency can directly lead to the failure of epidemic prevention. Hence, clients and authorities need other ways to deal with the tricky situation introduced by limited mobile devices. To the best of our knowledge, there are no prior works that can be applied to such a scenario, where clients are resource-constrained and efficient management is required. In this paper, considering that, with the cloud computing services provided by cloud service providers (CSPs) such as AWS and Azure, a client can pay for the resources he/she lacks and release them once the work is finished, we resort to CSPs for storage and computation assistance.
Nevertheless, introducing CSPs brings up new issues: (1) CSPs can be compromised by a malicious adversary [9]; (2) even if CSPs honestly follow the clients' rules, there still exist various problems such as program malfunction, data loss, or other unintended mistakes. Therefore, CSPs cannot be fully trusted, and there is no guarantee that the returned results are correct and complete. However, in the contact tracing scenario, the accuracy and integrity of the returned result are of vital importance and sometimes concern tens of thousands of lives. Assume that someone is a close contact of a COVID-19 diagnosed person, but the CSPs return a false negative contact tracing result indicating that he is not. In this case, not only can he/she not obtain timely treatment, but the people who are close contacts of him/her can no longer be tracked. This hinders the process of containing the pandemic.
In this paper, we aim to solve the problem of how to achieve both accuracy and integrity of contact tracing results in an untrusted cloud computing setting. Currently, the mainstream method (if not the only method) to satisfy the above requirements is to take advantage of verifiable computation techniques. A possible solution is to utilize general verifiable computation schemes [10,11] based on techniques such as succinct noninteractive arguments of knowledge (SNARKs). However, almost all existing SNARK algorithms are too complicated to be deployed in practice [12]. Thus, in this paper, we adopt another kind of technique, which enables ad-hoc verifiable computation through constructing an authenticated data structure (ADS) [13,14]. Briefly speaking, an ADS is a data structure whose computation can be outsourced to an untrusted server such that the client can check whether the returned result is valid.
There are some ADS-based query authentication techniques studied for outsourced databases [15][16][17]. However, several main challenges make these conventional schemes inapplicable to contact tracing. First, the conventional schemes rely heavily on a single data owner signing the ADS with a secret key. In contrast, in the contact tracing scenario, there are two data owners and two phases: the query phase and the matching phase. Especially in the matching phase, only the authority can append new records of diagnosed people to its database, and a client cannot act as the authority in this phase because he does not have the authority's secret key and cannot sign its ADS. Second, a traditional ADS is constructed on a fixed dataset, and such an ADS cannot be efficiently adapted to a contact tracing scenario in which the data grow without bound as the disease spreads. Third, in conventional outsourced databases, the ADS is often regenerated to support more queries, which is difficult for clients with limited resources to implement. Thus, a more generic ADS is preferable to support the different phases that may occur in a contact tracing scenario.
To address these challenges, we propose a novel set accumulator-based ADS scheme that enables public verification over contact tracing, guaranteeing both accuracy and integrity. On that basis, we propose a novel framework called publicly verifiable contact tracing (PvCT), which employs publicly verifiable computation techniques to guarantee both the integrity and the accuracy of contact tracing results. More specifically, we provide each client and authority with an additional ADS. Based on this ADS, an untrusted CSP can construct and return a cryptographic proof, known as a verification object (VO), for clients to verify the result of contact tracing. The information flow among CSPs, clients, and authorities is illustrated in Figure 1.
To summarize, our contributions made in this paper are as follows.

The rest of the paper is organized as follows. Section 2 reviews existing studies on contact tracing and verifiable query processing. Section 3 formally defines the problem and its security model, followed by cryptographic primitives and assumptions in Section 4. Section 5 presents the detailed PvCT algorithms based on a family of verifiable set accumulators. Security proof and performance evaluation are given in Sections 6 and 7, respectively. Finally, we conclude the paper in Section 8.

Contact Tracing Algorithms.
Due to the rapid spread of the COVID-19 pandemic and the importance of contact tracing, many research groups have proposed algorithms to improve contact tracing. Some of the algorithms rely on and expose records to a trusted third party, such as BlueTrace [6], while others take a decentralized/public-list approach: Private Kit [3] enables clients to log their own information (like locations) and can help the authority contain an epidemic outbreak; Apple and Google [18,19] have made joint efforts to support privacy-preserving contact tracing by inferring linkages; Epione [4] provides end-to-end privacy-preserving contact tracing; and the privacy-sensitive protocols and mechanisms of [20] store all personal data locally on the phone, with publishing/uploading the data being voluntary for users. Although contact tracing has been intensively studied, as mentioned above, there are still no existing works that take resource-constrained clients into consideration. In other words, none of these works can be applied in the cloud computing scenario. Table 1 provides a comparison of different contact tracing algorithms with respect to accuracy/integrity properties, client storage cost, and verifiability, where N is the total number of contact tracing records; all of these properties are important for verifiable contact tracing in the cloud computing scenario.

Verifiable Query
Processing. Plenty of verifiable query processing algorithms have been studied to ensure the integrity of query results against an untrusted service provider (such as [15][16][17][21][22][23]). Most of the existing works focus on outsourced databases, and there are two basic approaches: enabling general queries using arithmetic/Boolean circuit-based verifiable computation schemes (SNARKs) and enabling ad-hoc queries using an authenticated data structure (ADS). Constructing efficient SNARKs and optimizing their implementation is a very active area of research [10][11][12][24][25][26][27][28]. Pinocchio [12] utilized quadratic arithmetic programs to support arbitrary computation, but at very high and occasionally impractical overhead. Moreover, its preprocessing computation overhead is difficult to amortize if a new one must be conducted for each program. To remedy this issue, much follow-up work has been proposed; for example, Xie et al. [11] recently proposed a zero-knowledge proof system in which the preprocessing time depends only on the size of the related circuit, irrespective of the circuit type. The ADS-based approach is more efficient than the above SNARKs, as it is tailored to ad-hoc queries; our proposed algorithm belongs to this category. An ADS is a special data structure with additional authentication properties. In most cases, it has the form of "additional authentication value + regular data structure," so that the computation of the corresponding regular data structure can be outsourced to an untrusted server and the client can check whether the returned result is valid. Two basic techniques are commonly utilized to serve as an ADS: digital signatures and the Merkle Hash Tree (MHT). Digital signatures employ asymmetric cryptography to verify the authenticity of digital messages.
To support verifiable queries, the digital-signature approach requires the data owner to sign every data record with his secret key; the verifier (client) can then use the owner's public key to verify the authenticity of a value and its signature. Hence, it cannot scale up to large datasets [17]. MHT, on the other hand, demands only one signature, on the root node of a hierarchical tree [29]. Each entry in a leaf node is assigned the hash digest of a data record, and each entry in an internal node is assigned a digest derived from its child nodes. The data owner signs the root of the MHT, which can then be used to verify any subset of data records. MHT has been widely adapted to various index structures, such as the authenticated prefix tree for multiresource datasets [16] and the Merkle B-tree for relational data [21]. However, so far, no work has considered the integrity issue for verifiable contact tracing over cloud computing.
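For readers unfamiliar with MHTs, the following minimal Python sketch illustrates the idea described above: only the root needs to be signed, and any record can be verified against it with a logarithmic-size authentication path. The padding rule and record encoding are our own illustrative choices, not those of [16,21].

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(records):
    # Leaf = hash of record; pad to a power of two so every node has a sibling.
    leaves = [h(r.encode()) for r in records]
    while len(leaves) & (len(leaves) - 1):
        leaves.append(h(b""))
    levels = [leaves]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([h(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels  # levels[-1][0] is the root the data owner would sign

def prove(levels, idx):
    # Authentication path: the sibling digest at each level, with its side.
    path = []
    for level in levels[:-1]:
        sib = idx ^ 1
        path.append((sib < idx, level[sib]))
        idx //= 2
    return path

def verify(record, path, root):
    # Recompute the root from the leaf and the sibling digests.
    digest = h(record.encode())
    for sib_is_left, sib in path:
        digest = h(sib + digest) if sib_is_left else h(digest + sib)
    return digest == root
```

A verifier holding only the signed root can thus check any single record with a path of logarithmic length, which is why MHTs scale better than per-record signatures.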

Problem Definition
Different from common settings such as [4,5], in which the clients only require a returned result no matter whether it is authentic or not, we define the problem in the publicly verifiable contact tracing setting as follows. The clients in our system not only submit their queries and expect to receive a result but also require that the correctness of the result can be verified. Besides, the service in our scenario is provided by two cloud service providers.
As shown in Figure 2, we assume that the complete version of the data used in contact tracing is collected from each individual of a client (like the students of a college) and stored in a database DB_p on a public cloud server (such as AWS or Azure). The records of the ith individual can be modeled as a set P_i = {l_ij‖t_ij‖id_ij | j ∈ [m]}, where l_ij refers to a location the ith individual has been to, t_ij is the timestamp when the ith individual was at that location, id_ij is the identity number of the individual (a number unique among all the individuals belonging to the client), and m is the total number of records of the ith individual.
Unlike the traditional query scenario, which keeps only one database, in our scenario, to protect the privacy of diagnosed people as much as possible, we maintain a separate database called DB_s to keep the data of diagnosed people. In reality, this means that the related data of the diagnosed group is kept and obtained only through authorities, such as hospitals, the Centers for Disease Control, and other government departments. To support public query and verification, once a person is diagnosed, the relevant authority uploads the patient's locations and related timestamps D_i = ⟨l_i‖t_i⟩ into the separate database DB_s, which can be held on a different cloud server. It should be noted that, since the identity number id_i is not essential for the final contact tracing matching part, DB_s only stores the location and related timestamp of each diagnosed person. In this way, we can also shuffle the records of all the diagnosed people, which not only protects a diagnosed person's personal information without any additional computation-intensive overhead (such as encryption) but also prevents any adversary from finding out whom the records belong to. Notably, our contact tracing scheme is the only one that takes information beyond location into consideration, which differentiates it from all existing ones. If an individual takes some means of transportation such as airplanes, trains, taxis, or buses, then information about the transportation is clearly of great importance. Therefore, the relevant transportation information of the ith individual (such as flight number, train number, and stations) is stored in both databases. To enable verifiable contact tracing processing, an authenticated data structure (ADS) is constructed and embedded into each set of records uploaded by the clients or authorities.

System Model.
We now give a detailed description of our publicly verifiable contact tracing algorithm.
First, in our scenario, clients are organizations such as colleges, as described in the introduction, and they would like to check close contacts of their staff under rapid epidemic circumstances.
Then, there are four main parties in this situation: individual clients, authorities, Cloud Service Provider I (CSP_I), and Cloud Service Provider II (CSP_II). There are two phases in the whole contact tracing process: the query phase with CSP_I and the matching phase with CSP_II. We discuss the cases that may happen in these two phases separately.

Phase 1. In the query phase, a client may wish to search the records appearing in a certain time period of his choice. Specifically, a query Q is of the form Q = ⟨[t_s, t_e], id⟩, where [t_s, t_e] is a certain time window selected by the client and id is the identity number belonging to the client. As a result, CSP_I returns all records whose timestamps fall within [t_s, t_e] and whose identity number matches id.

Example 1. In a COVID-19 contact tracing process, the time period of the query in Phase 1 can be a 14-day time window counted back from the day the query is issued. A client may then issue a query q = ([2020-10-01, 2020-10-14], Alice) to find all of the records stored in DB_p from October 1st to October 14th of 2020 that are associated with the person whose ID is Alice.
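The Phase-1 query semantics above can be sketched as a simple filter. The record tuples and the helper `query` below are illustrative stand-ins; the paper's actual retrieval and verification machinery is described in Section 5.

```python
from datetime import date

# Each record mirrors l_i‖t_i‖id_i from the paper: (location, timestamp, id).
records = [
    ("X shopping mall", date(2020, 10, 1),  "Alice"),
    ("Y restaurant",    date(2020, 10, 2),  "Alice"),
    ("Z district",      date(2020, 10, 20), "Alice"),
    ("A market",        date(2020, 10, 3),  "Bob"),
]

def query(db, t_start, t_end, ident):
    # Phase-1 semantics: records inside [t_start, t_end] that belong to ident.
    return [r for r in db if t_start <= r[1] <= t_end and r[2] == ident]

# Example 1: Alice's records in the 14-day window 2020-10-01 .. 2020-10-14.
result = query(records, date(2020, 10, 1), date(2020, 10, 14), "Alice")
```

Only the first two records match: the third is outside the window, and the fourth belongs to a different identity.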

Phase 2.
After the query phase, in the matching phase, clients may want to use the staff records obtained in the query phase to find out whether those staff members are positive contacts of diagnosed people. A client may transfer a staff member's records Re to CSP_II for executing the matching process. If the intersection I between the staff member's records and the diagnosed people's records is empty, CSP_II returns a negative contact result to the client. Otherwise, CSP_II returns a positive contact result together with the intersection.

Example 2.
Assume the target staff's records are {2020/10/01‖X shopping mall, 2020/10/02‖Y restaurant, 2020/10/14‖Z district} and the records of the diagnosed people are {2020/10/01‖A market, 2020/10/03‖B restaurant, 2020/10/13‖C district}. Then, after the client sends the staff's records to CSP_II for the matching process, CSP_II finds that none of the diagnosed people has been to the same place as the staff member: the intersection between the two sets is empty, so the staff member is a negative contact. If, instead, the records of the diagnosed people were {2020/10/01‖A market, 2020/10/03‖B restaurant, 2020/10/13‖C district, 2020/10/14‖Z district}, then CSP_II would find that the staff member is a positive contact.
Additional examples can be found in Figure 2.
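A minimal sketch of the matching step in Example 2, using Python sets and `|` in place of the paper's ‖ separator. Note that in the real system the matching is done over hashed records and is accompanied by a verifiable proof (Section 5); this sketch only shows the intersection logic.

```python
# Records are "date|location" strings; '|' stands in for the paper's ‖.
staff = {"2020/10/01|X shopping mall",
         "2020/10/02|Y restaurant",
         "2020/10/14|Z district"}

diagnosed = {"2020/10/01|A market",
             "2020/10/03|B restaurant",
             "2020/10/13|C district"}

def match(staff_records, diagnosed_records):
    # Positive contact iff some date|location pair appears in both sets.
    inter = staff_records & diagnosed_records
    return ("positive", inter) if inter else ("negative", set())

verdict, _ = match(staff, diagnosed)               # first case: no overlap
verdict2, inter2 = match(staff, diagnosed | {"2020/10/14|Z district"})
```

The first call yields a negative verdict (empty intersection); the second yields a positive verdict with the shared record, mirroring the two cases of Example 2.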

Threat Model.
We consider the CSPs, the two untrusted cloud service providers in the contact tracing framework, to be the potential adversary. Due to various issues such as security vulnerabilities, program bugs, and commercial interests, the CSPs may provide an unfaithful contact tracing process, thereby returning incomplete or incorrect query and matching results. To address this kind of threat, we introduce publicly verifiable contact tracing, which enables the CSPs to prove the integrity and accuracy of query and matching results. Specifically, during the query phase, the CSP examines the ADS embedded in the records and constructs a verification object (VO) that includes the verification information of the related results. Using the VO, the client can establish the accuracy and integrity of the query and matching results. The main challenge in this model is how to design an ADS that can be easily adapted to the contact tracing framework while VOs can be efficiently constructed, incurring small bandwidth overhead and fast verification time. We address this challenge in the next few sections.

Preliminaries
This section introduces the major notations, as shown in Table 2, together with the cryptographic primitives and security assumptions used in our algorithms' design.

Bilinear Pairings.
Let G be a cyclic multiplicative group of prime order p and let g be a random generator of G. G T is also a cyclic multiplicative group of prime order p.
Then, a bilinear pairing is a map e: G × G ⟶ G_T satisfying the following conditions: (i) Bilinearity: for all u, v ∈ G and a, b ∈ Z_p, e(u^a, v^b) = e(u, v)^{ab}; (ii) Nondegeneracy: e(g, g) ≠ 1, i.e., e(g, g) generates G_T; (iii) Computability: group operations in G and the evaluation of the bilinear map e are both efficient, i.e., computable in polynomial time. For clarity of presentation, we assume, for the rest of the paper, a symmetric (Type I) pairing e. We note that our construction can be securely implemented in the (more efficient) asymmetric (Type III) pairing setting with straightforward modifications (refer to [30] for a general discussion on pairings). Our security proof is based on the q-Strong Bilinear Diffie-Hellman (q-SBDH) assumption over groups with bilinear pairings presented in [31].

Figure 2: System model of PvCT.

Security and Communication Networks

Assumption 1. (q-Strong Bilinear Diffie-Hellman assumption).
Let κ be the security parameter and pub = (p, G, G_T, e, g) be a tuple of bilinear pairing parameters. For any probabilistic polynomial-time (PPT) adversary Adv and for q being a parameter of size polynomial in κ, there exists a negligible probability neg(κ) such that, for s chosen uniformly at random from Z_p*, Pr[Adv(pub, (g, g^s, g^{s^2}, ..., g^{s^q})) = (c, e(g, g)^{1/(s+c)}) for some c ∈ Z_p] ≤ neg(κ).

Lemma 1 (see [32]). The intersection of two sets X_1 and X_2 is empty if and only if there exist polynomials Φ_1(s) and Φ_2(s) such that Φ_1(s)·P_{X_1}(s) + Φ_2(s)·P_{X_2}(s) = 1, where P_X(s) = ∏_{x∈X}(x + s) denotes the characteristic polynomial of a set X. The above result is based on extended Euclidean algorithms over polynomials and provides our essential verification process with the ability to check the correctness of an empty set intersection.
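To make Lemma 1 concrete, the sketch below checks intersection emptiness exactly the way the lemma suggests: it forms the characteristic polynomials P_X(s) = ∏_{x∈X}(x + s) over Z_p and runs the Euclidean reduction; the gcd is a nonzero constant (equivalently, the Bezout polynomials Φ_1, Φ_2 exist) precisely when the sets are disjoint. The modulus and the plain (unhidden) polynomial arithmetic are simplifications for illustration; the actual scheme performs these checks in the exponents of a bilinear group.

```python
P = 2**61 - 1  # a prime standing in for the pairing group order p

def poly_mul(a, b):
    # Coefficient lists in ascending degree, arithmetic mod P.
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out

def char_poly(xs):
    # P_X(s) = prod_{x in X} (x + s).
    poly = [1]
    for x in xs:
        poly = poly_mul(poly, [x % P, 1])
    return poly

def poly_mod(a, b):
    # Remainder of a divided by b over Z_p (b nonzero, normalized).
    a = a[:]
    inv_lead = pow(b[-1], P - 2, P)  # Fermat inverse of leading coefficient
    while len(a) >= len(b) and any(a):
        coef = a[-1] * inv_lead % P
        shift = len(a) - len(b)
        for i, bi in enumerate(b):
            a[shift + i] = (a[shift + i] - coef * bi) % P
        while a and a[-1] == 0:
            a.pop()
    return a or [0]

def disjoint(x1, x2):
    # Euclidean gcd of the characteristic polynomials; a common element
    # contributes a common root (x + s), so gcd is constant iff disjoint.
    a, b = char_poly(x1), char_poly(x2)
    while b != [0]:
        a, b = b, poly_mod(a, b)
    return len(a) == 1
```

A shared element such as 2 in {1, 2} and {2, 3} makes (2 + s) divide both characteristic polynomials, so the gcd has degree at least one and `disjoint` returns False.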

Cryptographic Hash Function.
A cryptographic hash function Hash(·) is a mathematical algorithm that takes an arbitrary-length string as input and returns a fixed-length bit string. It is a one-way function, i.e., a function that is practically infeasible to invert. Meanwhile, it is collision resistant, meaning that it is computationally infeasible to find two different messages m_1 and m_2 such that Hash(m_1) = Hash(m_2). Classic cryptographic hash functions include MD5, SHA-1, SHA-2, and SHA-3; the widely used SHA-256 belongs to the SHA-2 family.
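A brief illustration of the fixed-length property, which is what lets our construction hash every travel record to a constant-size value before accumulation (the record format and the `|` separator here are illustrative choices):

```python
import hashlib

# A travel record l‖t‖id flattened into one string ('|' stands in for ‖).
record = "2020/09/01|A laboratory|Alice"
digest = hashlib.sha256(record.encode()).hexdigest()

# SHA-256 always yields a 256-bit (64 hex character) digest,
# regardless of the input length.
long_digest = hashlib.sha256(("x" * 10_000).encode()).hexdigest()
```

Both digests have the same length even though the inputs differ enormously in size.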
Lemma 2 (see [32]). Given x_1, ..., x_n, all the coefficients of the degree-n polynomial ∏_{i=1}^{n}(x_i + s) can be computed with O(n log n) complexity.

This lemma is based on an FFT algorithm [33] that computes the DFT in a finite field, such as Z_p; we use it in our constructions for arbitrary n, performing O(n log n) field operations. A detailed proof is given in [32], so we omit it here.
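As a sanity check on Lemma 2, the short sketch below expands ∏_i (x_i + s) into its coefficient list over Z_p by repeated convolution. This naive version costs O(n^2); the FFT-based route of the lemma achieves O(n log n). The prime 97 is an arbitrary toy modulus.

```python
def expand(xs, p):
    # Coefficients (ascending degree) of prod_i (x_i + s) over Z_p.
    coeffs = [1]
    for x in xs:
        nxt = [0] * (len(coeffs) + 1)
        for i, c in enumerate(coeffs):
            nxt[i] = (nxt[i] + c * x) % p      # multiply by the constant x
            nxt[i + 1] = (nxt[i + 1] + c) % p  # multiply by s
        coeffs = nxt
    return coeffs

# (x1 + s)(x2 + s) = x1*x2 + (x1 + x2)s + s^2
assert expand([2, 3], 97) == [6, 5, 1]
```

Evaluating the returned coefficients at any point must agree with evaluating the product directly, which gives a convenient correctness check.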

Cryptographic Set Accumulators.
Our set accumulator is parameterized by a set of operations Q. For our construction, it includes the following: (1) subset ⊆ and intersection ∩: these functions take two sets as input and output a set. For intersection ∩, there are two situations: either the intersection is empty, or it is nonempty, in which case we must also take its completeness into consideration. (2) ∈ and ∉: these functions take a set as input and output a Boolean or integer value (the output can also be viewed as a set with one element). Our set accumulators are all based on bilinear pairings and the q-SBDH assumption presented above.
Inspired by [32,34,35], we give a formal definition of our set accumulators, which consist of the following PPT algorithms: (i) (sk, pk) ← KeyGen(1^κ): on input of the security parameter κ, it outputs a secret key sk and a public key pk. (ii) acc(X) ← Setup(X, pk): for a set X, it computes the accumulation value acc(X) of X. In our construction, it can be efficiently computed without knowing the secret key sk, using pk only. (iii) (π_Q, R) ← Proof Q(X_1, X_2, Q, pk): on input of a query Q, sets X_1 and X_2, and the public key pk, it returns the result R = Q(X_1, X_2) along with a proof π_Q. (iv) b ← Verify Q(acc(X_1), acc(X_2), R, π_Q, pk): on input of the accumulation values acc(X_1) and acc(X_2) of sets X_1 and X_2, a result R, a proof π_Q for the query Q, and the public key pk, it outputs a bit b. If b = 1, the verification process indicates that the query result is valid; otherwise, the returned result is invalid and the accuracy and integrity of this query cannot be guaranteed.
More elaborated constructions of the set accumulator will be given in Section 5.3.

Constructions
In the following, Section 5.1 introduces the whole contact tracing process with Case I and Case II as presented in Section 3. Then, we enrich our algorithm by taking transportation data into consideration (Section 5.2). Furthermore, we give the detailed constructions of our main cryptographic building block, the set accumulator, in Section 5.3.

ADS Construction and Verifiable Contact Tracing Process.
For simplicity, this section only takes a client's contact tracing query Q over one individual into consideration. We assume that database DB_p stores all the people's data P = {P_1, P_2, ..., P_n}, where n is the total number of people, and that database DB_s stores all the diagnosed people's data. Recall that, in the proposed framework, an ADS is generated for every individual of each client (note that the length of the time period recorded for an individual depends on the type of contagious disease, i.e., the incubation period of the disease decides the length of each period; for COVID-19, the period should be at least 14 days).
It can be utilized by the cloud service providers (CSP_I and CSP_II) to construct a verification object (VO) for each query. To this end, we extend the traditional data structure by adding an extra field, called AccDigest.
Moreover, AccDigest should have the following three desired properties to be functional as an ADS. First of all, AccDigest should be able to summarize an individual's records in a way that it can be used to construct a proof whether the result matches a query or not. Secondly, AccDigest should be able to support batching or aggregation verification of several devices of one individual or among different individuals.
Thirdly, AccDigest should be of constant size rather than growing in proportion to the number of records of an individual. Therefore, we propose to use a set accumulator as AccDigest: AccDigest = acc(X) = Setup(X, pk), where X stands for the target set we would like to aggregate. For better readability, we defer the detailed constructions to Section 5.3.

Verifiable Contact
Tracing. Given a query Q by a client and the two databases DB_p and DB_s, in the end the client needs to know whether the result is positive or negative. Recall the contact tracing process presented in Section 3; there are two phases contained in a verifiable contact tracing process, and we discuss them separately.

Query Phase.
The first phase is the Query Phase. Assume all records of one staff member of the client (like one of a college's students) are DB_c = {c_i = l_i‖t_i‖id_i | i ∈ [n]}, where n is the number of records. Before the client uploads all the records to CSP_I, for ease of generating AccDigest and for privacy concerns, he utilizes a collision-free hash function Hash to hash every record into a fixed-length value, i.e., H_DBc = {Hash(c_i) | i ∈ [n]}. Then, the client generates an accumulation value AccDigest_c = Setup(H_DBc, pk) over the set of all his hashed records. Meanwhile, the client introduces a counter to count how many records the device collects each day; for the record set D_date,j of the jth date date_j, the device has num_j = count(D_date,j) records, which is stored as additional information for our verification. If there are records of t days that the device will upload to CSP_I, then the client keeps the counters num_1, ..., num_t. After all the above setup processes, the client can issue a query Q = ⟨t_i, id_i⟩ to CSP_I to obtain all of his records. Here, t_i is the date range for which the client would like to retrieve records and id_i is the identity number of the individual the client wants to retrieve over this query. Obviously, the main challenge in this phase is how to verify both the correctness and integrity of the returned result using the corresponding AccDigest.
For instance, suppose the query the client issues is Q = ⟨2020/9/1-2020/9/15, Alice⟩; that is to say, the client would like to obtain all records of the individual whose ID is Alice over the dates from 2020/09/01 to 2020/09/15. Because our verification algorithm is independent of the retrieval algorithm (in other words, any existing retrieval algorithm such as [36] can be used in our construction without affecting the accuracy or security of our algorithm), we omit the details of the retrieval process here and only consider what happens after CSP_I finishes its retrieval and obtains the corresponding result Re. Then, we can apply Setup(Re, pk) to generate a proof π = (π_1, π_2) and use a counter to obtain the number num_re of the result set, which together form the VO for the retrieval result. Accordingly, the client first checks whether the number of records in the returned result set Re is equal to the sum Σ_{j∈[date_s, date_e]} num_j of the counters stored at the client side (from the initial day to the last day of the query Q). If the check holds, the client can then use e(AccDigest_c, π_1) =? e(π_2, g) to verify the accuracy of the result Re. The whole process of this phase is specified in detail in Algorithm 1. If the verification in this phase fails, i.e., b = 0, the whole contact tracing execution aborts. Otherwise, the verification in the query phase succeeds, and we proceed to the matching phase.
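The first, counter-based check of the query phase can be sketched as follows; the record layout and date strings are illustrative, and the subsequent pairing check is omitted since it requires the bilinear-group machinery of Section 5.3.

```python
from collections import Counter

# Per-day record counters num_j, stored client-side at upload time.
uploaded = [
    ("2020/09/01", "A laboratory"), ("2020/09/01", "B market"),
    ("2020/09/02", "C restaurant"), ("2020/09/03", "D station"),
]
num = Counter(day for day, _ in uploaded)

def count_check(result, day_from, day_to):
    # First verification step of the query phase: the number of returned
    # records must equal the sum of the stored per-day counters.
    expected = sum(c for day, c in num.items() if day_from <= day <= day_to)
    return len(result) == expected

# An honest server returns all four records for 09/01-09/03 ...
assert count_check(uploaded, "2020/09/01", "2020/09/03")
# ... while an answer that silently drops a record fails the check.
assert not count_check(uploaded[:3], "2020/09/01", "2020/09/03")
```

This check alone catches omissions but not substitutions; substituted records are caught by the pairing equation over AccDigest_c.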

Matching Phase.
The second phase is the Matching Phase. In this phase, the client issues a matching query Q = Re* (Re* here is a variation of the set Re, since the identity number id_i of each record is not required in the matching process). Before the authorities upload their records to CSP_II, they generate an accumulation value AccDigest_au = Setup(H_DBs, pk) and make this value public to all the clients for further verification.
For example, suppose the result set that the client obtained is Re* = {2020/09/01‖A laboratory, 2020/09/02‖B market, 2020/09/03‖C restaurant}, and consider the intersection I = Re* ∩ DB_s. If the intersection I is empty, it means that none of the diagnosed people has been to A laboratory, B market, or C restaurant at the same time as the client; that is, the client is a negative contact. Then, CSP_II can utilize Proof Emp(H_Re*, DB_s, pk) to generate a proof π_1 and send the result ⟨negative, π_1⟩ to the client. According to the result, the client can use Verify Emp(acc(Re*), AccDigest_au, π_1, pk) to verify whether the negative verdict by CSP_II is trustworthy. If b = 1, the verification in this phase succeeds, and the client can be sure of the negative result. Otherwise, the client will refuse to believe the negative result.
Meanwhile, if the intersection I is not empty, at least one of the diagnosed people has been to A laboratory, B market, or C restaurant at the same time as the client. Assume I = {2020/09/01‖A laboratory}; then the client is a positive contact, who is highly likely to be infected. CSP_II can apply Proof_Int(H_Re*, H_DBs, H_I, pk) to generate a proof π and send the result ⟨acc(H_I), positive, π⟩ to the client. Accordingly, the client can use Verify_Int(acc(H_Re*), AccDigest_au, acc(H_I), π, pk) to verify whether the positive judgment by CSP_II is trustworthy. If b = 1, the verification in this phase succeeds; the client can be sure of the positive result and should seek medical care as soon as possible. Otherwise, the client refuses to believe the positive result. The whole process of this phase is detailed in Algorithm 2.
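Stripped of the accumulator proofs, the decision logic of the matching phase reduces to a plain set intersection; a minimal sketch, with record strings using "||" as an illustrative date/location separator:

```python
def match(client_records, diagnosed_records):
    """Return ('negative', empty set) or ('positive', intersection):
    the client is a positive contact iff some date||location record
    appears in both the client's and the diagnosed people's sets."""
    inter = set(client_records) & set(diagnosed_records)
    if not inter:
        return ("negative", set())   # CSP_II would attach a Proof_Emp here
    return ("positive", inter)       # CSP_II would attach a Proof_Int here
```

In the full protocol, CSP_II accompanies each outcome with the corresponding proof so the client never has to trust the bare answer.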

Transportation Data.
There is a special kind of data that needs further discussion: transportation data. Assume clients or diagnosed people have used vehicles (such as airplanes, trains, buses, or taxis) for short or long trips; then data about those vehicles (such as flight/train number, bus plate, and departure and terminal stations) is significant in our contact tracing algorithm. For example, if a diagnosed person has taken an airplane, then besides the departure and terminal station information, the closest contacts of this diagnosed person are the passengers who took the same flight; in other words, the flight number is essential for the matching process. As for trains and other public transport such as buses or MRT, there may be multiple stations rather than just an initial and a terminal station. Therefore, the information about intermediate stations is also important in our contact tracing process. To conceptualize this, we analyze two possible cases of transportation datasets that may occur in the matching process, as shown in Figure 3.
In the first case, shown in Figure 3(a), we assume that the client got on the bus at the initial station and got off at middle station 1, while a diagnosed person got on the same bus at middle station i (i > 1). Then, mathematically, there exist two sets of station information: the first set, Trans_c, runs from the initial station to middle station 1, and the other, Trans_d, runs from middle station i to the terminal station. Since i > 1, the intersection I* = Trans_c ∩ Trans_d is empty; that is, the client is a negative contact. In the second case, shown in Figure 3(b), we suppose that the client got on the bus at the initial station and got off at middle station i (i > 1), while a diagnosed person got on the same bus at middle station 1. As in Case 1, there exist two sets: the first set, Trans_c, runs from the initial station to middle station i, and the other runs from middle station 1 to the terminal station. It is easy to see that the intersection I* = Trans_c ∩ Trans_d is not empty; that is, the client is a positive contact.
Based on the above analysis of the different cases that may occur with transportation data, some additional precomputation is required on both the client and authority sides. Meanwhile, in the VO construction, corresponding proofs for transportation data must also be generated.
First, in all circumstances, both the client and the authorities need to generate accumulation values of their transportation datasets through AccDigest_Tran = Setup(Tran, pk) for further verification. Meanwhile, there is no need to verify the transportation data separately in the query phase, because the integrity check of the whole dataset also covers the transportation data.
ALGORITHM 2: The matching phase of verifiable contact tracing over an individual.
Then, when we execute the matching process, as analyzed above, CSP_II has to check whether there is an intersection between the transportation datasets of the client and the diagnosed people. As described before, the first case is that I* = Tran_c ∩ Tran_d is empty. Mathematically, proving that the intersection of these two sets is empty is equivalent to proving that the off station OffSta of the client is not a member of the transportation dataset of the diagnosed people. We use Proof_NoM(H_OffSta, H_Tran_d, pk) to generate a corresponding proof π*, which is sent to the client along with a negative contact tracing result. Accordingly, the client can use Verify_NoM(AccDigest_Tran_d, H_OffSta, π*, pk) to verify the negative result. The second case is that I* = Tran_c ∩ Tran_d is not empty. Similarly, proving that the intersection of these two sets is not empty is equivalent to proving that the off station OffSta is a member of the station set Tran_d of the diagnosed people. Then, we use Proof_Mem(H_OffSta, H_Tran_d, pk) to generate a corresponding proof π*, which is sent to the client along with a positive contact tracing result. Accordingly, the client can use Verify_Mem(AccDigest_Tran_d, H_OffSta, π*, pk) to verify the positive result. The complete process is shown in bold print in Algorithms 1 and 2.
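The two bus cases above amount to an interval-overlap test on station positions; a short sketch under the assumption (illustrative, not from the paper) that stations are numbered consecutively along the route, with 0 as the initial station:

```python
def ride_overlap(client_on, client_off, diag_on, diag_off):
    """Stations are numbered along the route. The client and a diagnosed
    person shared part of the ride, i.e., Trans_c ∩ Trans_d is nonempty,
    iff their [on, off] station intervals intersect (a shared boundary
    station counts, since both were at that station)."""
    return max(client_on, diag_on) <= min(client_off, diag_off)
```

Case 1 (client rides stations 0..1, diagnosed boards at station i > 1) yields no overlap and thus a negative contact; Case 2 (client rides 0..i, diagnosed boards at station 1) yields an overlap and a positive contact.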

Construction of Set Accumulators.
We now discuss a possible construction of the accumulator that can be used in Section 5.1.
Inspired by [32], we present a construction based on bilinear pairings and the q-SBDH assumption. It consists of the following algorithms.
(i) (sk, pk) ⟵ KeyGen(1^κ): let (p, G, G_T, e, g) be a bilinear pairing. Randomly choose s from Z_p^*. Then, output a secret key sk = s and pk = {g^{s^i} : i ∈ [q]}. (ii) acc(X) ⟵ Setup(X, pk): for a set X = {x_1, x_2, . . . , x_n}, its accumulation value is acc(X) = g^{Ω(X)} = g^{∏_{x_i∈X}(x_i+s)}. Owing to the properties of polynomial interpolation with FFT, it can be computed efficiently without knowing the secret key s.
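To make the algebra concrete, one can model each group element by its discrete log, so that acc(X) becomes the product ∏(x + s) mod p and the pairing e(g^a, g^b) = e(g,g)^{ab} becomes a field multiplication. The sketch below follows this toy model (`ORDER` and the trapdoor `S` are arbitrary illustrative values); it reproduces the verification identities only and has no security, since a real verifier never knows s:

```python
# Toy discrete-log model of the accumulator: an element g^a is
# represented by the exponent a itself.
ORDER = (1 << 61) - 1   # stand-in for the prime group order
S = 123456789           # the trapdoor s kept secret by KeyGen

def acc(X):
    """Exponent of acc(X) = g^{prod_{x in X}(x + s)}."""
    e = 1
    for x in X:
        e = e * (x + S) % ORDER
    return e

def pair(a, b):
    """Exponent (base e(g,g)) of the pairing e(g^a, g^b)."""
    return a * b % ORDER
```

In particular, pairing the accumulators of two disjoint sets multiplies their characteristic products, exactly the identity the proof protocols below exploit.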
For clarity, we split the construction into four core proof/verify protocols, each meeting a different query requirement, presented case by case as follows.

Subset
(i) π_1 ⟵ Proof_Sub(X_1, X_2, pk): given two sets (X_1, X_2) and the public key pk, to prove that X_1 is a subset of X_2, i.e., X_1 ⊆ X_2, we compute π_1 = acc(X_2 − X_1) = g^{∏_{x_j∈X_2−X_1}(x_j+s)}. (ii) b ⟵ Verify_Sub(acc(X_1), acc(X_2), π_1, pk): the client verifies the equation e(acc(X_1), π_1) =? e(acc(X_2), g). This equation holds if and only if X_1 is a subset of X_2. In other words, if π_1 is verified as correct, the client is assured that X_1 ⊆ X_2; then, output b = 1. Else, output b = 0.
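The subset protocol can be traced through a toy discrete-log model of the accumulator (elements represented by their exponents, pairings by multiplication mod the group order; illustration only, with no security, since the trapdoor `S` is visible here):

```python
ORDER = (1 << 61) - 1   # toy prime group order
S = 123456789           # trapdoor s, visible here for illustration only

def acc(X):
    """Exponent of acc(X) = g^{prod_{x in X}(x + s)}."""
    e = 1
    for x in X:
        e = e * (x + S) % ORDER
    return e

def pair(a, b):
    """Exponent of e(g^a, g^b) = e(g,g)^{ab}."""
    return a * b % ORDER

def proof_sub(X1, X2):
    # witness for X1 ⊆ X2: pi_1 = acc(X2 - X1)
    return acc(set(X2) - set(X1))

def verify_sub(acc_x1, acc_x2, pi1):
    # e(acc(X1), pi_1) ?= e(acc(X2), g); g has exponent 1
    return pair(acc_x1, pi1) == pair(acc_x2, 1)
```

The check succeeds exactly when the characteristic product of X_1 times that of X_2 − X_1 equals the characteristic product of X_2.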
Empty
(i) π_2 = (π_21, π_22) ⟵ Proof_Emp(X_1, X_2, pk): to prove that X_1 ∩ X_2 = ∅, use the extended Euclidean algorithm to find polynomials q_1(s) and q_2(s) such that q_1(s)Ω(X_1) + q_2(s)Ω(X_2) = 1, and set π_2 = (π_21, π_22) = (g^{q_1(s)}, g^{q_2(s)}). (ii) b ⟵ Verify_Emp(acc(X_1), acc(X_2), π_2, pk): the client verifies the equation e(acc(X_1), π_21) · e(acc(X_2), π_22) =? e(g, g). This equation holds if and only if the intersection of X_1 and X_2 is empty. That is to say, if π_2 is verified as correct, the client is assured that X_1 ∩ X_2 = ∅. Then, output b = 1; else, output b = 0.

Completeness
(i) π_3 = (π_31, π_32) ⟵ Proof_Com(X_1, X_2, I, pk): to prove that I contains all the common elements of X_1 and X_2, i.e., that (X_1 − I) ∩ (X_2 − I) = ∅, apply Proof_Emp(X_1 − I, X_2 − I, pk) and set π_3 to the resulting witness pair. (ii) b ⟵ Verify_Com(acc(X_1 − I), acc(X_2 − I), π_3, pk): the client verifies the equation e(acc(X_1 − I), π_31) · e(acc(X_2 − I), π_32) =? e(g, g). (6)
And, this equation holds if and only if the intersection of X_1 − I and X_2 − I is empty. In other words, if π_3 is verified as correct, the client is assured that the set I contains all the common elements of X_1 and X_2. Then, output b = 1; else, output b = 0.

Intersection
(i) π_4 ⟵ Proof_Int(X_1, X_2, I, pk): let I be the intersection of X_1 and X_2. The correctness of the set intersection operation can be expressed as the combination of the subset and completeness conditions; that is, I = X_1 ∩ X_2 holds if and only if the following two conditions hold: (1) I ⊆ X_1 and I ⊆ X_2 (subset); (2) (X_1 − I) ∩ (X_2 − I) = ∅ (completeness). According to these conditions, π_4 = (π_41, π_42, π_43) = (Proof_Sub(I, X_1, pk), Proof_Sub(I, X_2, pk), Proof_Com(X_1, X_2, I, pk)).
If the above check on the subset proofs succeeds, the client verifies the completeness condition by checking the equation e(π_41, π_431) · e(π_42, π_432) =? e(g, g), (9) where π_43 = (π_431, π_432). If the above equation holds, the client is assured that I is the correct intersection. Then, output b = 1; else, output b = 0.

Membership
(i) π_5 ⟵ Proof_Mem(x_i, X, pk): let x_i be an element of the set X, i.e., x_i ∈ X. Then, we compute π_5 = acc(X∖{x_i}) = g^{∏_{x_j∈X∖{x_i}}(x_j+s)}. (ii) b ⟵ Verify_Mem(acc(X), x_i, π_5, pk): the client verifies membership through the equation e(g^{x_i} · g^{s}, π_5) =? e(acc(X), g). If the verification succeeds, output b = 1; the client is assured that x_i is an element of the set X. Else, output b = 0.
Nonmembership
(i) π_6 = (π_61, π_62) ⟵ Proof_NoM(y, X, pk): to prove that y ∉ X, i.e., that {y} ∩ X = ∅, apply Proof_Emp({y}, X, pk) and set π_6 to the resulting witness pair. (ii) b ⟵ Verify_NoM(acc(X), y, π_6, pk): noting that acc({y}) = g^{y} · g^{s}, the client verifies the equation e(g^{y} · g^{s}, π_61) · e(acc(X), π_62) =? e(g, g). If the verification succeeds, output b = 1; the client is assured that y is not an element of the set X. Else, output b = 0.
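Both membership and nonmembership can be traced in a toy exponent-level model of the accumulator (illustration only, no security; the trapdoor `S` is visible here, whereas the real prover works from the published powers g^{s^i}). For nonmembership, the Bézout pair for {y} and X follows from dividing ∏_{x∈X}(t + x) by (t + y) with remainder r:

```python
ORDER = (1 << 61) - 1   # toy prime group order
S = 123456789           # trapdoor s, visible here for illustration only

def acc(X):
    """Exponent of acc(X) = g^{prod_{x in X}(x + s)}."""
    e = 1
    for x in X:
        e = e * (x + S) % ORDER
    return e

def pair(a, b):
    """Exponent of e(g^a, g^b) = e(g,g)^{ab}."""
    return a * b % ORDER

def proof_mem(x, X):
    # pi_5 = acc(X \ {x})
    return acc(set(X) - {x})

def verify_mem(acc_x, x, pi5):
    # e(g^x * g^s, pi_5) ?= e(acc(X), g); g^x * g^s has exponent x + s
    return pair((x + S) % ORDER, pi5) == pair(acc_x, 1)

def proof_nomem(y, X):
    # Divide prod_{x in X}(t + x) by (t + y): the remainder
    # r = prod_{x in X}(x - y) is nonzero exactly when y is not in X,
    # which yields the Bezout pair q1 = -Q/r, q2 = 1/r for ({y}, X).
    r = 1
    for x in X:
        r = r * (x - y) % ORDER
    rinv = pow(r, -1, ORDER)
    q_at_s = (acc(X) - r) * pow((y + S) % ORDER, -1, ORDER) % ORDER
    return (-q_at_s) * rinv % ORDER, rinv   # exponents of pi_61, pi_62

def verify_nomem(acc_x, y, pi6):
    # e(g^y g^s, pi_61) * e(acc(X), pi_62) ?= e(g, g)
    return (pair((y + S) % ORDER, pi6[0]) + pair(acc_x, pi6[1])) % ORDER == 1
```

If y actually belongs to X, the remainder r vanishes and no valid Bézout pair exists, which is precisely why a cheating prover cannot fake a nonmembership witness.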

Security Proof
In this section, we provide security proofs for our scheme; more specifically, proofs of security for the six set-related operations, Subset(), Empty(), Completeness(), Intersection(), Membership(), and Nonmembership(), in the accumulator setting. We first provide security proofs for two more fundamental set-related operations, namely, Set Containment and Set Disjointness, and then reduce the security of the six set-related operations in our scheme to these two.
(1) Set Containment(input_1, input_2): this operation takes a set X_1 or an element x belonging to the universe as its first input, and a set X_2 as its second input. It outputs "1" if X_1 ⊆ X_2 or x ∈ X_2 and outputs "0" otherwise. It is a generalization of Subset() and Membership(), providing a unified interface for the two. Informally speaking, whether one wants to check that a set X_1 is a subset of X_2 or that x is an element of X_2, in both cases she can use the set containment operation. If the inputs are two sets, it is equivalent to Subset(); if the first input is a single element, it is equivalent to Membership(). (2) Set Disjointness(X_1, . . . , X_k): this operation takes a group of sets as input and outputs "1" if X_1 ∩ · · · ∩ X_k = ∅ and "0" otherwise.
We then proceed to prove that if there exists an adversary A who is able to create a legal witness for an incorrect set operation result, then an algorithm can be constructed that breaks the q-Strong Bilinear Diffie-Hellman assumption. We first define our security games.

Security Game 1. q-Strong Bilinear Diffie-Hellman Game
In this game, an adversary A and a challenger C engage in an interactive process: (1) C prepares a q-SBDH instance ins = {(p, G, G_T, e, g), g^s, . . . , g^{s^q}} and sends it to A. (2) If A can return a legal pair (a, e(g, g)^{1/(a+s)}), we say that A wins this game and thus breaks the q-SBDH assumption.

Security Game 2. Valid Witness for Incorrect Set Operation Result Game
This is an abstraction of the security games for all the related set operations in our scheme, so in its description we do not refer to any concrete set operation: (1) C prepares a group of system parameters params = {(p, G, G_T, e, g), sk = s, pk = {g^{s^i} : i ∈ [q]}} (even if it sends the public part of params to the adversary A, there is no essential difference, except that A does not actually need to query C for set witnesses; this is similar to security games for general public-key encryption, where the adversary A can encrypt messages itself as well as query and get responses from the challenger C). (2) A issues witness queries on arbitrary sets X of her choice, subject to the cardinality of X being at most q. The total number of queries is also bounded by a polynomial in the security parameter, to which we only implicitly refer. (3) After the query phase, if A can return a legal witness (or several legal witnesses) for an incorrect set operation result, we say that A wins this game and thus breaks the security of our scheme.

Theorem 1. If there exists an adversary who can provide a valid witness for an incorrect set containment operation result, then there exists another algorithm that can break the q-Strong Bilinear Diffie-Hellman assumption.
Let pub = (p, G, G_T, e, g) be a tuple of bilinear pairing parameters. Given elements g, g^s, . . . , g^{s^q} ∈ G, where s is chosen uniformly at random from Z_p^*, suppose there exists a polynomial-time algorithm A that can find two sets X_1 and X_2 and a legal witness W such that X_1 ⊄ X_2 and e(acc(X_1), W) = e(acc(X_2), g). Then, we can use A to construct a polynomial-time algorithm A′ that breaks the q-Strong Bilinear Diffie-Hellman assumption.

Proof. The main idea behind the proof is that algorithm A′ simultaneously takes part in two security games, sitting between the challenger C (in the q-Strong Bilinear Diffie-Hellman security game) and algorithm A (in the Set Containment security game). It can prepare parameters for A using what it receives from C and then form its own solution to a q-Strong Bilinear Diffie-Hellman instance after some calculation on A's response: (1) Algorithm A′ first interacts with the challenger C. It receives a q-Strong Bilinear Diffie-Hellman instance to be challenged upon; w.l.o.g., we denote this instance as ins = {(p, G, G_T, e, g), g^s, . . . , g^{s^q}}. If it can find a pair (a, e(g, g)^{1/(a+s)}), it succeeds in breaking the q-Strong Bilinear Diffie-Hellman assumption. (2) Algorithm A can choose any set X it wants as a query, with the only restriction that the cardinality of that set cannot be larger than q. Suppose it chooses X, sends it to algorithm A′, and asks for the corresponding accumulation value.
(3) With the parameters in ins, algorithm A′ can easily respond to the request from A in the last step. For example, to generate acc(X) for a set X, A′ first calculates all the coefficients of the polynomial ∏_{x_i∈X}(s + x_i), denoted α_0, α_1, . . . , α_m ∈ Z_p. Then, it calculates acc(X) = g^{α_0} × (g^{s})^{α_1} × · · · × (g^{s^m})^{α_m} = g^{Σ α_i s^i} = g^{∏_{x_i∈X}(s+x_i)}. After the calculation, it sends acc(X) as the answer to A. (4) A may conduct other queries (notice that element update operations are included in this case; e.g., two queries for the sets X_1 = {x_1, x_2} and X_2 = {x_1, x_2, x_3} are equivalent to one query for an update (with value insertion) on X_1, and identical to a delete operation on X_2 as well) for the sets it wants, and A′ responds accordingly, subject to the condition that there is an upper bound on the total number of queries. (5) After the query phase, A generates two pairs (X_1, acc(X_1)) and (X_2, acc(X_2)), where X_1 = {x_{1,1}, . . . , x_{1,m}} and X_2 = {x_{2,1}, . . . , x_{2,n}} are the two sets and acc(X_1) = g^{∏_{x_{1,i}∈X_1}(s+x_{1,i})} and acc(X_2) = g^{∏_{x_{2,j}∈X_2}(s+x_{2,j})} are the corresponding accumulation values. It also generates a legal witness W such that e(acc(X_1), W) = e(acc(X_2), g), yet there is at least one element x satisfying x ∈ X_1 and x ∉ X_2. Then, it sends these values back to A′. (6) A′ computes chl = (x, [e(g, W)^{P(s)} · e(g, g)^{−Q(s)}]^{R^{−1}}) and returns chl to C. The reason why chl in Step (6) is a successful q-SBDH pair is that e(acc(X_1), W) = e(acc(X_2), g). Since ∏_{x_{2,j}∈X_2}(s + x_{2,j}) cannot be divided by s + x, we can write ∏_{x_{2,j}∈X_2}(s + x_{2,j}) = Q(s)(s + x) + R with R ≠ 0. For ease of readability, we let P(s) = ∏_{x_{1,i}∈X_1∖{x}}(s + x_{1,i}). Writing W = g^{w}, the assumed equality gives e(g, g)^{(s+x)P(s)w} = e(g, g)^{Q(s)(s+x)+R}, and hence e(g, g)^{1/(s+x)} = [e(g, W)^{P(s)} · e(g, g)^{−Q(s)}]^{R^{−1}}. That is, chl = (x, [e(g, W)^{P(s)} · e(g, g)^{−Q(s)}]^{R^{−1}}) is a legal pair that breaks the q-SBDH assumption.
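Step (3), expanding ∏_{x_i∈X}(s + x_i) into coefficients α_0, …, α_m and recombining the published powers g^{s^i}, can be demonstrated in a toy multiplicative group; `P`, `G`, and `S` below are tiny, insecure, illustrative parameters:

```python
# Toy demonstration of step (3): acc(X) is computed from the public
# powers g^{s^i} alone, without using the secret s directly.
P = 1000003                          # toy prime modulus of the group
G = 2                                # toy base element
S = 424242                           # the secret only the challenger knows
PK = [pow(G, pow(S, i, P - 1), P) for i in range(8)]   # g^{s^i}, i < 8

def coefficients(X):
    """Coefficients alpha_0..alpha_m of prod_{x in X}(t + x), mod P-1."""
    c = [1]
    for x in X:
        # multiply the polynomial c(t) by (t + x)
        c = [((c[i] if i < len(c) else 0) * x + (c[i - 1] if i > 0 else 0))
             % (P - 1) for i in range(len(c) + 1)]
    return c

def acc_from_pk(X):
    """acc(X) = prod_i (g^{s^i})^{alpha_i} = g^{prod_{x in X}(s + x)}."""
    r = 1
    for i, a in enumerate(coefficients(X)):
        r = r * pow(PK[i], a, P) % P
    return r
```

Exponents are reduced mod P − 1, which is valid since the order of any base divides P − 1 by Fermat's little theorem.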

Theorem 2. If there exists an adversary who can provide a valid witness for an incorrect set disjointness operation result, then there exists another algorithm that can break the q-Strong Bilinear Diffie-Hellman assumption.
Let pub = (p, G, G_T, e, g) be a tuple of bilinear pairing parameters. Given elements g, g^s, . . . , g^{s^q} ∈ G, where s is chosen uniformly at random from Z_p^*, suppose there exists a polynomial-time algorithm A that can find a group of k sets X_1 = {x_{1,1}, . . .}, X_2 = {x_{2,1}, . . .}, . . . , X_k = {x_{k,1}, . . .}, a group of k polynomials q_1(s), . . . , q_k(s), and k pairs of legal witnesses (W_i = g^{P_i(s)} = g^{∏_{x∈X_i}(s+x)}, F_i = g^{q_i(s)}) such that X_1 ∩ X_2 ∩ · · · ∩ X_k ≠ ∅ and ∏_{1≤i≤k} e(g^{P_i(s)}, g^{q_i(s)}) = e(g, g). Then, we can use A to construct a polynomial-time algorithm A′ that breaks the q-Strong Bilinear Diffie-Hellman assumption.
Proof. Recall the steps in the proof of Theorem 1; the only difference in this proof is that, after the query phase, A returns to A′ a group of values {(X_i, acc(X_i), W_i, F_i)}_{i∈[1,...,k]}. So, we omit those steps and go straight to the deduction of why A′ can break the q-SBDH assumption with those returned values.
Since X_1 ∩ X_2 ∩ · · · ∩ X_k ≠ ∅, we may assume that one of the common elements of these k sets is x, and we introduce the notation P_i′(s) = P_i(s)/(s + x). Thus, we have the following deduction: ∏_{1≤i≤k} e(g^{P_i′(s)}, g^{q_i(s)}) = e(g, g)^{Σ_{1≤i≤k} P_i′(s)q_i(s)} = e(g, g)^{(Σ_{1≤i≤k} P_i(s)q_i(s))/(s+x)} = e(g, g)^{1/(s+x)}. That is, chl = (x, ∏_{1≤i≤k} e(g^{P_i′(s)}, g^{q_i(s)})) is a legal pair that breaks the q-SBDH assumption. Based on Theorem 1 and Theorem 2, the security of our scheme follows.
(1) Subset: security of set containment implies its security.
(2) Empty: security of set disjointness implies its security.
(3) Completeness: let X_i′ = X_i − I; then security of set disjointness of the X_i′ implies its security.
(4) Intersection: let X_i′ = X_i − I; then security of set disjointness of the X_i′ and security of set containment of (I, X_1) and (I, X_2) imply its security.
(5) Membership: security of set containment implies its security.
(6) Nonmembership: security of set disjointness of {y} and X implies its security.
From all the above, we complete the proof of security of our scheme.

Performance Evaluation
In this section, we evaluate the performance of the PvCT framework for contact tracing processing. Three datasets are used in the experiments: Foursquare, Gowalla (see [37]), and Transportation.
First, as introduced in Sections 1 and 2, to the best of our knowledge, there are two existing "verifiable contact tracing" algorithms [7, 8]. Nevertheless, the same term "verifiability" has different meanings in our work and theirs. In their papers, "verifiability" is with respect to access control. More specifically, what needs to be proved, and what can be verified by the verifiers (some public authorities) in their works, is the fact that a person indeed has access authorization to certain contact tracing information. In other words, they focus on verifying whether clients have the authority to log in to the system and issue contact tracing queries. Our paper, however, focuses on verifying the accuracy and integrity of contact tracing query results. The two lines of work deal with orthogonal issues, and our algorithms can also be utilized in their schemes to further enhance security. Therefore, due to this essential difference, there is no need for an experimental comparison between our verifiability property and theirs.
Secondly, to deal with this special verification problem, namely, verification of contact tracing results, a possible way to guarantee both accuracy and integrity is to utilize general verifiable computation algorithms such as Pinocchio [12] and Geppetto [38]. However, as we introduced in Section 1, general verifiable computation algorithms are not practical in the contact tracing scenario. Here, we provide a detailed analysis of different metrics for general verifiable computation algorithms versus our algorithms, as follows. Proof generation: in terms of the proof generation procedure, those schemes are all based on the methodology of first translating the target function F into a corresponding arithmetic/Boolean circuit C and then converting the circuit into a Quadratic Arithmetic/Span Program (QAP/QSP); the subsequent processing is conducted on top of this preprocessing. So, the complexity of generating the related proofs in these schemes is proportional to the size of the circuit C as well as the length of the inputs. Our schemes, by contrast, take advantage of the underlying algebraic structure, not only avoiding the extra cost of introducing circuits but also having asymptotic complexity only in the length of the input. More details on the differences between general circuit-based methods and special algebraic-based methods can be found in [39].
Moreover, converting a function F to a circuit C itself takes O(n log n) time [40], which is also a substantial computational burden.
Verification: in terms of the verification procedure, it is worth noting that general verifiable computation schemes mainly focus on complicated function evaluation, such as large matrix multiplication. If these algorithms were used in the contact tracing scenario, they could only verify one record per execution. Therefore, if there are m records in the result dataset, the complexity of the verification procedure in general verifiable computation is O(m). In our algorithm, the verification complexity for a result dataset does not grow with m; it incurs only a constant cost, O(1), regardless of the number of records in the result dataset. In this way, our algorithm is more efficient and better suited to the contact tracing scenario than general VC schemes.

Performance of Set Accumulators.
We first utilize two synthetic sets to evaluate the performance of the three set accumulator operations, i.e., subset(., .), empty(., .), and intersection(., .), in terms of (i) the proof generation time, (ii) the verification time, and (iii) the proof size. We set the size of the two sets to 5,000 and select 20% to 50% of each set as the target subset or intersection set. As reported in Table 3, the proof generation time is generally longer than the verification time, but still acceptable because this part is processed on the CSP side rather than the client side. In contrast, the verification time and proof size are constant, irrespective of the size of the sets.

Verifiable Contact Tracing Performance.
We evaluate the overall performance of the publicly verifiable contact tracing algorithm on all three datasets. First of all, Table 4 shows the client's setup cost in terms of the ADS construction time and the ADS size. Although the setup time of our ADS construction is somewhat higher than that of other algorithms, it is a one-time computation whose cost is amortized over the subsequent process.
To evaluate the performance of publicly verifiable contact tracing, we measure the two phases, the Query Phase and the Matching Phase, separately. In the query phase evaluation, we vary the size of the result set from 100 to 5,000, with all results randomly selected from Foursquare, Gowalla, and Transportation. The results for the query phase are shown in Figure 4. It can be observed that the CSP CPU time is generally linear in the size of the result set, but still within 5.5 s. The client CPU time is always around 1.7 ms, which is constant; this suggests that our algorithm is robust against larger result sets. Meanwhile, because the proof size is always constant, as shown in Table 3, the whole VO size that the CSP returns to the client depends only on the size of the result set.
We next evaluate the performance of the matching phase. First, we assume the matching result is negative, i.e., the client is not a close contact of the diagnosed people. That also means the intersection between the result set and the set of diagnosed people is empty. We vary the size of the result set from 100 to 3,700. As shown in Figure 5, the CSP CPU time is proportional to the size of the result set and within 30 s. It is interesting to note that the CSP CPU time is relatively higher in the matching phase than in the query phase. This is caused by the computation of the extended Euclidean polynomial, which incurs more overhead when computing the proof on the CSP. The client CPU time is again constant, around 2.7 ms. Meanwhile, because there is no intersection between the two sets, the VO size is independent of the size of the result set and costs only 256 bytes. Then, we evaluate the performance when the matching result is positive, i.e., the client is a close contact of the diagnosed people. This covers two different situations, as described in Algorithm 2. In the first situation, the intersection I* between the transportation datasets of the client and the diagnosed people is not empty; then the CSP CPU time contains the proof generation time of both I* and I. In the second situation, the intersection I* is empty; then the CSP CPU time contains only the proof generation time of I. Since there is no difference between these two proof generation processes, we do not evaluate them separately. We vary the size of the intersection from 200 to 6,000, and this size can be viewed as the size of I or the size of I ∪ I*. As shown in Figure 6, the CSP CPU time and the size of the intersection are negatively correlated.
This is because the main computation is the extended Euclidean polynomial over the complement of the intersection. Therefore, the larger the intersection is, the faster the computation proceeds. Meanwhile, the client CPU time is again nearly constant, around 6.3 ms. The VO size depends on the size of the intersection.
There is an optimization scheme (MP*) that can be utilized in real-world applications when the client only wants to know whether the queried individual is a positive contact, not the exact intersection between the two datasets. In that case, instead of generating a proof for the whole intersection I and I*, the CSP only needs to generate a proof that one random record of the intersection belongs to both the client's record set and the diagnosed people's record set. The client can easily check whether the chosen record belongs to him, so the CSP only needs to prove that the chosen record also belongs to the diagnosed people. As shown in Figure 6, the CSP CPU time is greatly reduced compared with the original version, costing a nearly constant time of around 6.07 ms. The client CPU time is also reduced to around 1.73 ms, since the client only needs to perform a simple subset verification in which the subset contains a single record. The VO size is likewise almost constant, as the VO contains only one record (about 1 KB) and its proof (128 bytes).
Finally, we evaluate the storage overhead on the client side. Instead of storing the full version of all the records, the client only needs to store the ADS AccDigest, which costs only 128 bytes; compared with the real-world datasets Foursquare (11.8 MB), Gowalla (25.7 MB), and Transportation (46.1 MB) used in our experiments, this greatly reduces the storage burden on the client side.

Conclusion
In this paper, we study the problem of publicly verifiable contact tracing in cloud computing. To achieve both accuracy and integrity of contact tracing results, we develop a novel set accumulator-based ADS scheme that enables efficient verification and low storage overhead. Based on this building block, we propose the PvCT framework and give a detailed discussion of the different contact tracing phases. The robustness of the proposed building block is substantiated by a rigorous security proof based on the q-Strong Bilinear Diffie-Hellman assumption. Empirical results over three real-world datasets show that our algorithm is practically feasible, requiring milliseconds of client CPU time, and can reduce the storage overhead from the size of the datasets to a constant.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.