Efficient Private Set Intersection Using Point-Value Polynomial Representation

Private set intersection (PSI) allows participants to securely compute the intersection of their inputs, which has a wide range of applications such as privacy-preserving contact tracing of COVID-19. Most existing PSI protocols are based on asymmetric or symmetric cryptosystems, so key-related operations burden these systems. In this paper, we transform the problem of set intersection into the problem of finding roots of polynomials by using point-value polynomial representation, blind the polynomials' point-value pairs with a pseudorandom function for secure transportation and computation, and then propose an efficient PSI protocol without any cryptosystem. We optimize the protocol with the permutation-based hashing technique, which divides a set into multiple subsets to reduce the degree of the polynomial. The following advantages can be seen from the experimental results and theoretical analysis: (1) no cryptosystem is needed for data hiding or encryption, and thus our design provides a lightweight system; (2) for sets with fewer than $2^{12}$ elements, our protocol is highly efficient compared to the related protocols; and (3) a detailed formal proof is given in the semihonest model.


Introduction
Private set intersection (PSI) lets participants complete a computation based on their private inputs without learning any information other than the set intersection. PSI has a wide range of applications such as privacy-preserving contact tracing for infection detection [1,2], private contact discovery [3], similar document detection [4], suspects detection [5], relationship path discovery in social networks [6], and satellite collisions matching [7].
PSI has been well studied, and several cryptographic technologies have been proposed to implement it. According to the cryptographic techniques involved, PSI protocols are mainly divided into the following three categories.
(1) PSI based on the public-key technology: the main cryptographic technique was homomorphic encryption. The protocols were designed in such a way that the sender encrypted sets and the receiver performed some operations on the ciphertexts using the property of homomorphic encryption; then, the sender decrypted them by using his private key and got the intersection. With small communication complexity, these protocols were suitable for the scenario where the participants had strong computing power but the communication bandwidth was a bottleneck. However, the protocols had a higher time complexity because of using public-key cryptography.
(2) PSI based on the generic circuit: the protocols transformed any function into a garbled Boolean circuit and then completed the generic secure computation. The circuit generator encrypted each circuit gate using a double symmetric cryptosystem and generated a garbled circuit; the evaluator computed keys for the output wires by decrypting the appropriate ciphertexts without learning any intermediate values. The key technique used in the protocols was the symmetric cryptosystem. The advantage of the generic circuit approach was that it made the protocol easier to design and implement. But as a general solution, the garbled circuit could not achieve scalability, and the protocols were inefficient.
(3) PSI based on the oblivious transfer (OT) scheme: this kind of protocol introduced some variants of OT. In these protocols, elements were stored in some data structures, and the parties ran an OT for each bit of the inputs to get private outputs. Then, each party performed XOR operations with random values and its own elements. Lastly, the sender sent the results to the receiver, who locally checked the existence of its inputs. To improve efficiency, most OT variants were implemented by using the symmetric cryptosystem. Thus, these protocols had lower time complexity and communication complexity. Nevertheless, such protocols required additional key-related computations such as secret key negotiations.
From the above analysis, PSI protocols based on public-key cryptosystems suffer from two constraints: low efficiency and the need for a complicated system for private/public-key management. On the other hand, PSI protocols based on symmetric cryptosystems have higher efficiency, but negotiating or securely transferring secret keys leads to additional computations and communications. Furthermore, the secure storage of keys burdens the system. In this paper, we transform the problem of set intersection into the problem of finding the roots of polynomials by using point-value polynomial representation and propose an efficient PSI protocol without any cryptosystem.

Application Scenarios.
Our work can be applied to the following several practical scenarios.

Contact Tracing for Infection Detection.
The COVID-19 pandemic has posed an unprecedented challenge for humans. Due to the highly contagious nature of the virus, social distancing is one fundamental measure that has already been adopted by many countries. Based on the matching of location information between infected patients and regular people, contact tracing for infection detection enables users to securely upload their data to the server, and later, in case one user gets infected, other users can check whether they have been in contact with the infected user in the past. To protect users' private location information, PSI can be applied to securely compute shared location data.

Suspects Detection.
Two national law enforcement bodies have a list of suspected terrorists. Due to national laws, they may not be allowed to disclose their whole lists, even when collaborating. Using a PSI protocol, both agencies can find commonly suspected terrorists and share their information, while other relevant information will not be disclosed.

Satellite Collisions.
Different space agencies have their own orbiting satellites. In order to determine whether a pair of satellites in the same orbit may collide and to adjust the orbits appropriately, these agencies need to share more detailed information. However, each agency does not want to disclose anything other than whether the orbital information indicates a collision. Thus, it is necessary to use PSI for computing the probability of a collision among satellites without revealing their other private information.

Contributions.
We transform the problem of set intersection into the problem of finding roots of polynomials by using point-value polynomial representation and propose a new approach to PSI protocol without any cryptosystem. Then, we optimize our protocol based on the permutation-based hashing technique, which reduces the length of the stored elements and the degree of the polynomial. Eventually, our protocol and the related PSI protocols are implemented on the Linux platform. The main contributions are as follows.

A New Approach to PSI Protocol.
We propose a new approach for designing PSI protocol based on point-value polynomial representation and pseudorandom function. Firstly, we represent sets as polynomials' point-value pairs. Each party denotes its d elements $(s_1, \ldots, s_d)$ as the point-value pairs $(x_1, \rho(x_1)), \ldots, (x_n, \rho(x_n))$ of a d-degree polynomial $\rho(x)$, where $n > d$. Secondly, we blind the polynomials' point-value pairs for secure transportation and computation. Each party blinds them as $(x_1, \rho(x_1) + z_1), \ldots, (x_n, \rho(x_n) + z_n)$ by using a pseudorandom function and exchanges the blinded point-value pairs. Thirdly, we compute the sum of the two blinded polynomials' point-value pairs. Through computation and transportation, one party can get the sum of the two blinded polynomials' point-value pairs. Lastly, we can learn the polynomial by interpolation and get the intersection by computing the roots of the polynomial. With this representation, we can get the set intersection without any cryptosystem.
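The blinding step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prime `P`, the HMAC-SHA256 instantiation of the pseudorandom function, and the helper names `prf`, `blind`, and `unblind` are all assumptions made for the example.

```python
import hashlib
import hmac

P = 2**61 - 1  # illustrative prime modulus (assumption, not taken from the paper)


def prf(key: bytes, index: int, p: int = P) -> int:
    """Hypothetical PRF: HMAC-SHA256 of the index, reduced into F_p."""
    mac = hmac.new(key, index.to_bytes(8, "big"), hashlib.sha256).digest()
    return int.from_bytes(mac, "big") % p


def blind(pairs, key, p=P):
    """Map each (x_j, y_j) to (x_j, y_j + z_j) with z_j = prf(key, j)."""
    return [(x, (y + prf(key, j, p)) % p) for j, (x, y) in enumerate(pairs, start=1)]


def unblind(pairs, key, p=P):
    """Subtract the same z_j again to recover the original values."""
    return [(x, (y - prf(key, j, p)) % p) for j, (x, y) in enumerate(pairs, start=1)]


pairs = [(1, 5), (2, 9), (3, 0)]          # toy point-value pairs (x_j, rho(x_j))
blinded = blind(pairs, b"demo-key")       # what would be sent to the other party
restored = unblind(blinded, b"demo-key")  # only the key holder can undo the blinding
```

Only the y-coordinates are masked; the evaluation points stay public, which is what later allows the blinded values of two parties to be added point by point.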

Efficient Hashing PSI Protocol.
We optimize the new PSI protocol using the permutation-based hashing method, which converts the hashed elements into shorter strings without collisions and reduces the degree of the polynomials. The hashing creates a two-dimensional table $HT[b][\mathit{max}_b]$ and maps each element to its hashed bin, resulting in $b \cdot \mathit{max}_b$ stored elements, which splits an n-degree polynomial into b polynomials of degree $\mathit{max}_b$. This approach improves efficiency remarkably.

Implementation of Our Hashing PSI Protocol.
We implement our hashing protocol and other related protocols in C/C++ on the Linux platform. We use the Number Theory Library (NTL) [8] along with the GNU Multiprecision (GMP) library [9] for polynomial arithmetic. Based on the detailed experimental data, we conclude that our protocol is more efficient than public-key-based and circuit-based PSI protocols and is more efficient than OT-based PSI protocols for sets with fewer than $2^{12}$ elements.

Organizational Structure.
The related work on PSI protocols is introduced in Section 2. In Section 3, polynomial representation, hash technique, and security definition are given. In Section 4, a new approach to PSI protocol without any cryptosystem is shown. In Section 5, an optimized PSI protocol with the permutation-based hashing method is proposed. Implementation and performance analysis are presented in Section 6. Finally, conclusion and future work are provided in Section 7.

Related Work
According to the underlying cryptographic techniques, PSI protocols can be divided into the following three categories.

PSI Based on the Public-Key.
In 1986, Meadows [10] introduced a PSI protocol that could solve the problem of authentication of mutually suspicious parties. But, they revealed the cardinality of sets during the authentication. To solve this problem, an improved PSI protocol [11] was proposed.
The PSI protocol based on oblivious polynomial evaluation [12] was proposed in 2004, which used homomorphic encryption, balanced hashing, and properties of polynomials.
The client represented its elements as roots of a polynomial, used interpolation to find the coefficients of the polynomial, and sent the ciphertexts of the coefficients to the server by using ElGamal [13] or Paillier [14] encryption. In this protocol, a large polynomial degree would lead to a high cost of exponentiation in the homomorphic encryption. An extended version [15] was presented, where the client and the server used the cuckoo hashing technique to reduce the computational complexity.
In 2009, Jarecki et al. [16] showed a PSI protocol based on the composite residuosity assumption. The protocol used additively homomorphic encryption and zero-knowledge proofs to realize the pseudorandom function and then performed the intersection operation on the random values of the set. The client and the server ran a parallel oblivious pseudorandom function (OPRF) to get the intersection. However, the protocol relied on the common reference string model.
In 2010, Cristofaro et al. proposed PSI and Authorized PSI (APSI) protocols [17,18]. But, these PSI protocols revealed the client's set cardinality. To hide the client's set cardinality, Ateniese et al. [19] presented a PSI protocol that was to batch the hash value of the client. In 2012, Cristofaro et al. [20] used RSA and OPRF techniques to reduce the total cost of cryptographic operations based on Cristofaro et al.'s constructions [17,18].
In 2017, Chen et al. [21] gave a PSI protocol with a low communication complexity based on the fully homomorphic encryption technology. In 2018, Chen et al. [22] implemented an unbalanced labeled PSI protocol against malicious adversaries by using OPRF into a preprocessing phase.

PSI Based on the Generic Circuit.
The two main approaches were Yao's garbled circuits [23,24] and the Goldreich protocol [25], which replaced arbitrary functions with Boolean circuit computations. The communication overhead and the number of cryptographic operations depended on the number of nonlinear gates in the circuit. Thus, compared with most special-purpose PSI protocols, the running time and communication complexity became more prominent problems for PSI protocols based on generic secure computation.
In 2012, Huang et al. [26] proposed several Boolean circuits for PSI protocols and evaluated them based on Yao's circuit, which used homomorphic encryption and adopted various circuit optimization techniques. The main method was that the client and the server sorted the elements in their sets locally, merged them in order through the garbled circuit, and determined the equality of adjacent elements in the merged set. If they were equal, they were elements of the intersection. In 2015, Pinkas et al. [27] presented a circuit-phasing PSI protocol, which was up to 5 times faster than [26].
In 2018, Pinkas et al. [28] used a two-dimensional cuckoo hashing technique to realize a PSI based on the generic circuit, which had asymptotically better efficiency and could be extended to multiple parties. For the general assumption of linear communication, Hemenway et al. [29], based on Pinkas et al.'s construction [27], presented a simple and generic circuit-based PSI protocol in 2019.

PSI Based on the Oblivious Transfer (OT) Scheme.
In 2001, Naor et al. [30] proposed an OT protocol with asymmetric cryptographic operations, which spent expensive public-key operations when performing OT. Huberman et al. [11] used OT extensions (OTs) technology [31] to reduce expensive public-key operations by using more efficient symmetric cryptographic operations.
In 2013, Dong et al. [32] showed a PSI protocol that could process sets of up to 100 million elements. This protocol was based on the Bloom filter (BF), the garbled Bloom filter (GBF), secret sharing, and OTs. The linear complexity and high scalability of the protocol came from the effective symmetric cryptosystem and parallel processing, respectively. But, there was a problem with this protocol: in the malicious setting, the server might cause a selective failure that terminated the protocol when the client supplied a specific input. Thus, Rindal et al. [33] brought up an efficient fix using the cut-and-choose approach. Based on the method [30], Pinkas et al. [34] optimized it by replacing OTs with random OT, which did not need to save the GBF structure, but let the server and the client generate the BF structure as the input of OT.
In 2015, Pinkas et al. [27] applied the phase and permutation hashing methods, which resulted in a reduction of computation and memory. Kolesnikov et al. [35] improved Pinkas et al.'s construction based on an efficient OPRF. Subsequently, Kolesnikov et al. [36] proposed an extended version based on the literature [35], which gave a lightweight protocol. Rindal et al. [33] gave the first implementation of a PSI protocol against malicious adversaries. In 2018, Pinkas et al. [3] analyzed the existing protocols in detail and optimized the PSI protocol using OPRF and hashing techniques. A new PSI protocol was constructed by Pinkas et al. [37] in 2019, which used 2-choice hashing [38], sparse OT extension, and the polynomial slice and stream techniques to reduce the communication cost and improve the efficiency of the protocol. In 2020, Pinkas et al. [39] proposed a PSI protocol based on a probe-and-XOR of strings (PaXoS) data structure, which not only had linear communication and computational complexity, but also could safely resist a malicious adversary in the nonprogrammable random oracle model.

Representing Set with Polynomial Point-Value Pairs.
We give the transformation from operations on sets to operations on polynomials. This representation allows us to represent a set using a random point evaluation polynomial.
Definition 1. Polynomial representation of a set. Given a set $S = \{s_1, \ldots, s_d\}$, whose set cardinality is d, we define its characteristic polynomial as $\rho(x) = \prod_{i=1}^{d} (x - s_i)$, and thus every element $s_i \in S$ for $1 \le i \le d$ is a root of $\rho(x)$.
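As a small worked illustration of Definition 1, the sketch below evaluates the characteristic polynomial of a toy set over a prime field and lists a few point-value pairs. The modulus `P` and the helper names `char_poly_eval` and `point_values` are assumptions chosen for the example.

```python
P = 2**61 - 1  # illustrative prime modulus (assumption, not taken from the paper)


def char_poly_eval(elements, x, p=P):
    """Evaluate rho(x) = prod_i (x - s_i) mod p without expanding coefficients."""
    result = 1
    for s in elements:
        result = result * (x - s) % p
    return result


def point_values(elements, xs, p=P):
    """Return the point-value pairs (x, rho(x)) for the given evaluation points."""
    return [(x, char_poly_eval(elements, x, p)) for x in xs]


S = [3, 7, 11]                          # d = 3 elements
pairs = point_values(S, [1, 2, 4, 5])   # n = 4 > d evaluation points
is_root = [char_poly_eval(S, s) == 0 for s in S]  # every element of S is a root
```

For instance, $\rho(5) = (5-3)(5-7)(5-11) = 24$, which is nonzero because 5 is not in the set.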
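Here, H is a hash function $H: \{0,1\}^* \mapsto [1, b]$ that maps elements to the b bins of a hash table HT. To contain multiple elements per bin, HT is denoted as a double array. The insert function of simple hashing places an element x into the bin $HT[H(x)]$, where $\mathit{max}_b$ is introduced to represent the maximum number of elements in each bin of the hash table.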

Permutation-Based Hashing.
The permutation-based hashing technique, proposed by Arbitman et al. [40], converts the hashed elements into shorter strings that can be stored in the hash table, reducing storage space and computation complexity. An element is represented as the bit string $x = x_L \| x_R$, where $|x_L| = \log b$ and b is the number of bins in the hash table. Then, the element x gets the index $x_L \oplus H(x_R)$, where H is a random function $H: \{0,1\}^* \mapsto [1, b]$. Finally, the value stored in the bin is $x_R$. Thus, the length of the stored data is significantly reduced and efficiency is improved.
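A minimal sketch of this mapping is shown below, assuming fixed-width integer elements, a power-of-two number of bins, zero-based bin indices, and a SHA-256-based stand-in for the random function H. The names `H`, `permute_hash`, and `recover` are hypothetical.

```python
import hashlib


def H(x_r: int, log_b: int) -> int:
    """Stand-in random function: SHA-256 of x_R truncated to log_b bits."""
    digest = hashlib.sha256(x_r.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:4], "big") & ((1 << log_b) - 1)


def permute_hash(x: int, sigma: int, log_b: int):
    """Split a sigma-bit x into x_L || x_R; return (bin index, stored value x_R)."""
    r_bits = sigma - log_b
    x_l, x_r = x >> r_bits, x & ((1 << r_bits) - 1)
    return x_l ^ H(x_r, log_b), x_r


def recover(index: int, x_r: int, sigma: int, log_b: int) -> int:
    """Invert the mapping: x_L = index XOR H(x_R), so no information is lost."""
    x_l = index ^ H(x_r, log_b)
    return (x_l << (sigma - log_b)) | x_r


index, stored = permute_hash(0xBEEF, sigma=16, log_b=4)  # 16-bit element, 16 bins
original = recover(index, stored, sigma=16, log_b=4)
```

Because the pair (bin index, stored value) determines x uniquely, two distinct elements can never produce the same stored value in the same bin, even though only $|x| - \log b$ bits are kept.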

Security Definitions.
This section focuses on the security definition of PSI protocol.

Adversary.
We consider a semihonest adversary who follows the protocol specifications while trying to obtain extra information from the exchanged messages.

Functionality.
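The functionality being implemented in this paper is the set intersection functionality $f_\cap$, which takes the pair of input sets $(S^{(A)}, S^{(B)})$ and returns the intersection $S^{(A)} \cap S^{(B)}$.
Security is formalized by the simulation paradigm. The view of the party U during the execution of protocol $\pi$ on the input tuple $(S^{(A)}, S^{(B)})$ is denoted by $\mathrm{view}_U^{\pi}(S^{(A)}, S^{(B)})$, which includes his input and output, internal random coins, and messages exchanged. We say that $\pi$ privately computes $f_\cap$ if there exist polynomial-time simulation algorithms, denoted as $\mathrm{Sim}_A$ and $\mathrm{Sim}_B$, such that, for each party $U \in \{A, B\}$, the output of $\mathrm{Sim}_U$, given only party U's input and output, satisfies $\mathrm{Sim}_U(\cdot) \equiv \mathrm{view}_U^{\pi}(S^{(A)}, S^{(B)})$, where "$\equiv$" represents two views that are computationally indistinguishable.

The New PSI Protocol Based on Point-Value Polynomial Representation
The new PSI protocol is shown in Figure 1 and has the following four steps.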
(a) Setup: party A constructs a public finite field $F_p$, where p is a large prime, a pseudorandom function $\mathrm{PRF}: \{0,1\}^* \times Z_p \to F_p$ that generates pseudorandom values in $F_p$, and a vector $\vec{x}$ with $n = 2d + 1$ distinct nonzero values picked randomly from $F_p$. Then, it publishes $F_p$, PRF, and $\vec{x}$.
(b) Initialization: each party I performs the following steps:
(1) Select a dummy number $tk^{(I)}$ and compute pseudorandom values $r_i^{(I)} = \mathrm{PRF}(tk^{(I)}, i)$ for $1 \le i \le d$ and then generate a random polynomial.

Security and Communication Networks
(c) Intersection interaction: party A tries to get the sum of the two blinded polynomials' point-value pairs, whose roots are the intersection with party B. To do so, the following computations will be performed.
(3) Party A computes the blinded vector $\vec{q}$, whose elements are $q_j$ for $1 \le j \le n$.
(4) Party B removes the blinding factors $z_j^{(B)}$ for $1 \le j \le n$. Then, it sends the vector $\vec{c}$ to party A.
(d) Intersection result: party A gets the set intersection.
(1) Party A unblinds the blinding factors $z_j^{(A)}$ for $1 \le j \le n$.
(2) Party A restores the polynomial $\varphi(x)$ by interpolation on the point-value pairs $(x_j, y_j)$ for $1 \le j \le n$.
(3) Party A checks, for each of its own elements, whether the element is a root of $\varphi(x)$. If it holds, it is an element of the intersection; otherwise it is not.

Correctness of the New Protocol.
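For $1 \le j \le n$, the blinding factors $z_j^{(B)}$ and $z_j^{(A)}$ added during the interaction are removed again by party B and party A, respectively, so the values $y_j$ obtained by party A are the evaluations of $\varphi(x)$ at the points $x_j$. We can then get the polynomial $\varphi(x)$ from the point-value pairs $(x_j, y_j)$ by interpolation. From Definition 3, we can get the intersection by computing the roots of the polynomial $\varphi(x)$.
Thus, the new protocol is correct.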
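The algebra behind this correctness argument can be illustrated with a toy computation that ignores blinding and message exchange entirely. It assumes, as a stand-in for the combination used in the protocol, that $\varphi(x) = \gamma^{(B)}(x)\rho^{(A)}(x) + \gamma^{(A)}(x)\rho^{(B)}(x)$ with random d-degree masking polynomials $\gamma$ defined by random roots; the helper names and the prime `P` are likewise assumptions.

```python
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def eval_roots(roots, x, p=P):
    """Evaluate prod_i (x - r_i) mod p."""
    acc = 1
    for r in roots:
        acc = acc * (x - r) % p
    return acc


def lagrange_eval(points, x, p=P):
    """Evaluate at x the unique polynomial through the given point-value pairs."""
    total = 0
    for j, (xj, yj) in enumerate(points):
        num, den = 1, 1
        for m, (xm, _) in enumerate(points):
            if m != j:
                num = num * (x - xm) % p
                den = den * (xj - xm) % p
        total = (total + yj * num * pow(den, -1, p)) % p
    return total


def toy_intersection(set_a, set_b, p=P):
    """Recover set_a ∩ set_b from n = 2d + 1 point values of the assumed phi."""
    assert len(set_a) == len(set_b)      # both parties hold d elements
    d = len(set_a)
    xs = list(range(1, 2 * d + 2))       # fixed public points, for brevity
    gamma_a = [secrets.randbelow(p) for _ in range(d)]  # random masking roots
    gamma_b = [secrets.randbelow(p) for _ in range(d)]
    points = [
        (x, (eval_roots(gamma_b, x, p) * eval_roots(set_a, x, p)
             + eval_roots(gamma_a, x, p) * eval_roots(set_b, x, p)) % p)
        for x in xs
    ]
    # An element of set_a is reported when it is a root of the interpolated phi.
    return [s for s in set_a if lagrange_eval(points, s, p) == 0]


common = toy_intersection([3, 7, 11, 20], [7, 20, 42, 99])
```

A non-common element could only be reported if it happened to coincide with one of the random masking roots, which occurs with negligible probability for a prime of this size. The sketch says nothing about security; it shows only why roots of the summed polynomial identify the intersection.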

Efficient PSI Protocol Using Hashing
We optimize the above protocol using permutation-based hashing. At first, each party constructs a two-dimensional hash table HT, where the first dimension is the index of the hashed element and the second dimension stores the elements. Then, each party pads the second dimension with random values up to the maximum load. The permutation-based hashing makes each party break down its original set into several small subsets. Thus, it greatly reduces the degree of the polynomials and significantly improves the efficiency of the protocol. The hashing PSI protocol is shown in Figure 2, and the details include the following steps. Each party I performs the following initialization steps.
(1) Create a hash table $HT^{(I)}$.
(2) Generate $\mathit{max}_b$ pseudorandom values and construct a random polynomial.
(3) Construct a polynomial $\rho_j^{(I)}(x)$ to represent the elements in the bin $HT^{(I)}[j]$.
(4) Choose a random number $mk_j^{(I)}$ and get n pseudorandom values $z_{j,t}^{(I)}$ that are used to blind the polynomial values.
(5) Compute vectors $\vec{c}_j^{(I)}$ and $\vec{\rho}_j^{(I)}$ with values $c_j^{(I)}(x_t)$ and $\rho_j^{(I)}(x_t)$, respectively, for $1 \le t \le n$, which are used to represent the polynomials $c_j^{(I)}(x)$ and $\rho_j^{(I)}(x)$.
(d) Intersection interaction: party A tries to get the sum of the two blinded polynomials' point-value pairs, whose roots are the intersection with party B. To do so, the following computation will be performed.
(1) Party A blinds its values with the blinding factors $z_{j,t}^{(A)}$ for $1 \le j \le b$ and $1 \le t \le n$ and sends the resulting vectors to party B.
(2) Receiving party A's message, party B blinds every value with its own blinding factors, where $1 \le j \le b$ and $1 \le t \le n$.
(3) Party A computes the blinded values $q_{j,t}$. Then, it gets the blinded vectors $\vec{q} = (\vec{q}_1, \ldots, \vec{q}_b)$ and sends them to party B.
(4) Party B removes the blinding factors $z_{j,t}^{(B)}$ and returns the result to party A.
(e) Intersection result: party A gets the set intersection.
(1) Party A unblinds the blinding factors $z_{j,t}^{(A)}$ for $1 \le t \le n$.
(2) Party A restores the subpolynomial $\varphi_j(x)$ by interpolation on the point-value pairs $(\vec{x}, \vec{y}_j)$.

(3) Party A finds the elements of the intersection $S^{(A)} \cap S^{(B)}$ by computing the roots of the polynomial $\varphi_j(x)$ for each bin j, where only the actual elements in bin j, and not the random padding values, are considered.
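The effect of binning on polynomial degree can be sketched as follows. For brevity the example uses simple hashing rather than permutation-based hashing, pads with random dummies outside the 32-bit element domain, and checks roots of the other party's bin polynomial directly, which a real protocol would never expose. All names (`bin_index`, `build_table`, `bin_poly_eval`) and parameters are assumptions.

```python
import hashlib
import secrets

P = 2**61 - 1  # illustrative prime modulus (assumption)


def bin_index(x: int, b: int) -> int:
    """Stand-in hash function mapping an element to one of b bins."""
    digest = hashlib.sha256(x.to_bytes(8, "big")).digest()
    return int.from_bytes(digest[:8], "big") % b


def build_table(elements, b, max_b, p=P):
    """Hash elements into b bins and pad every bin to exactly max_b entries."""
    table = [[] for _ in range(b)]
    for x in elements:
        table[bin_index(x, b)].append(x)
    real_sizes = [len(bin_) for bin_ in table]
    if max(real_sizes) > max_b:
        raise ValueError("bin overflow: increase max_b")
    for bin_ in table:
        while len(bin_) < max_b:
            # Dummy values live above the 32-bit element domain.
            bin_.append(2**32 + secrets.randbelow(p - 2**32))
    return table, real_sizes


def bin_poly_eval(bin_, x, p=P):
    """Evaluate the max_b-degree characteristic polynomial of one bin at x."""
    acc = 1
    for s in bin_:
        acc = acc * (x - s) % p
    return acc


set_a = [3, 7, 11, 20, 42, 99]
set_b = [7, 20, 57, 64, 99, 1000]
table_a, sizes_a = build_table(set_a, b=4, max_b=8)
table_b, sizes_b = build_table(set_b, b=4, max_b=8)

common = []
for j in range(4):
    for s in table_a[j][: sizes_a[j]]:         # only the actual elements of bin j
        if bin_poly_eval(table_b[j], s) == 0:  # root of the other party's bin polynomial
            common.append(s)
```

Equal elements always land in the same bin on both sides, so each bin can be processed independently with a polynomial of degree `max_b` instead of one polynomial covering the whole set.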

Correctness of the Hashing Protocol.
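For $1 \le j \le b$ and $1 \le t \le n$, the blinding factors $z_{j,t}^{(B)}$ and $z_{j,t}^{(A)}$ added during the interaction are removed again by party B and party A, respectively, so the values $y_{j,t}$ obtained by party A are the evaluations of $\varphi_j(x)$ at the points $x_t$. The polynomials $\varphi_j(x)$ are then restored by interpolation using the point-value pairs $(x_t, y_{j,t})$. From Definition 3, the intersection $S^{(A)} \cap S^{(B)}$ can be learnt by finding the roots of the polynomials $\varphi_j(x)$ for $1 \le j \le b$.
Thus, the hashing protocol is correct.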

Security Proof.
The above hashing PSI protocol securely computes the set intersection in the presence of a semihonest adversary.

Theorem 1.
If PRF is a pseudorandom function, then the hashing PSI protocol π is secure in the presence of a semihonest adversary.

Evaluation
Implementation.
We ran our experiments on a 64-bit desktop PC running Ubuntu 18.04 with Linux 4.4.0.59. All protocols were implemented and executed using the same hardware, equipped with an Intel Core i7-7700K CPU at 3.6 GHz and 8 GB of RAM. We implemented our protocol and the related protocols [3,18,26,32,34] in the same environment setting. Our protocol and the related protocols had the same number of input elements, whose size was 32 bits. Our protocol was implemented using the Number Theory Library (NTL) along with the GNU Multiprecision (GMP) library for polynomial arithmetic.
We give the running times of our protocol and the related protocols in Table 1 and Figure 3. From them, it can be seen that our protocol is more efficient than public-key-based and circuit-based PSI protocols, and it is more efficient than OT-based PSI protocols when the set size is less than $2^{12}$.

Experimental Results.
A detailed comparison with related PSI protocols is given in Table 2. We evaluate the performance in terms of four properties: whether a cryptosystem is needed, simulation-based security, computation complexity, and communication complexity. From Table 2, our protocol enjoys the following advantages. (1) There is no need for a complicated cryptographic system in our protocol, which only uses hashing and a pseudorandom function and thus provides a lightweight system, whereas the other protocols need an asymmetric or symmetric encryption system. (2) Our protocol gives a detailed formal security proof by using the ideal/real simulation mechanism in the standard model, while [3,26,34] only show an informal security analysis. (3) The computation complexity and communication complexity of our protocol are $O(d)$, while both of [26] are $O(d \cdot \log d)$.

Conclusion and Future Work
In this paper, we proposed a new approach to PSI protocol without any cryptosystem based on point-value polynomial representation and pseudorandom function and optimized it based on hashing techniques. Our protocol achieves high performance for sets with fewer than $2^{12}$ elements. Our protocols have the constraint that both parties should have the same set degree. In the future, we will extend our approach and study PSI protocols with a lightweight client, where the server has a very large degree but the client's degree is relatively small.

Data Availability
All the pseudocodes used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.