Secure Testing for Genetic Diseases on Encrypted Genomes with Homomorphic Encryption Scheme

The decline in genome sequencing costs has widened the population that can afford sequencing and has also raised concerns about genetic privacy. Kim et al. present a practical solution to the scenario of secure searching of gene data on a semitrusted business cloud. However, there are three errors in their scheme, and we make three improvements to correct them. (1) They truncate the variation encodings of genes to 21 bits, which causes the LPCE error, so that more than 5% of the entries in the database cannot be queried completely. We decompose these large encodings into 44-bit components and process the components separately to avoid the LPCE error. (2) We abandon the hash function used in Kim's scheme, which may cause the HCE error with a probability of $2^{-22}$, and decompose the position encoding of a gene into three parts with the basis $2^{11}$ to avoid the HCE error. (3) We analyze the relationship between the parameters and the CCE error and specify the condition that the parameters need to satisfy to avoid the CCE error. Experiments show that our scheme can search all entries, and the probability of a searching error is reduced to less than $2^{-37.4}$.


Introduction
Genes are the intrinsic nature of human health. All human life activities and physiological phenomena are directly related to genes. Genome data can be used for a wide range of applications including healthcare, biomedical research, and forensics [1]. Gene sequencing technology is the core of the Human Genome Project; it helps humans to better understand the life activities of cells and organisms, and it is also of great significance for the prevention and treatment of some diseases, such as cancer and genetic diseases.
Advances in high-throughput technologies have made it increasingly affordable to sequence the human genome in various settings, ranging from biomedical research to healthcare [2]. Relevant data show that in 2000 the cost of whole genome sequencing for a human was nearly $3 billion, and by 2015 the cost of sequencing a single genome had fallen to less than $1,000; the sequencing costs for certain sites on the genome are lower still.
The decline in genome sequencing costs has widened the population that can afford gene sequencing and has also raised concerns about genetic privacy. Genetic data can be widely used in healthcare, biomedical research, identification, and other fields, and it is strongly tied to personal privacy. More and more businesses and individuals hand the processing of genetic data to cloud services, but current commercial cloud servers do not fully guarantee the privacy and security of genetic data. This raises concerns about the privacy of sensitive information, since the data is stored in external, off-premise data centers. In the health sector in particular, sensitive personal patient records need to be kept confidential [3].
There are a number of technical solutions that have been proposed to protect genome privacy, and existing studies can be categorized into two groups [4]: (i) protecting the computation process in genome data analysis [5][6][7] and (ii) protecting the genome data before computation [8,9] or research outcomes after computation [10].
Protecting genetic privacy, so that a user's genetic data cannot be compromised by unauthorized users or organizations, is an urgent problem. To mitigate the privacy risks inherent in storing and computing on sensitive data, cryptography offers a potential solution in the form of encryption [11]; only the legitimate data owner can access the data by decrypting it with their private decryption key.
However, the computation and analysis of genetic data sometimes need to be carried out in the cloud, because personal computing power is limited and genetic diagnosis algorithms may be patented; the cloud server then analyzes the user's genetic data and returns information relevant to diagnosis and treatment. Nevertheless, traditional cryptographic schemes do not support computation on the ciphertexts stored in the cloud, so the data center cannot compute on them without the decryption key.
Homomorphic encryption allows computation on encrypted data without knowledge of the secret key, and decrypting the result yields the same value as applying the corresponding operations to the plaintext. In 2009, Gentry proposed the first fully homomorphic encryption (FHE) scheme and described a blueprint for FHE [12]. Since then, many improvements to FHE have been proposed based on Gentry's work, such as [13][14][15][16][17].
Homomorphic encryption-based methods which support secure genome data computation have been studied. Cheon et al. [5] studied how to calculate the edit distance of encrypted gene data homomorphically. Yasuda et al. [18] described how to compute multiple Hamming distance values using the LNV scheme [19] on encrypted data. Graepel et al. [20] and Bos et al. [3] applied HE to machine learning and described how to privately conduct predictive analysis based on an encrypted learned model. Lauter et al. [7] gave a solution to privately compute the basic genomic algorithms used in genetic association studies.
To achieve safe genetic data analysis, the iDASH (integrating Data for Analysis, Anonymization, and SHaring) National Center has released annual security challenges regarding genetic privacy protection since 2014. In 2016, the challenge of testing for genetic diseases on encrypted genomes (secure outsourcing) was published: calculate the probability of genetic diseases by matching a set of biomarkers to encrypted genomes stored in a commercial cloud service. The requirement is that the entire matching process (considering only an exact match for each variation) be carried out using homomorphic encryption so that no trace is left behind during the computation.
For the challenge published by iDASH, Kim et al. give a practical solution, referred to as [KSC17], which uses homomorphic encryption to encrypt the entire gene database as a polynomial over a ring, thus solving the challenge of testing for genetic diseases (secure outsourcing) to a certain extent [21].
The application scenario of this paper is shown in Figure 1. There are three parties involved in this scenario: the user (a hospital or medical institution that holds the patient's gene data), the semitrusted commercial cloud service, and the data owner (the research institute that holds the genetic variation database). The purpose of this system is to determine whether a patient's gene data is present in the gene variation database.
System Initialization. The data owner encrypts the gene variation database and uploads the ciphertexts to the commercial cloud server. Then, the user interacts with the cloud server to complete the testing process.
Step 1: the user encrypts the patient's gene data and uploads the ciphertexts to the commercial cloud server.
Step 2: the cloud homomorphically searches for the user's gene data in the database and generates a ciphertext of the search result.
Step 3: the cloud sends the ciphertexts to the user.
Step 4: the user decrypts the ciphertexts and concludes whether the patient's gene data is present in the gene variation database.
The source code of our implementation is available on github https://github.com/lonyliu/genetest.
Our Contributions. The contributions of this paper focus on optimizing the design and improving the correctness of the scheme. Through the analysis of [21] and its related code, we found three types of query errors in [KSC17], called losing of partial coefficient error (LPCE), hash collision error (HCE), and coefficient combination error (CCE), and made the following improvements.
(1) The gene data is encoded by a prefix code so as to detect more entries in the gene database with fewer bits than [KSC17].
(2) Correcting the LPCE error: [KSC17] truncates the variation encodings of genes to 21 bits, which loses part of the coefficient; thus more than 5% of the entries in the database cannot be queried completely. In this paper, we decompose the encodings of gene variations into 44-bit components, then optimize, encrypt, and query the components separately. As a result, all the entries in the database can be queried effectively.
(3) Correcting the HCE error: [KSC17] applies a hash function in a way that may cause the HCE error with a probability of $2^{-22}$. In this paper, we abandon the hash function, at the cost of enlarging the database ciphertext by half, thus avoiding hash collisions.
(4) Correcting the CCE error: [KSC17] cannot distinguish the different groups of gene data, which may return incorrect results with nonnegligible probability. In this paper, we analyze the relationship between the core parameter $\ell_{\mathrm{snp}}$ (the bit size of the encoding of a gene variation) and the CCE error and specify the condition that $\ell_{\mathrm{snp}}$ needs to satisfy, so that the probability of CCE errors is negligible.

Practical Homomorphic Encryptions
This section describes the homomorphic encryption schemes which are used in our genetic privacy protection. First, some symbols and parameters are described below.
For the security parameter $\lambda$, let the integer $M = M(\lambda)$ define the $M$th cyclotomic polynomial $\Phi_M(X)$. Throughout this paper, we assume that the integer $M$ is a power of two, so that $N = M/2$ and $\Phi_M(X) = X^N + 1$. Both of our homomorphic encryption schemes operate in the polynomial ring $R = \mathbb{Z}[X]/\Phi_M(X)$. $[\cdot]_Q$ denotes the reduction modulo $Q$ into the interval $(-Q/2, Q/2] \cap \mathbb{Z}$ of an integer or an integer polynomial (coefficient-wise). Set the plaintext space to $R_t := R/tR$ for some fixed $t \ge 2$ and the ciphertext space to $R_Q := R/QR$ for an integer $Q = Q(\lambda)$. Let $\chi = \chi(\lambda)$ denote a noise distribution over the ring $R$. The notation $a \leftarrow \chi$ denotes that $a$ is chosen from the distribution $\chi$, and $a \xleftarrow{R} D$ denotes that $a$ is randomly chosen from the distribution $D$.
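The centered reduction into the interval (−Q/2, Q/2] described above can be illustrated with a minimal Python sketch. The function names `centered_mod` and `centered_mod_poly` are chosen here for illustration and do not come from the paper or its code.

```python
def centered_mod(a: int, Q: int) -> int:
    """Reduce a modulo Q into the interval (-Q/2, Q/2]."""
    r = a % Q          # Python's % returns a value in [0, Q) for Q > 0
    if r > Q // 2:     # values above Q/2 are moved to the negative side
        r -= Q
    return r


def centered_mod_poly(coeffs, Q):
    """Apply the centered reduction coefficient-wise to a polynomial."""
    return [centered_mod(c, Q) for c in coeffs]
```

For an even modulus the upper endpoint Q/2 is kept as is, which matches the half-open interval in the notation.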
We give the brief introduction of the RLWE scheme [22] and the Ring-GSW scheme [23].
2.1. The RLWE Scheme. The first basic homomorphic encryption scheme is based on the hardness of the Ring Learning with Errors (RLWE) assumption, which was proposed by Lyubashevsky, Peikert, and Regev. The RLWE assumption is divided into the decisional RLWE assumption and the computational RLWE assumption. The decisional RLWE assumption states that it is infeasible to distinguish the following two distributions [...]
The RLWE scheme is described as follows:
(iii) RLWE.Enc($m$, pk): for the input plaintext $m = \sum_i m_i X^i \in R_t$, choose a small polynomial $v \in R$ and two Gaussian polynomials $e_0, e_1 \leftarrow R$, and output the ciphertext ct: [...]
(iv) RLWE.Dec(ct, sk): given the ciphertext ct $= (c_0, c_1)$, output the plaintext $m$: [...]
(v) RLWE.Add(ct$_1$, ct$_2$, ct$_3$, sk): given three ciphertexts ct$_1$, ct$_2$, ct$_3$ with the same secret key sk, output the [...]

Conversion and Modulus Switching techniques have been introduced in [KSC17]. The Conversion technique can change an RLWE ciphertext of $m = \sum_i m_i X^i$ into an LWE encryption of its constant term $m_0$. The Modulus Switching technique reduces the ciphertext modulus $Q$ down to $q$ while preserving the message, thus reducing the size of the ciphertext.

The Ring-GSW Scheme. In 2013, Gentry et al. proposed an LWE-based homomorphic encryption scheme [16], which uses the approximate eigenvector method to express a ciphertext as a matrix, so that the addition and multiplication of ciphertexts no longer cause dimension expansion. In this paper, we use its RLWE version introduced by Ducas and Micciancio [23], and its encryption algorithm is given below:
(i) RGSW.ParamsGen($\cdot$): given the same parameters and secret key $s$ as in the RLWE scheme, set the decomposition base $B_g$ and exponent $d_g$ satisfying $B_g^{d_g} > Q$. Given a small matrix $G$ [...] for the $2 \times 2$ identity matrix $I$.
(ii) RGSW.Enc($m$, sk): [...] uniformly, and $e$ [...]
Let $\mathrm{Dec}_{B_g}(\cdot)$ denote the decomposition with the base $B_g$, so $m$ can be regarded as an approximate eigenvalue of $\mathrm{Dec}_{B_g}(\mathrm{CT})$ with the eigenvector $(1, s, \ldots$ [...]. Reference [17] defines a hybrid multiplication between an RLWE ciphertext ct and an RGSW ciphertext CT [...]. Thus the ciphertext $\mathrm{ct}_{\mathrm{H\cdot mult}}$ is an RLWE encryption of $m_{\mathrm{CT}} \cdot m_{\mathrm{ct}}$.

Table 1: The format of genome data.

Encoding and Encryption of Gene Data
Recall the task proposed by iDASH: secure biomarkers matching of encrypted genetic data, and in this section, we describe how to encode and encrypt the genomic data.

Genetic Data.
The gene data is stored in a semitrusted business cloud in VCF format. The database VCF file contains multiple genotype information entries, each of which consists of chrome (chr), position (pos), locus (loc), reference (ref), alternate (alt), and type. An example of the database is shown in Table 1. Chrome represents the chromosome where the gene is located, and it ranges over 1 to 22, X, and Y. Position represents the base position of the gene variation in the chromosome, and locus indicates the location of the gene.
Reference, alternate, and type give the base transformation information for the variation: reference represents the base information before the mutation occurs; alternate represents the base information after the mutation; type indicates the type of the mutation, including single-base variation (SNP), multibase mutation (SUB), insertion variation (INS), and deletion variation (DEL).
In fact, a gene mutation can be located by the chr and pos information alone, and the base change can be obtained by comparing the ref base and the alt base. In order to improve the efficiency of the program, we only match the chr and pos information between the patient and the cloud, and then retrieve the corresponding ref and alt information of the base variation at the same location in the database. Finally, the user compares the base change information from the cloud with his own base change information to get the final match result.

Encoding and Encryption of Genetic Data.
In this section, we describe how to encode the genomic data so that they can be used with the homomorphic encryption scheme. Let $d_i$ denote the position information of the $i$th entry in the gene database, $\alpha_i$ the variation information of the $i$th entry in the gene database, and $\alpha^{\mathrm{ref}}_i$ and $\alpha^{\mathrm{alt}}_i$ the integer encodings of the reference genome and the alternate genome, respectively.
For the coding of the gene position information, define a mapping from (chr, pos) to $d_i$: [...]
In the following we describe how the base variation information is encoded in [KSC17]. Firstly, they represent the common SNPs by two binary digits as A → 00, T → 01, G → 10, C → 11 and encode a base string in order. Then a 1 is padded to the left of the bit string so as to distinguish an A-string from the empty string. For instance, the base A will be encoded as $1 \,|\, 00 = (100)_2 = 4$, and the string CG will be encoded as $1 \,|\, 11 \,|\, 10 = (11110)_2 = 30$. $n_{\mathrm{SNP}}$ denotes the maximal number of reference (or alternate) alleles to be compared between the query genome and the genomes in the target database; thus the length of the base string is $\ell_{\mathrm{SNP}} = 2 \cdot n_{\mathrm{SNP}} + 1$. And in [KSC17] the encoding of base variation information is expressed as [...]
Our Contribution. The value of $n_{\mathrm{SNP}}$ in [KSC17] is set to 2, 5, or 10, but variations that insert or delete more than 10 bases do occur in practice. For example, the "alt" column of the second entry in Table 1 is GGAGGTTTCAGTGAGCT. If the patient's alt genome (the queried gene information) is GGAGGTTTCA, only the first 10 bases are compared, so the server will conclude that the patient is more likely to suffer from a genopathy. At the same time, a statistical analysis of the genetic database shows that the numbers of ref bases and alt bases are usually not symmetrical, and the number of bases after concatenating the ref and alt genomes mostly does not exceed 20. Therefore, a prefix code is used to encode the genome data. Firstly, the SNPs are encoded as A → 00, [...] Then, the string "111" is inserted between the ref and alt genomes when they are concatenated, which lets us separate the encodings of the ref and alt genomes correctly. Finally, a 1 is padded to the left of the bit string so as to distinguish the A-strings. Here is the formula for $\alpha_i$: [...]
Let $\ell_{\mathrm{snp}}$ denote the bit size of $\alpha_i$, and set $\ell_{\mathrm{snp}} = 44$ so as to expand the number of gene entries which can be correctly matched. If the length of $\alpha_i$ is less than 44 bits, then 0 bits are padded on the left of the bit string to ensure that the length of the encoding is 44 bits. For the case of length($\alpha_i$) > 44 bits, $\alpha_i$ is split into 44-bit components; the details are in Section 3.3.3.
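The base-string encoding attributed to [KSC17] above (two bits per base, with a leading 1) can be sketched in a few lines of Python. This is only an illustration: the text gives A → 00 and C → 11 explicitly and G → 10 through the CG example, while T → 01 is assumed here as the one remaining two-bit code. The function names are hypothetical, and the sketch does not cover the prefix code with the "111" separator.

```python
# Two-bit codes per base. T -> "01" is an assumption (the only code left over).
BASE_BITS = {"A": "00", "T": "01", "G": "10", "C": "11"}
BITS_BASE = {bits: base for base, bits in BASE_BITS.items()}


def encode_bases(bases: str) -> int:
    """Encode a base string as an integer: a leading 1 followed by 2 bits per base."""
    bits = "1" + "".join(BASE_BITS[b] for b in bases)
    return int(bits, 2)


def decode_bases(code: int) -> str:
    """Invert encode_bases by dropping the leading 1 and reading 2 bits at a time."""
    bits = bin(code)[3:]  # strip the '0b' prefix and the leading 1
    return "".join(BITS_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))
```

With this mapping, `encode_bases("A")` gives 4 and `encode_bases("CG")` gives 30, matching the two examples in the text, and the empty string encodes to 1, so it stays distinguishable from any A-string.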
For example, the base variation information of the second entry in Table 1 will be encoded as 00001111101000101001010111000100110001011001, and it can be decoded in the way shown in Figure 2.
After the process of encoding the genetic data, the database file will be encoded as a set of pairs ( 푖 ,  푖 ) for  = 1, 2, . . ., .The encoding results of Table 1 are shown in Table 2.
The HE scheme in this paper is carried out on a polynomial ring, so it is necessary to express the integer pairs as a polynomial $\mathrm{DB}(X) = \sum_{i=0}^{N} c_i X^i \in R$, where [...] Since the $d_i$ from VCF files have a bit size of about 32, set $R \triangleq \mathbb{Z}[X]/(X^N + 1)$, $N = 2^{33} - 1$. The data owner (research institute) then encrypts the polynomial $\mathrm{DB}(X)$ with the RLWE public-key encryption scheme described above.
The query genes are also encoded as a pair of integers $(d, \alpha)$. However, the hospital or medical institution only needs to encrypt the monomial $X^{-d}$ with the RGSW symmetric encryption scheme. A corresponding mapping is defined as the specific mapping from a term in a polynomial in $R_t$ to terms in polynomials in $R_t^2$.
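To see why encrypting the monomial $X^{-d}$ is enough for a lookup, the following plaintext-only Python sketch multiplies a small database polynomial by $X^{-d}$ in $\mathbb{Z}[X]/(X^N + 1)$ and reads off the constant term. It illustrates the ring arithmetic only and involves no encryption. The helper names, the toy dimension N = 8, and the convention that the entry (d, alpha) is stored as the coefficient alpha of $X^d$ are assumptions made for illustration.

```python
def mul_by_monomial(coeffs, k):
    """Multiply a polynomial (list of N coefficients) by X^k in Z[X]/(X^N + 1).

    k may be negative. Since X^N = -1, X has order 2N, and a coefficient
    that wraps past degree N - 1 changes sign.
    """
    n = len(coeffs)
    out = [0] * n
    for i, c in enumerate(coeffs):
        j = (i + k) % (2 * n)
        if j >= n:
            out[j - n] -= c
        else:
            out[j] += c
    return out


def build_db(pairs, n):
    """Place each (d, alpha) pair as the coefficient alpha of X^d."""
    db = [0] * n
    for d, alpha in pairs:
        db[d] = alpha
    return db


db = build_db([(3, 30), (6, 4)], 8)       # toy database with N = 8
constant_term = mul_by_monomial(db, -3)[0]  # shifts the X^3 coefficient to degree 0
```

Multiplying by $X^{-d}$ moves the coefficient stored at position d to the constant term, so a position that is absent from the database yields 0 there.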
We found that there are three types of errors in [KSC17], named hash collision error (HCE), coefficient combination error (CCE), and losing of partial coefficient error (LPCE).In the following we will describe these errors and our solutions.

Hash Collision Error.
[KSC17] made use of SHA-3 to transform the 33-bit $d_i$ into a pair of two 11-bit integers $d^{*}_i$ and $d^{\dagger}_i$ in order to improve the efficiency of the scheme. The hash function maps 33 bits of information to 22 bits, which may cause collisions with a probability of approximately $2^{-22}$. Such a collision results in a searching error. Take 10,000 entries in the database as an example, and suppose that the user queries the position $d_1$, where $\mathrm{hash}(d_1) \to (d^{*}_1, d^{\dagger}_1)$. The probability that at least one database entry collides with the query, that is, has the same $d^{*}_1$ and $d^{\dagger}_1$, is $1 - (1 - 1/2^{22})^{10000} > 2^{-9}$; that is, the user might get a wrong result with a probability of more than $2^{-9}$.
What is more, this error cannot be avoided by repeating the algorithm.
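The bound quoted above can be checked numerically. The short Python sketch below evaluates $1 - (1 - 2^{-22})^{10000}$ using `math.log1p` and `math.expm1` to avoid rounding problems with the very small per-entry probability. The function name is chosen here for illustration.

```python
import math


def collision_probability(num_entries: int, hash_bits: int) -> float:
    """Probability that at least one of num_entries values collides with a
    fixed query under an idealized hash with hash_bits output bits:
    1 - (1 - 2^-hash_bits)^num_entries."""
    per_entry = 2.0 ** -hash_bits
    return -math.expm1(num_entries * math.log1p(-per_entry))


p = collision_probability(10_000, 22)
```

The result is roughly 0.0024, which lies between $2^{-9}$ and $2^{-8}$, consistent with the statement that the error probability exceeds $2^{-9}$ for a database of 10,000 entries.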
Our Contribution. For the HCE error, we abandon the hash function and decompose the index $d_i$ with the basis $2^{11}$, so that $d_i$ can be represented by three 11-bit parts $d^{*}_i$, $d^{\dagger}_i$, and $d^{\perp}_i$ [...] Then, we extend the mapping $\Phi$ to the mapping $\Psi$ [...], where $k$ is the number of polynomial groups [...], and extend the corresponding mapping accordingly [...] for one $j \in \{1, \ldots, k\}$. As a result, we can effectively avoid the collisions caused by the compression of the index $d_i$ and solve the HCE problem.
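The decomposition of a 33-bit index into three parts with the basis $2^{11}$ can be sketched as follows. Unlike a hash to 22 bits, this map is injective, so distinct indices can never collide. The function names and the digit order (least significant part first) are assumptions for illustration; the paper's own ordering of the three parts is not recoverable from the text.

```python
BASE = 2 ** 11  # the basis used to decompose the position index


def decompose_index(d: int, parts: int = 3) -> tuple:
    """Split d into `parts` base-2^11 digits, least significant first."""
    if not 0 <= d < BASE ** parts:
        raise ValueError("index does not fit in the requested number of parts")
    digits = []
    for _ in range(parts):
        d, r = divmod(d, BASE)
        digits.append(r)
    return tuple(digits)


def recompose_index(digits) -> int:
    """Inverse of decompose_index."""
    return sum(x * BASE ** k for k, x in enumerate(digits))
```

Because `recompose_index(decompose_index(d)) == d` for every 33-bit d, no information is lost, at the price of handling three parts per index instead of two.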

Coefficient Combination Error.
In this section, we describe how the CCE error happens. For all $o, p \in \mathbb{Z}_N$, the CCE error exists because [KSC17] cannot distinguish whether two coefficients $\alpha^{*}_{j,o}$ and $\alpha^{\dagger}_{j,p}$, picked respectively from $(\mathrm{DB}^{*}_j(X), \mathrm{DB}^{\dagger}_j(X))_{j \in \{1,\ldots,k\}}$, belong to the same mapping. This error may lead to the mistake that an entry that is not in the database is judged to be in the database.
There is a way to determine whether an integer pair $(d_i, \alpha_i)$ is in the database. Firstly the integer pair [...] then the integer pair $(d_i, \alpha_i)$ is judged to be in the database; otherwise the integer pair $(d_i, \alpha_i)$ is judged not to be in the database.
We give a brief description in Figure 3. The first line in Figure 3 represents the polynomial $\mathrm{DB}(X)$ with large dimension, and the second and third lines represent the polynomials $\mathrm{DB}^{*}_j(X)$ and $\mathrm{DB}^{\dagger}_j(X)$ with small dimension. All nonzero coefficients in the polynomials are labeled with short lines. If a patient wants to check whether the query entry $(d_i, \alpha_i) = (15, 9)$ exists in the database with $\mathrm{hash}(15) \to (1, 5)$, [KSC17] will conclude that this entry exists in the database, since the coefficients of the terms $5X^{1}$ and $4X^{5}$ satisfy $5 + 4 = 9$. This will cause the patient to be misdiagnosed as sick.
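The false positive in this example can be reproduced with a plaintext toy model in Python. The sketch assumes, for illustration only, that a lookup adds the coefficient found at the first position in one small polynomial to the coefficient found at the second position in the other and compares the sum with the queried encoding. The dictionaries, names, and matching rule are hypothetical and do not reproduce the [KSC17] code.

```python
def naive_lookup(db_star, db_dagger, d_star, d_dagger):
    """Sum the coefficients found at the two positions (0 if absent)."""
    return db_star.get(d_star, 0) + db_dagger.get(d_dagger, 0)


# Two unrelated database entries leave one coefficient each in the same group:
db_star = {1: 5}     # the term 5*X^1 comes from one entry
db_dagger = {5: 4}   # the term 4*X^5 comes from a different entry

query_alpha = 9              # the query (15, 9) with hash(15) -> (1, 5)
result = naive_lookup(db_star, db_dagger, 1, 5)
false_positive = (result == query_alpha)
```

Here `false_positive` is true even though no single entry with encoding 9 was ever stored, which is exactly the coefficient combination described above.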
Our Contribution. The CCE error means that the query entry $(d_i, \alpha_i)$ does not exist in the database, but the scheme returns the result that the entry is in the database. If $d_i$ is decomposed into $d^{*}_i$, $d^{\dagger}_i$, and $d^{\perp}_i$, this error will happen only if the sum of [...] for one group $j \in \{1, \ldots, k\}$. The mapping $\Psi : R_t \to R_t^{k \times 3}$ will generate $k$ groups of polynomials, and the probability that at least one group has a collision with the given query entry [...] Through a similar analysis, we find that the probability of the CCE error in [KSC17] also satisfies this formula. When [KSC17] uses the parameters $n_{\mathrm{snp}} = 5$, $\ell_{\mathrm{snp}} = 11$, $k = 100$, this probability will be as high as [...]
There are two factors we need to consider for the parameter $\ell_{\mathrm{snp}}$. Firstly, the size of the coefficients needs to be a multiple of 11 bits. Secondly, the size of the coefficients needs to be large enough to decrease the probability of the CCE error, while the scheme's efficiency should also be taken into account. Consequently we set $\ell_{\mathrm{snp}} = 44$ to meet the requirements for security and efficiency, which decreases the probability of the CCE error to $2^{-37.4}$.

Losing of Partial Coefficient Error.
In this section, we describe how the LPCE error happens. [KSC17] sets $\ell_{\mathrm{snp}} = 21$ bits, while we found that there are some entries in the database whose encodings are longer than 21 bits, and the longest one even needs 272 bits to be represented. For those entries whose encodings are longer than 21 bits, [KSC17] keeps only the leftmost 21 bits of the encoding and discards the other bits. This truncation causes the LPCE problem. In 3.2 we extend $\ell_{\mathrm{snp}}$ to 44 bits, but this alone is still not enough to match all entries correctly.
Our Contribution. In order to solve this problem, we decompose these long encodings into 44-bit components, then optimize, encrypt, and query the components separately. Suppose that the given entry with a large coefficient is $(d_i, \alpha_i)$. First, $\alpha_i$ is decomposed into 44-bit pieces, giving a set of smaller components $(\alpha_{i,o}, \ldots, \alpha_{i,1})$, where [...] Second, we construct and output multiple new entries with smaller coefficients $(\alpha_{i,o}, d_i), \ldots, (\alpha_{i,1}, d_i)$. Finally, these entries are mapped and optimized separately. As a result, we can represent and query all entries in the database effectively and solve the LPCE problem.
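Splitting a long encoding into 44-bit components can be sketched in Python, whose integers have arbitrary precision. The function names, the least-significant-first chunk order, and the way the components are paired with the position are assumptions for illustration; how the components of one entry are distributed over the database polynomials is not modeled here.

```python
CHUNK_BITS = 44  # component size used in the text


def split_encoding(alpha: int, chunk_bits: int = CHUNK_BITS) -> list:
    """Split a non-negative integer into chunk_bits-wide pieces, least significant first."""
    mask = (1 << chunk_bits) - 1
    chunks = []
    while True:
        chunks.append(alpha & mask)
        alpha >>= chunk_bits
        if alpha == 0:
            break
    return chunks


def join_encoding(chunks, chunk_bits: int = CHUNK_BITS) -> int:
    """Inverse of split_encoding."""
    return sum(c << (chunk_bits * k) for k, c in enumerate(chunks))


def expand_entry(d: int, alpha: int) -> list:
    """Turn one entry with a long encoding into several entries with short ones."""
    return [(d, c) for c in split_encoding(alpha)]
```

For the longest encoding mentioned in the text (272 bits), this yields seven components, each below $2^{44}$, and joining them recovers the original value, so no bits are lost to truncation.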

Secure Searching of Gene Data
Section 4.1 introduces the optimized algorithm for the coefficients and dimension of $\mathrm{DB}(X) = \sum_{i=0}^{2^{32}-1} c_i X^i$. Section 4.2 describes the searching algorithm for gene data. Section 4.3 shows our experimental results.

Dimension Optimization.
Since the encoded integers $d_i$ from VCF files have a bit size of about 32, while taking into account the security and efficiency of implementing HE schemes, a dimension $2^{11} < N < 2^{16}$ [...]
Input: $\mathrm{CT}^{*}_Q = \mathrm{RGSW.Enc}(X^{-d^{*}}, \mathrm{pk})$, [...]
4.3. Implementation of Secure Searching. In a fully homomorphic encryption scheme, the homomorphic operations are usually achieved by polynomial additions and polynomial multiplications as well as bootstrapping. As bootstrapping involves costly homomorphic decryption operations, and homomorphic decryption requires a larger ciphertext modulus to prevent decryption errors, fully homomorphic encryption is inefficient. The encryption scheme in this paper is a lattice-based somewhat homomorphic encryption scheme. As few homomorphic operations and no bootstrapping are involved in our scheme, a smaller modulus can ensure the correctness of the scheme. Therefore, the scheme inherits the efficiency of the [KSC17] scheme. For a 10K database, the commercial cloud server needs only hundreds of polynomial multiplications and additions, so the test results can be returned quickly. Details are provided in the evaluation phase of Section 4.2.
According to the application scenario of our scheme, we have implemented a three-party (hospital institution, commercial cloud service, and research institute) interactive experimental platform and run experiments on databases of different sizes (428 entries, 10k entries, and 100k entries). We implemented our scheme on a single 64-bit core (i7-6700HQ) at 2.60 GHz running Windows 7. The experimental data are listed in Tables 3 and 4 and shown in Figures 5 and 6. The source code of our implementation is available on github https://github.com/lonyliu/genetest. Experimental results show that our scheme supports secure searching of gene data for all entries in the genome database (compared to about 5‰ incorrect searches for gene data in [KSC17]). What is more, while maintaining high efficiency for secure searching of gene data, our scheme reduces the probability of a searching error to less than $2^{-37.4}$.

Conclusion
In this paper, we discussed how to privately perform secure genomic searching on a semitrusted business cloud with homomorphic encryption. Our scheme can support secure searching of multibase mutations of arbitrary length. What is more, we have solved three errors: hash collision error (HCE), coefficient combination error (CCE), and losing of partial coefficient error (LPCE).

Figure 1: The application scenario of secure testing for genetic diseases.

Figure 2: Coding structure of base variation information.

Figure 3: A brief description of CCE error.

Figure 4: The secure testing process for genetic diseases with details.

Table 2: Encoding of genetic data.

Table 3: Time of secure searching of biomarkers (ms).

Table 4: Storage of secure searching of biomarkers (MB).