For both convenience and security, more and more users encrypt their sensitive data before outsourcing it to a third party such as a cloud storage service. However, searching for the desired documents becomes problematic, since it is costly to download and decrypt every candidate document to check whether it contains the desired content. An informative query-biased preview feature, as provided by modern search engines, could help users learn about the content without downloading the entire document. However, when the data are encrypted, securely extracting a keyword-in-context snippet from the data as a preview becomes a challenge. Based on a private information retrieval protocol and the core concept of searchable encryption, we propose a single-server, two-round solution to securely obtain a query-biased snippet over the encrypted data from the server. We achieve this novel result by making a document (plaintext) previewable under any cryptosystem and constructing a secure index that supports dynamic computation of the best-matched snippet when queried by some keywords. For each document, the scheme has
Cloud storage provides an elastic, highly available, easily accessible, and cheap data repository to users who do not want to maintain their own storage or who simply value the convenience, and this way of storing data is becoming more and more popular. In many cases, especially when users want to store sensitive data such as business documents, security guarantees against the cloud provider are required, since internal staff may access the data maliciously. Directly encrypting the sensitive documents with traditional encryption techniques such as AES is not an ideal solution, since the user loses the ability to search effectively for the desired documents.
One solution for effectively searching over encrypted data is
Another solution for effectively searching for the desired data is content preview, which is the main topic of this paper. In a modern search engine, if a user searches for a web page by keywords, the search engine returns the name, URI, and a small
However, obtaining a query-biased snippet from encrypted data is quite challenging. To get a query-biased snippet from a plaintext, a general search engine must scan each matched document dynamically, extract the snippets where the keywords occur, rank the results, and finally return the
There are two major security problems. First, the snippet is a part of a document; therefore, the encryption scheme used may affect snippet retrieval. We use a pad-and-divide scheme to preprocess the document to make it compatible with any cryptosystem, such as DES or RSA. Second, the information in the index is private, and no partial information about the document should be leaked to the server. Therefore, we encrypt the index based on the core method of searchable encryption. Since each keyword maps to an entry in the index, directly returning the related score information for the queried keywords without any computation leaks the number of queried keywords (which equals the number of returned entries) to an eavesdropper, and the communication cost also grows with the number of requested keywords. A
In this paper, our contributions are the following. (1) To the best of our knowledge, we formalize the problem of securely retrieving query-biased snippet over encrypted data for the first time. We generalize the notion of
The rest of the paper is organized as follows. Section
We categorize the related work into four topics, and each topic is summarized separately.
A query-biased snippet is a piece of a document's content that contains the queried keywords. Query-biased snippet generation schemes are widely used in modern search engines. It is also named
Our proposed scheme and security model are based on searchable encryption technique. The basic goal of searchable encryption is to enable a user to privately search over encrypted data by keywords. The first scheme was introduced in [
There are many functional extensions for the basic searchable encryption schemes. Reference [
Our proposed additive coding method is based on the core concept of homomorphic encryption. The classical homomorphic encryption schemes are based on group operation such as the unpadded RSA in [
We encapsulate a private information retrieval (PIR) protocol and extend the use of it in our scheme. PIR schemes allow a user to privately retrieve the
We write
We write
A function
Let
Let
Sometimes, property (
Let
A computational PBR protocol scheme is a collection of four polynomial-time algorithms
In our preview scheme, we adopt as a primitive the computational PBR scheme introduced in [
Before introducing the preview scheme, we first introduce a novel coding method called matrix additive coding (Matrix-AC), which enables the addition of two rows of a matrix in a homomorphic fashion. It is very fast, is suitable for small numbers (each integer is coded to a specific bit string), and is especially useful for computing a statistical table in encrypted form. Because the coded integers are correlated with one another, Matrix-AC is not a homomorphic encryption scheme, which would encrypt each datum independently.
Matrix-AC is used in the preview scheme to construct the secure additive ranking index (SecARI). Because the preview scheme performs computation on a large number of small numbers, using a homomorphic encryption scheme is costly. Therefore, we use Matrix-AC as a substitute for homomorphic encryption to achieve better performance.
We note that, for all the schemes (including the preview scheme in the next section), we only consider the confidentiality of the data. Mechanisms for protecting data integrity are outside the scope of this paper.
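The property that SecARI needs from its coding layer, namely that a server can add coded rows without seeing the values, can be illustrated with a short sketch. It does not implement Matrix-AC. As a stand-in it uses additive masking modulo a fixed bound, with masks derived from HMAC-SHA256. The names `MODULUS`, `mask`, `encode_matrix`, `server_add`, and `decode_sum` are hypothetical, and the sketch has not been analyzed for security.

```python
import hashlib
import hmac
from typing import List

MODULUS = 1 << 16  # assumption: every column sum stays below this bound


def mask(key: bytes, row_id: int, col: int) -> int:
    """Derive a per-cell mask from the key (stand-in for the real coding)."""
    msg = f"{row_id}:{col}".encode()
    digest = hmac.new(key, msg, hashlib.sha256).digest()
    return int.from_bytes(digest[:4], "big") % MODULUS


def encode_matrix(key: bytes, matrix: List[List[int]]) -> List[List[int]]:
    """Data owner: hide every cell under its own additive mask."""
    return [
        [(value + mask(key, r, c)) % MODULUS for c, value in enumerate(row)]
        for r, row in enumerate(matrix)
    ]


def server_add(encoded: List[List[int]], row_ids: List[int]) -> List[int]:
    """Server: add the requested coded rows column by column."""
    width = len(encoded[0])
    return [sum(encoded[r][c] for r in row_ids) % MODULUS for c in range(width)]


def decode_sum(key: bytes, summed: List[int], row_ids: List[int]) -> List[int]:
    """User: remove the masks of the rows that were added."""
    return [
        (s - sum(mask(key, r, c) for r in row_ids)) % MODULUS
        for c, s in enumerate(summed)
    ]


if __name__ == "__main__":
    key = b"demo-key"
    scores = [[2, 1, 2], [1, 2, 1], [2, 5, 3]]  # illustrative score matrix
    coded = encode_matrix(key, scores)
    print(decode_sum(key, server_add(coded, [0, 2]), [0, 2]))
```

The user can strip the masks because the user knows which rows were requested. The server only ever handles masked values and their sums.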
The basic idea of coding small integers
For example (as shown in Figure
Example of vernier mapping and the basic coding procedure.
The problem of the basic scheme is that the vernier may be used up. That is why we set the restriction that a vernier is just used in a single vector. Another drawback is that, as the max
We extend the basic idea to code the data matrix. Let
Input
(
Let us check the homomorphism for decoding: let
A problem arises if the scheme is used directly in an application. In practice, there is no way to directly represent, for example, data of 5 bits (C++ provides an extended “bitset” class, but it treats the bits as a set, performs all operations over that set, and is very slow). In a computer, data are represented in bytes, and a valid number is stored in a whole number of bytes. Thus, 5-bit data are stored in one byte (8 bits) as a “character,” 12-bit data are stored in two bytes (16 bits) as a “short integer,” and 20-bit data are stored in four bytes (32 bits) as an “integer.” The XOR operation is therefore performed over bytes, and the data should be extended to such a standard length. However, since all data in Matrix-AC are in fact bit strings, the data in the same row can sometimes be “chained” together. For example, suppose
Intuitively, the scheme is secure if any two matrices (the numbers of elements are the same) prepared by the adversary are indistinguishable, which also implies that any two elements from the same matrix are indistinguishable. We define the security of Matrix-AC as follows.
If
We briefly prove the scheme since the mechanism is simple. We describe a PPT simulator
The preview scheme contains two steps: (1) storage, in which the data owner prepares the previewable document and a searchable index; (2) retrieval, in which the user privately retrieves the snippet from the server.
The basic idea of constructing a query-biased previewable document is as follows: divide the document into
Example of a snippet index.
| Keyword | | | | |
|---|---|---|---|---|
| | 2 | 1 | | 2 |
| | 1 | 2 | | 1 |
| | | … | | |
| | 2 | 5 | | 3 |
| | | … | | |
| | 3 | 2 | | 2 |
Retrieving the best snippet for a multi-keyword query proceeds as follows. The user submits the keywords to the server. The server retrieves the index rows that correspond to the submitted keywords and adds them together. The result is a single entry that contains the information about the best-matched snippet. The user decrypts the entry, selects the snippet identifier (index number) with the highest score (for simplicity, the score equals the keyword frequency), and privately retrieves that snippet from the server by running a PBR protocol. For the server to perform this addition over encrypted data, a homomorphic encryption scheme could be used to encrypt the index. As discussed previously, we adopt Matrix-AC instead of a standard homomorphic encryption scheme.
Now we introduce the definition and the security model of the preview scheme. We assume that the server is honest but curious. Additional methods could be added to make the solutions robust against malicious attacks; however, we restrict our discussion to the honest-but-curious setting. We also note that all documents are treated as text files, the same way a search engine treats them. For example, if a document is a web page, the style tags are pruned.
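The row-addition and selection steps can be sketched on a plaintext snippet index, leaving out encryption and the PBR retrieval entirely. The keywords, the scores, and the names `SNIPPET_INDEX`, `add_rows`, and `best_snippet` are illustrative assumptions and are not taken from the scheme's definition.

```python
from typing import Dict, List

# Hypothetical snippet index: keyword -> frequency of that keyword per snippet.
SNIPPET_INDEX: Dict[str, List[int]] = {
    "cloud": [2, 1, 0, 2],
    "storage": [1, 2, 0, 1],
    "secure": [2, 5, 1, 3],
}


def add_rows(index: Dict[str, List[int]], keywords: List[str]) -> List[int]:
    """Server side: add the rows of all queried keywords into one entry."""
    width = len(next(iter(index.values())))
    total = [0] * width
    for keyword in keywords:
        row = index.get(keyword)
        if row is None:
            continue  # a keyword that is not indexed contributes nothing
        total = [a + b for a, b in zip(total, row)]
    return total


def best_snippet(scores: List[int]) -> int:
    """User side: pick the snippet identifier with the highest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


if __name__ == "__main__":
    entry = add_rows(SNIPPET_INDEX, ["cloud", "secure"])
    print(entry, best_snippet(entry))  # [4, 6, 1, 5] 1
```

In the scheme itself the rows are stored in coded form, so the server performs the same addition without learning the scores, and only the user can read the resulting entry.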
The secure query-biased preview (SecQBP) scheme contains two parties: a user
Without loss of generality, we consider the construction for a single document. The scheme could be extended to a document collection with ease. Now we define the SecQBP scheme as follows.
SecQBP scheme is a collection of six polynomial-time algorithms
Informally speaking, SecQBP must guarantee that, first, given the encrypted document
Let
We say that SecQBP is semantically secure against adaptive chosen keyword attack if, for all PPT adversaries
Note that, with
Now we describe the concrete construction for SecQBP. We first describe the constructions of some core components and then present the complete construction.
We consider the problem of extracting keywords from a document. In general, a keyword is followed by a separator. Thus, a typical snippet of 50 characters contains no more than 25 keywords. Another problem is that not all words are keywords, and such words, for instance “a,” “the,” and “and,” do not need indexing. Words of this kind appear in most sentences, so they are useless as keys for indexing a file. They are called stop-words and were first studied in [
A further problem is that the last word in a snippet may be cut off. In other words, the last word may be too long to fit in the remaining space, and it cannot be split into two words because neither part is a valid keyword. A general search engine simply omits such an overflowing word. However, when snippets are precomputed, omitting the word may lose a keyword: a query for the omitted keyword would return no matched snippet even though the document actually contains a match. Thus, we add the full word to the keyword sets of both snippets that contain part of it.
The basic idea for encrypting a document is to divide it into pieces of equal size; therefore, a padding scheme is needed when the last piece of the document is not long enough. We modify the CBC plaintext padding scheme introduced in [
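The handling of stop-words and cut-off words can be sketched as follows, assuming fixed-size snippets taken at character offsets and a simple alphanumeric tokenizer. The stop-word list and the function name `snippet_keywords` are hypothetical, and a real deployment would use a fuller stop-word list.

```python
import re
from typing import List, Set

# Illustrative stop-word list only.
STOP_WORDS = {"a", "the", "and", "of", "to", "in"}


def snippet_keywords(text: str, size: int) -> List[Set[str]]:
    """Return one keyword set per fixed-size snippet of `text`.

    A word that straddles a snippet boundary is added, in full, to the
    keyword set of every snippet that contains part of it.
    """
    count = (len(text) + size - 1) // size
    keyword_sets: List[Set[str]] = [set() for _ in range(count)]
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        word = match.group().lower()
        if word in STOP_WORDS:
            continue  # stop-words are not indexed
        first = match.start() // size      # snippet holding the first character
        last = (match.end() - 1) // size   # snippet holding the last character
        for snippet_id in range(first, last + 1):
            keyword_sets[snippet_id].add(word)
    return keyword_sets


if __name__ == "__main__":
    for i, kws in enumerate(snippet_keywords("the secure preview of cloud storage", 10)):
        print(i, sorted(kws))
```

With a snippet size of 10, the word “storage” starts in the third snippet and ends in the fourth, so it appears in both keyword sets and a query for it matches either snippet.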
Let
set collection
The secure additive ranking index (SecARI) is the encryption form of the snippet index, as shown in Table
Example of a SecARI.
| Index | | | | |
|---|---|---|---|---|
| | | | | |
| | | | | |
| PAD | PAD | PAD | PAD | PAD |
| | | | | |
| | | | | |
| | | | | |
| PAD | PAD | PAD | PAD | PAD |
| | | | | |
| | | | | |
Let us consider the secure amount of the entries. If the document
To guarantee security, the number of entries
SecARI is in fact a sparse look-up table, and we use indirect addressing method to manage it. Indirect addressing method is also called
In addition, we make use of a pseudorandom permutation
Let
In order to hide the information about the number of queried keywords, a SecARI is not enough. When the user submits the queried multiple keywords, each query should be of the same length so that an eavesdropper cannot learn the information about the number of keywords in a query. Let the maximum number of keywords allowed in a single query be
We also determine the upper bound
Let
Note that this describes the scenario for a single document. The protocol also works for a document collection. Thus, the user could retrieve multiple snippets for multiple documents in the same round.
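The requirement stated earlier in this section, that every query have the same length regardless of how many keywords it contains, can be sketched as below. The sketch assumes that the PAD entries of the SecARI decode to all-zero scores, so adding them does not change the result, and that a query is sent as a list of row addresses. Both are assumptions made for illustration, as are the names `MAX_KEYWORDS` and `pad_query`.

```python
import secrets
from typing import List

MAX_KEYWORDS = 4  # hypothetical upper bound on keywords per query


def pad_query(row_ids: List[int], pad_row_ids: List[int],
              max_keywords: int = MAX_KEYWORDS) -> List[int]:
    """Fill a query up to a fixed length with addresses of PAD entries."""
    if len(row_ids) > max_keywords:
        raise ValueError("too many keywords in one query")
    missing = max_keywords - len(row_ids)
    if missing > len(pad_row_ids):
        raise ValueError("not enough PAD entries to fill the query")
    rng = secrets.SystemRandom()
    # Choose distinct PAD entries, then shuffle so position reveals nothing.
    query = list(row_ids) + rng.sample(pad_row_ids, missing)
    rng.shuffle(query)
    return query


if __name__ == "__main__":
    print(pad_query([3, 7], [100, 101, 102]))
```

Every query then carries exactly `MAX_KEYWORDS` addresses, so an eavesdropper who sees only the query length cannot tell a one-keyword query from a four-keyword one.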
The server stores the SecARI, performs homomorphic computation for a query, and returns to the user the score information as a single entry. We prove the security by introducing a theorem as follows.
If
We describe a polynomial-size simulator (Simulating (Simulating (Simulating
We claim that no polynomial-size distinguisher ( ( (
First, we compare the functionalities and performance of our work with previous works. Then, as a significant example, we discuss how to combine the preview scheme with symmetric searchable encryption to improve the user experience. We also discuss the performance of the preview scheme in the concrete application example.
Let
Comparisons of preview schemes.
| | Data type | Preview mode | Round | Communication | Storage | Computation |
|---|---|---|---|---|---|---|
| General search engine [ | Plaintext | Query biased | 1 | | | |
| Content mask [ | Plaintext or ciphertext | Static | 1 | | | |
| Our scheme | Ciphertext | Query biased | 2 | | | |
The query-biased preview mode is widely used in general search engine, as introduced in [
We review the generalized definition of symmetric searchable encryption (SSE) introduced in [
In guided mode, a symmetric searchable encryption scheme is a collection of six polynomial-time algorithms
The preview scheme is applied in SSE as follows. The user runs
We adopt SSE-2 introduced in [
Properties of SSE-2 + SecQBP.
| Properties | SSE-2 | SSE-2 + SecQBP |
|---|---|---|
| Adaptive adversaries | Y | Y |
| Number of servers | 1 | 1 |
| Server storage | | |
| Server computation | | |
| Number of rounds | 2 | 3 |
| Extra communication | | |
Let
The detailed performance of SSE is analyzed in [
In order to demonstrate the optimization for the server, we compare our suggested Matrix-AC scheme with the simplest and, as far as we know, the fastest symmetric homomorphic encryption scheme [
The algorithms are coded in C++ programming language and the server is a Pentium Dual-Core E5300 PC with 2.6 GHz CPU. The result is shown in Figure
Time cost for computing scores (single server, 100 users).
In this paper, we propose a generalized method for securely retrieving query-biased snippets over outsourced and encrypted data, which allows users to take a sneak preview of their encrypted data. The preview scheme has strong security and privacy guarantees with relatively low overhead, and it greatly improves the user experience.
Part of this work is supported by the Fundamental Research Funds for New Century Excellent Talents in Chinese Universities (Grant no. NCET-10-0298) and Ministry of Science and Technology of Sichuan province (no. 2012HH0003).