
Searchable symmetric encryption that supports dynamic multikeyword ranked search (SSE-DMKRS) has been intensively studied in recent years. Such a scheme allows data users to dynamically update documents and efficiently retrieve the most relevant ones. Previous schemes suffer from high computational costs, since their time and space complexities are linear in the size of the dictionary generated from the dataset. In this paper, by utilizing a shallow neural network model called “Word2vec” together with a balanced binary tree structure, we propose a highly efficient SSE-DMKRS scheme. The “Word2vec” tool effectively converts the documents and queries into a group of vectors whose dimensions are much smaller than the size of the dictionary, so we can significantly reduce the related space and time costs. Moreover, with the tree-based index, our scheme achieves sublinear search time and supports dynamic operations such as insertion and deletion. Both theoretical and experimental analyses demonstrate that the efficiency of our scheme surpasses that of previous schemes of the same kind, giving it wide application prospects in the real world.

Nowadays, with the development of network and virtualization technology, cloud computing has developed rapidly. Through cloud services, enterprises and individuals can obtain better computing and storage capabilities at a lower cost. Since cloud servers are not entirely trusted, utilizing cloud services while maintaining data privacy is an essential concern. A straightforward way to address this issue is to encrypt the data before outsourcing it to the cloud servers. However, this approach fails to meet the requirement of data retrieval, since traditional encryption scrambles the original data and makes it inconvenient to utilize. In this scenario, users have to download all the ciphertext data and decrypt it locally, which incurs huge transmission, storage, and computation overhead and is impractical in a cloud environment.

Searchable encryption (SE) supports keyword search without decrypting the data, and thus it is well suited to searching over ciphertext. In a symmetric SE scheme, data owners and authorized users share a secret key. Data owners encrypt the sensitive data and upload it to the cloud server. When data users want to search the encrypted data, they generate an encrypted trapdoor from the query and the secret key. When the cloud server receives the trapdoor, it tests the trapdoor against the encrypted data without decrypting it and returns the data related to the query to the users. The first searchable symmetric encryption (SSE) scheme was proposed by Song et al. [

The earliest ranked search schemes were proposed in [

Another kind of SE is searchable public key encryption (SPE), which is built on a public key system. In SSE, the key for encrypting data is the same as the key for generating search trapdoors. By contrast, in SPE, the public key for encrypting data is open to the public, while the secret key for generating search trapdoors is given only to the authorized data receivers. The first SPE scheme supporting keyword search was introduced by Boneh et al. and is known as public key encryption with keyword search (PEKS) [

Comparison between previous SE schemes and ours.

| Type | Ref. | Query condition | Additional special abilities |
|---|---|---|---|
| SSE | [ | Conjunctive keyword search | — |
| | [ | Multikeyword fuzzy search | — |
| | [ | Single-keyword ranked search | — |
| | [ | Multikeyword ranked search | — |
| | [ | Multikeyword ranked search | Dynamic update |
| | Ours | Multikeyword ranked search | Semantic search and dynamic update |
| SPE | [ | Conjunctive and disjunctive keyword search | — |
| | [ | Boolean keyword search | — |
| | [ | Single-keyword search | Fast search |
| | [ | Boolean keyword search | Access control |
| | [ | Multikeyword search | Data sharing |
| | [ | Multikeyword search | Semantic search |

The previous ranked search schemes in the symmetric key setting are secure and somewhat efficient. However, their index building, trapdoor generation, and search times are all linear in the size of the dictionary generated from the dataset, which is not suitable for the big data environment. According to the statistical information given in [, the dictionary of a real-world dataset can be huge (on the order of 10^{6} keywords). Therefore, it is necessary to construct a more efficient ranked search scheme. Motivated by this, in this paper, we aim to construct a novel SSE scheme supporting dynamic multikeyword ranked search (SSE-DMKRS) with high efficiency.

The main contributions are summarized as follows:

Based on “Word2vec” [

We propose an efficient index building algorithm which can create a balanced binary tree to index all the documents. The obtained index tree can achieve a sublinear search time and support dynamic update operations.

Through applying the secure

In addition, we implement our scheme on a widely used data collection. The experimental results show that our scheme dramatically reduces the time costs of index building, trapdoor generation, keyword search, and update without losing too much accuracy; e.g., the time cost of index building in our scheme is nearly 10% of that in the previous schemes. Meanwhile, the storage cost of the encrypted index is also greatly reduced; e.g., the storage cost of the index in our scheme is nearly one percent of that in the previous schemes. In conclusion, compared to the previous SSE-DMKRS schemes [

This paper is organized as follows. In Section

In this section, we first give the framework of the system model and introduce the threat model adopted in our scheme. Then, we introduce some tools used in our scheme, including the vector space model and a well-known term representation method from the field of natural language processing, namely, “Word2vec.” Finally, we present the design goals of our scheme. The main notations used in this paper are summarized in Table

Notations.

| Notation | Description |
|---|---|
| F | A document set, {f_{1}, f_{2}, …, f_{n}}. |
| n | The number of documents in F. |
| C | The encrypted form of F, {c_{1}, c_{2}, …, c_{n}}. |
| W_{i} | The keyword set of f_{i}. |
| m_{i} | The number of keywords in W_{i}. |
| w_{i,j} | A keyword in W_{i}. |
| V_{f_i} | The vector representation for f_{i}. |
| V_{w} | The vector representation for a keyword w. |
| u | A node in the index tree. |
| V_{u} | The vector representation for the node u. |
| {V′_{u}, V″_{u}} | Vector representations obtained by splitting V_{u}. |
| I_{u} | The encrypted index for the node u. |
| I_{T} | The encrypted index tree of F. |
| Q | The keyword set, {q_{1}, q_{2}, …, q_{t}}, for query. |
| q_{j} | A keyword in Q. |
| V_{Q} | The vector representation for query Q. |
| {V′_{Q}, V″_{Q}} | Vector representations obtained by splitting V_{Q}. |
| V_{q_j} | The vector representation for q_{j}. |
| T_{Q} | The trapdoor of Q. |
| D | The keyword dictionary. |
| M_{11}, M_{12}, M_{21}, M_{22} | Matrices for encryption (encryption key). |
| M_{11}^{−1}, M_{12}^{−1}, M_{21}^{−1}, M_{22}^{−1} | Matrices for decryption (decryption key). |
| N | The number of semantic keywords associated with each dictionary keyword. |
| m | The number of keywords in D. |
| d | The dimension of the vector generated by using “Word2vec.” |
| k | The number of files returned to the user. |

The system model contains three different roles: data owner, data user, and cloud server. The data owner outsources a group of documents F = {f_{1}, f_{2}, …, f_{n}} to the cloud in the ciphertext form C = {c_{1}, c_{2}, …, c_{n}}. Moreover, the data owner also generates an encrypted searchable index for the keyword search operation. For each query over an arbitrary keyword set Q, the data user generates a trapdoor T_{Q}. Upon receiving T_{Q} from the data user, the cloud server searches against the encrypted index and returns the candidate encrypted documents. After this, the data user decrypts the candidate documents and obtains the plaintext.

As illustrated in Figure

The data owner encrypts the documents {f_{1}, f_{2}, …, f_{n}} and generates a secure searchable index

System model of the keywords search over encrypted data.

Throughout the paper, we mainly utilize two threat models proposed by Cao et al. [

As mentioned before, we aim to build a secure and efficient SSE-DMKRS scheme. The design goal of our scheme is described as follows:

The “Word2vec” model is a shallow, two-layer neural network, which is used to convert words into a group of vector representations [

A vector space representation of words shows that “dog” is closer to “fox” since they share more common attributes than “dog” and “orange.”
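To make the geometric intuition concrete, the following toy sketch compares cosine similarities between hand-picked 3-dimensional vectors; the vectors are invented for illustration, whereas real “Word2vec” embeddings are learned from a corpus and typically have 100–300 dimensions:

```python
import math

# Hypothetical toy embeddings; "dog" and "fox" share more attributes,
# so their vectors point in similar directions.
vectors = {
    "dog":    [0.9, 0.8, 0.1],
    "fox":    [0.8, 0.7, 0.2],
    "orange": [0.1, 0.2, 0.9],
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# "dog" is closer to "fox" than to "orange" in the vector space.
assert cosine(vectors["dog"], vectors["fox"]) > cosine(vectors["dog"], vectors["orange"])
```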

The vector space model is a very popular method in the field of information retrieval, usually used along with the TF-IDF rule to realize top-k ranked search.

Through applying “Word2Vec” to a corpus, we create a dictionary in which each keyword is associated with a vector representation.

For the keyword set W_{i} of a document f_{i}, we use the dictionary to obtain a vector V_{f_i} representing f_{i}.

For the query keyword set Q = {q_{1}, q_{2}, …, q_{t}}, we utilize the dictionary to construct a vector V_{Q}.

Note that the dimensions of V_{f_i} and V_{Q} are both d, which is much smaller than the size of the dictionary.
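The conversion from keyword sets to fixed-dimension vectors can be sketched as follows; the uniform-weight sum and the toy dictionary below are illustrative assumptions (the scheme combines keyword embeddings following the TF-IDF rule):

```python
# Sketch: mapping a document's keyword set and a query into d-dimensional
# vectors via a keyword-embedding dictionary. The weighted-sum rule and the
# toy dictionary below are illustrative assumptions.

d = 4  # embedding dimension (the experiments use larger values, e.g., 200)

# Hypothetical dictionary produced by "Word2vec": keyword -> d-dim vector.
dictionary = {
    "cloud":  [0.5, 0.1, 0.3, 0.0],
    "search": [0.2, 0.6, 0.1, 0.1],
    "secure": [0.1, 0.2, 0.7, 0.2],
}

def to_vector(keywords, weights=None):
    """Combine keyword embeddings into a single d-dimensional vector."""
    weights = weights or {w: 1.0 for w in keywords}
    v = [0.0] * d
    for w in keywords:
        for i, x in enumerate(dictionary[w]):
            v[i] += weights[w] * x
    return v

doc_vec = to_vector({"cloud", "secure"})     # document-side vector
query_vec = to_vector({"search", "secure"})  # query-side vector

# Both vectors live in the same d-dimensional space, independent of the
# dictionary size, and their inner product serves as the relevance score.
score = sum(a * b for a, b in zip(doc_vec, query_vec))
```

Because d is fixed by the embedding model rather than by the vocabulary, growing the dictionary leaves the vector dimensions unchanged.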

In this section, we first give the algorithms of the index tree building and the search algorithm on this tree. Then, we give the concrete construction of our scheme and the dynamic update operations of our scheme. Finally, we give a detailed analysis of the security of our scheme.

In this section, we adopt a balanced binary tree to create the search index, which will be used in our main scheme. Inspired by the construction process in [

Formally, each tree node u stores a vector V_{u} and two pointers u_{l} and u_{r}, which point to the left and right children of u, respectively.

In our index tree, the vector of a node is generated by one of two methods, denoted T_{1} and T_{2}:

T_{1}: if the node u is a leaf node associated with a document f_{i}, its vector V_{u} is derived from the document vector V_{f_i}.

T_{2}: if the node u is an internal node, its vector V_{u} is computed from the vectors of its two children u_{l} and u_{r}, combining them component-wise with the minimum function Min(·) and the maximum function Max(·).

With this construction, the relevance score of an internal node upper-bounds the scores of all nodes in its subtree.

An illustration of the above methods is given in Figure

An example of the vector generation of a node using the methods T_{1} and T_{2}.

Based on the methods T_{1} and T_{2}, and inspired by the tree building algorithm introduced in [, we design Algorithm 1 to build the index tree: T_{1} is used to generate the leaf nodes, and T_{2} is used to generate the internal nodes.

Input: the document vectors V_{f_1}, V_{f_2}, …, V_{f_n};

For each document f_{i}, construct a leaf node u, with u_{l} = u_{r} = null and V_{u} generated from V_{f_i} by T_{1};

Insert u into CurrentNodeSet;

While CurrentNodeSet contains more than one node:

Create a parent node u for each pair of nodes u′ and u″ in CurrentNodeSet, with u_{l} = u′, u_{r} = u″, and V_{u} generated by T_{2};

Insert u into TempNodeSet;

If the number of nodes in CurrentNodeSet is odd, create a parent node u_{1} for the remaining node and insert u_{1} into TempNodeSet;

Replace CurrentNodeSet with TempNodeSet and clear TempNodeSet;

Set the only node left in CurrentNodeSet as the root of the tree;

\\Note that the resulting tree is a balanced binary tree.

An example of Algorithm
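The bottom-up construction can be sketched as follows. For simplicity, the sketch assumes non-negative vectors and takes the per-dimension maximum at internal nodes, which is one way to make a parent's score upper-bound its children's scores; the paper's own methods T_{1} and T_{2} define the actual leaf and internal vectors:

```python
# Minimal sketch of the bottom-up index-tree construction (Algorithm 1).
# Assumptions: non-negative vectors; internal vectors are per-dimension
# maxima of the children, so a parent's inner-product score upper-bounds
# its children's scores for non-negative queries.

class Node:
    def __init__(self, vec, left=None, right=None, doc_id=None):
        self.vec, self.left, self.right, self.doc_id = vec, left, right, doc_id

def parent_vec(a, b):
    """Component-wise maximum of two child vectors."""
    return [max(x, y) for x, y in zip(a, b)]

def build_tree(doc_vectors):
    """doc_vectors: list of (doc_id, vector); returns the root node."""
    level = [Node(v, doc_id=i) for i, v in doc_vectors]       # leaf nodes
    while len(level) > 1:
        nxt = []
        for j in range(0, len(level) - 1, 2):                 # pair neighbours
            l, r = level[j], level[j + 1]
            nxt.append(Node(parent_vec(l.vec, r.vec), l, r))  # internal node
        if len(level) % 2 == 1:                               # odd node moves up
            nxt.append(level[-1])
        level = nxt
    return level[0]

root = build_tree([(1, [0.2, 0.5]), (2, [0.7, 0.1]), (3, [0.3, 0.3])])
```

Each pass halves the number of nodes, so the resulting tree is balanced with height O(log n).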

For a query, its vector V_{Q} is generated by a third method, T_{3}. The illustration of this method is given in Figure

An example of the vector generation of the query by the method T_{3}.

For a query Q and a node u, the relevance score between them is computed as the inner product of their vectors: Score(u, Q) = V_{u} · V_{Q}.

We can utilize the above equation to evaluate which documents are the most related to the query. Moreover, we can verify that the score of the parent node is larger than its children’s score. This property can significantly reduce the number of nodes which will be checked in the search process.

The search process is given in Algorithm 2. It maintains a result list RList that holds at most k elements, each of which is a pair <score, f>. The query is first processed as follows, and then the tree is traversed from the root:

Split the query to generate its vector by the method T_{3};

If the current node u is an internal node, and RList holds fewer than k elements or Score(u, Q) is larger than the smallest score in RList, search the subtrees u_{l} and u_{r} recursively;

If the current node u is a leaf node and its score is larger than the smallest score in RList:

Delete the element holding the smallest relevance score in RList;

Insert a new element <Score(u, Q), f> into RList.

An example of an index tree and a search process on this tree is illustrated in Figure , where the tree indexes the documents {f_{1}, f_{2}, …, f_{6}} and the dimension of the vector for each node is 3.

Moreover, the figure shows the search process for a query with k = 3: the search starts by computing the scores between the query and the two children u_{11} and u_{12} of the root.

Because the score between the query and u_{11} is larger than that between the query and u_{12}, Algorithm 2 first enters the subtree with u_{11} as the root node and computes the scores between the query and the children of u_{11}. Since the score of u_{21} is larger than that of u_{22}, the algorithm first traverses the subtree with u_{21} as the root node and adds the leaf nodes f_{1}, f_{2} to the RList. After this, the subtree with u_{22} as the root node is traversed, and the leaf nodes f_{3} and f_{4} are reached. Since the number of files in RList is less than 3, f_{3} is added to RList directly. For the file f_{4}, since the number of files in RList now equals 3, Algorithm 2 compares the score of f_{4} with the minimum score in RList; because the score of f_{4} is smaller, f_{4} is not added to the RList. At present, the subtree with u_{11} as the root node has been traversed, so Algorithm 2 checks the subtree with u_{12} as the root node. As the score between the query and u_{12} is smaller than the minimum score in the RList (this property is described in Section ), the leaf nodes f_{5} and f_{6} will not be checked. Therefore, Algorithm 2 returns {f_{1}, f_{2}, f_{3}}.
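The pruned traversal described above can be sketched as a self-contained plaintext version; the vectors are toy values, and the pruning rule relies on the simplifying assumption that each internal node's score upper-bounds the scores of its descendants:

```python
# Self-contained sketch of the search procedure (Algorithm 2): depth-first
# traversal that keeps the best k leaves in RList and prunes a subtree when
# the subtree root's score cannot beat the current k-th best score.
# Assumption: each internal node's score upper-bounds its descendants'.

class Node:
    def __init__(self, vec, left=None, right=None, doc_id=None):
        self.vec, self.left, self.right, self.doc_id = vec, left, right, doc_id

def score(u, q):
    """Relevance score: inner product of the node vector and query vector."""
    return sum(a * b for a, b in zip(u.vec, q))

def search(u, q, k, rlist):
    """rlist: list of (score, doc_id) pairs, kept at size <= k."""
    if u.doc_id is not None:                            # leaf node
        s = score(u, q)
        if len(rlist) < k:
            rlist.append((s, u.doc_id))
        elif s > min(rlist)[0]:
            rlist.remove(min(rlist))                    # drop smallest score
            rlist.append((s, u.doc_id))
        return
    if len(rlist) == k and score(u, q) <= min(rlist)[0]:
        return                                          # prune whole subtree
    search(u.left, q, k, rlist)
    search(u.right, q, k, rlist)

# Tiny example: three documents, top-2 query.
l1 = Node([0.2, 0.5], doc_id=1)
l2 = Node([0.7, 0.1], doc_id=2)
l3 = Node([0.3, 0.3], doc_id=3)
p = Node([0.7, 0.5], l1, l2)
root = Node([0.7, 0.5], p, l3)

rlist = []
search(root, [1.0, 1.0], k=2, rlist=rlist)             # documents 1 and 2 win
```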

In this section, through combining the secure KNN algorithm [

KeyGen (

DictionaryBuild(F): given the document set F = {f_{1}, f_{2}, …, f_{n}}, the algorithm runs “Word2vec” to generate the dictionary D.

IndexBuild (sk,

Finally, with each node u encrypted into I_{u}, an encrypted index tree I_{T} is created.

TrapdoorGen (sk,

It generates a new keyword set Q′, which is initialized to an empty set.

Note that each keyword in the dictionary is associated with a group of keywords semantically related to it. For each keyword in Q, the algorithm randomly chooses some of its semantic keywords and adds them, together with the keyword itself, to Q′.

Then, based on Q′, the query vector V_{Q} is generated by the method T_{3}. After this, the algorithm splits V_{Q} to generate two random vector pairs.

Finally, this algorithm generates the trapdoor T_{Q}.

Search(sk, T_{Q}, I_{T}): for each node u in the encrypted index tree I_{T}, the algorithm computes the relevance score between I_{u} and T_{Q}.

According to equation ( ), the relevance score computed between the encrypted index I_{u} and the trapdoor T_{Q} equals the value of Score(u, Q).
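The score-preserving property of the encryption can be illustrated with a runnable two-matrix secure kNN sketch; note that the paper's construction uses four matrices M_{11}, M_{12}, M_{21}, M_{22}, and the dimension, matrices, and splitting vector below are all toy values chosen for illustration:

```python
import random

# Toy sketch of the secure kNN technique: index vector p and query vector q
# are split under a secret bit string S and encrypted with secret invertible
# matrices, yet the inner product of the ciphertexts equals p . q.
# All concrete values here are illustrative assumptions.

d = 2
S = [1, 0]                          # secret splitting bit string
M1 = [[2.0, 1.0], [1.0, 1.0]]       # secret invertible matrices
M2 = [[1.0, 1.0], [1.0, 2.0]]
M1_inv = [[1.0, -1.0], [-1.0, 2.0]]
M2_inv = [[2.0, -1.0], [-1.0, 1.0]]

def matvec(M, v, transpose=False):
    if transpose:  # computes M^T v
        return [sum(M[j][i] * v[j] for j in range(d)) for i in range(d)]
    return [sum(M[i][j] * v[j] for j in range(d)) for i in range(d)]

def split(v, randomize_on):
    """Random additive split where S[i] == randomize_on, plain copy elsewhere."""
    v1, v2 = v[:], v[:]
    for i in range(d):
        if S[i] == randomize_on:
            r = random.random()
            v1[i], v2[i] = r, v[i] - r
    return v1, v2

def enc_index(p):                   # index side: {M1^T p', M2^T p''}
    p1, p2 = split(p, randomize_on=1)
    return matvec(M1, p1, transpose=True), matvec(M2, p2, transpose=True)

def enc_query(q):                   # query side: {M1^-1 q', M2^-1 q''}
    q1, q2 = split(q, randomize_on=0)
    return matvec(M1_inv, q1), matvec(M2_inv, q2)

p, q = [0.3, 0.8], [0.5, 0.4]
I1, I2 = enc_index(p)
T1, T2 = enc_query(q)
enc_score = sum(a * b for a, b in zip(I1, T1)) + sum(a * b for a, b in zip(I2, T2))
# enc_score equals the plaintext inner product p . q, up to float error
```

This is why the cloud server can rank documents by computing scores directly on the ciphertexts without learning the underlying vectors.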

Besides the search operation, the proposed scheme also supports dynamic operations, e.g., document insertion and deletion, satisfying the requirements of real-world applications. Because the proposed scheme is built over a balanced binary tree, the update operations are realized by modifying the nodes in the tree. Inspired by the update method introduced in [

UpdateInfoGen(sk, T_{s}, f_{i}, Utype): this algorithm is executed by the data owners and generates the update information {I_{s}, c_{i}} for the cloud server, where T_{s} is a set containing all the update nodes, I_{s} is an encrypted form of T_{s}, f_{i} is the target document, c_{i} is an encrypted form of f_{i}, and Utype is the update type. In order to reduce the communication cost, the data owners store the unencrypted index tree on their own device. The two update types are handled as follows:

If Utype = “Del,” the algorithm deletes a document f_{i} from the tree. The algorithm first finds the leaf node associated with the document f_{i} and deletes it. In addition, the internal nodes associated with this leaf node are also added to T_{s}. Specifically, if the deletion operation would break the balance of the index tree, the algorithm can set the target leaf node as a fake node instead of removing it. After this, the algorithm encrypts T_{s} to generate I_{s}. Finally, the algorithm sends I_{s} to the cloud server and sets c_{i} as null.

If Utype = “Ins,” the algorithm inserts a document f_{i} into the tree. The algorithm first creates a leaf node for f_{i} according to the method T_{1} introduced in Section and inserts it into T_{s}. Then, based on the method T_{2}, the algorithm updates the vectors of the internal nodes on the path from the root to the new leaf node and inserts these internal nodes into T_{s}. Here, the algorithm prefers to replace a fake leaf node with the new leaf node rather than insert a new leaf node. Finally, the algorithm encrypts T_{s} and f_{i} to generate I_{s} and c_{i}, respectively, and sends them to the cloud server.

Update(I_{T}, I_{s}, c_{i}, Utype): this algorithm is executed by the cloud server to update the index tree I_{T} with the encrypted node set I_{s}. After this, if Utype = “Del,” the algorithm removes c_{i} from the encrypted document set C; otherwise, it adds c_{i} to C.

Note that after a period of insertion and deletion operations, the number of keywords in the dictionary may change. Because the dimensions of the index and trapdoor vectors in the previous schemes are linear in the number of keywords in the dictionary, these schemes have to rebuild the search index tree. By contrast, our scheme is not affected by this problem: the dimensions of the vectors in the index and trapdoor are determined by the “Word2vec” tool and set by the users. For example, if we set the vector dimension to 200, the dimension of each keyword’s vector is 200, and thus the dimensions of the index and trapdoor vectors remain 200 regardless of how the dictionary changes.

In this section, we analyse the security of the proposed SSE-DMKRS scheme according to the privacy requirement introduced in Section

A query keyword can be extended to 2^{N} different keyword sets, since each semantic keyword can be chosen or not. For example, if a keyword w is associated with the semantic keywords {w_{1}, w_{2}, w_{3}}, then there are 2^{3} = 8 possible keyword sets, including {w_{1}}, {w_{2}}, {w_{3}}, {w_{1}, w_{2}}, {w_{1}, w_{3}}, {w_{2}, w_{3}}, and {w_{1}, w_{2}, w_{3}}. Since the query is extended in this way, it can be represented by any of 2^{N} different semantic keyword sets. According to this method, the final similarity score is obfuscated by these random semantic keyword sets. As the analysis in [

In this section, we analyse the proposed SSE-DMKRS scheme theoretically and experimentally. A detailed experiment demonstrates that our scheme can efficiently perform dynamic ranked keyword search over encrypted data. Our experiments were run on an Intel® Core™ i7 CPU at 2.90 GHz with 16 GB of memory, using a real-world e-mail dataset called the Enron e-mail dataset [

The process of index building mainly consists of two steps: (1) creating an unencrypted index tree by utilizing Algorithm 1 and (2) encrypting the nodes of the tree. Since the tree contains O(n) nodes and encrypting each node’s d-dimensional vector costs O(d^{2}), the time complexity of index building is O(nd^{2}), which means that the time cost for index building mainly depends on the number of documents in F.

Since the dimensions of each node’s vector in X15 and G19 are both linear in the number of keywords m in the dictionary, the index building complexity of these schemes is O(nm^{2}). Due to d ≪ m, the index building cost of our scheme is much smaller than that of X15 and G19.

Figure

Impact of

In addition, because the index tree has 2n − 1 nodes and each node stores an encrypted d-dimensional vector, the storage cost of the index tree in our scheme is O(nd), while that of the previous schemes is O(nm).

Storage consumption of the index tree (MB).

| Dictionary size | [ | [ | Vector dimension | Proposed |
|---|---|---|---|---|
| | 188 | 174 | | 7 |
| | 219 | 190 | | 14 |
| | 251 | 206 | | 20 |
| | 283 | 222 | | 26 |
| | 315 | 238 | | 33 |

In our scheme, the query is converted into two vectors of dimension d, so the time complexity of trapdoor generation is O(d^{2}). By contrast, since the dimensions of the query vectors in X15 and G19 are both linear in the dictionary size m, the time complexity of trapdoor generation in these schemes is O(m^{2}). Thus, the time cost of trapdoor generation in our scheme is much less than that in X15 and G19. Particularly, from Figure

In the search process, if the relevance score of an internal node is not larger than the minimum score in RList, the whole subtree rooted at this node is pruned, so the actual search time is sublinear in the number of documents.

When the data owners want to insert or delete a document, they not only insert or delete a leaf node but also update the O(log n) internal nodes on the path from the root to that leaf. Since the encryption time for each node in our scheme is O(d^{2}), the time complexity of an update operation is O(d^{2} log n). For the X15 scheme, because the encryption time for each node is O(m^{2}), the time complexity of an update operation is O(m^{2} log n). For the G19 scheme, because the internal nodes are based on the Bloom filter, which is not encrypted, the time cost for updating the internal nodes can be ignored; thus, the time complexity of update in G19 is O(m^{2}), since only the leaf node is encrypted. From Figure

The search precision of our scheme is affected by a group of semantic keywords related to the original index and query keywords. We measure our scheme by adopting a metric called “precision” defined in [

In addition, the semantic keywords in the index and query keyword sets disturb the relevance score calculation in the search process, which makes it harder for adversaries to identify keywords in the index and trapdoor through statistical information about the dataset. To measure the extent of this disturbance, we use the following equation, called “rank privacy,” introduced in [, where r_{i} is the rank number of the document f_{i} in the returned results and r′_{i} is its rank in the real ranked results.
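The two metrics can be computed as follows; precision is the fraction of returned documents that belong to the real top-k, and rank privacy averages the rank perturbation |r_{i} − r′_{i}| normalized by k^{2}, while the penalty rank assigned to documents outside the real top-k is an assumption made here for illustration:

```python
# Sketch of the "precision" and "rank privacy" metrics used in the evaluation.
# The k+1 penalty rank for documents missing from the real top-k is an
# illustrative assumption.

def precision(returned, true_topk):
    """Fraction of the k returned documents that are in the real top-k."""
    return sum(1 for doc in returned if doc in true_topk) / len(returned)

def rank_privacy(returned, true_ranks):
    """Average |r_i - r'_i| over the returned docs, normalized by k^2."""
    k = len(returned)
    total = 0
    for i, doc in enumerate(returned, start=1):
        real = true_ranks.get(doc, k + 1)   # assumed penalty for misses
        total += abs(i - real)
    return total / (k * k)

returned = ["d2", "d1", "d5"]               # ranked list returned by a search
true_topk = {"d1", "d2", "d3"}              # real top-3 documents
true_ranks = {"d1": 1, "d2": 2, "d3": 3}    # real ranks

p = precision(returned, true_topk)          # 2 of 3 hits -> 2/3
rp = rank_privacy(returned, true_ranks)     # (1 + 1 + 1) / 9 -> 1/3
```

A scheme that perturbs scores more strongly trades a lower precision for a higher rank privacy, which is exactly the tradeoff examined below.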

We compare our scheme to the schemes of X15 and G19 in terms of “precision” and “rank privacy.” Note that an important parameter in the previous two schemes is a standard deviation σ, which controls the tradeoff between search precision and rank privacy.

The precision (a) and rank privacy (b) of searches with different numbers of retrieved documents (

From Figure

The dimension of the vector representation (

Impact of

The precision (a) and rank privacy (b) of searches with different vector dimensions (

From the experiment results, when

The experimental results show that the precision of our scheme is lower than that of the previous two schemes, while its rank privacy is correspondingly higher. In addition, by using the “Word2vec” method, the vector representations used in our scheme capture the semantic information of the documents and queries. Based on these facts, we argue that the proposed scheme is suitable for applications requiring similarity and semantic search, such as mobile recommendation systems, mobile search engines, and online shopping systems.

In this paper, by applying “Word2vec” to construct the vector representations of the documents and queries and adopting a balanced binary tree to index the documents, we proposed a searchable symmetric encryption scheme supporting dynamic multikeyword ranked search. Compared with the previous schemes, our scheme tremendously reduces the time costs of index building, trapdoor generation, search, and update. Moreover, the storage cost of the secure index is also reduced significantly. Considering that the precision of our scheme can be further improved, we will construct a more accurate scheme based on recent information retrieval techniques in future work.

The data used to support the findings of this study is available from the following website:

The authors declare that they have no conflicts of interest regarding the publication of this paper.

The authors gratefully acknowledge the support of the National Natural Science Foundation of China under Grants nos. 61402393 and 61601396 and the Nanhu Scholars Program for Young Scholars of XYNU.