MUSE : An Efficient and Accurate Verifiable Privacy-Preserving Multikeyword Text Search over Encrypted Cloud Data

With the development of cloud computing, services outsourcing in clouds has become a popular business model. However, due to the fact that data storage and computing are completely outsourced to the cloud service provider, sensitive data of data owners is exposed, which could bring serious privacy disclosure. In addition, some unexpected events, such as software bugs and hardware failure, could cause incomplete or incorrect results returned from clouds. In this paper, we propose an efficient and accurate verifiable privacy-preserving multikeyword text search over encrypted cloud data based on hierarchical agglomerative clustering, which is namedMUSE. In order to improve the efficiency of text searching, we proposed a novel index structure, HAC-tree, which is based on a hierarchical agglomerative clustering method and tends to gather the high-relevance documents in clusters. Based on the HAC-tree, a noncandidate pruning depth-first search algorithm is proposed, which can filter the unqualified subtrees and thus accelerate the search process. The secure inner product algorithm is used to encrypted the HAC-tree index and the query vector. Meanwhile, a completeness verification algorithm is given to verify search results. Experiment results demonstrate that the proposed method outperforms the existing works, DMRS and MRSE-HCI, in efficiency and accuracy, respectively.


Introduction
IT resources, such as computing and storage, are treated as the on-demand services in clouds nowadays, which is shown as "X as a Service."In order to reduce the cost of data management and storage, data owners (DOs) who have a large amount of data usually choose to outsource their data to clouds.However, DOs cannot directly control their data placed in remote cloud servers, which may cause concerns about their outsourced data being illegally acquired or abused by the cloud service providers (CSPs), especially for the privacy-sensitive data, such as medical records, government documents, and emails.Although many CSPs claim that their cloud services have several security countermeasures, such as access control, firewall and intrusion detection, doubts about the security, and privacy of outsourced data that are still the main obstructions to the wider development of cloud computing [1].
A general approach to protecting data privacy is to encrypt the data before outsourcing [2].However, this will make a significant difficulty and cost in terms of data management and utilization.In the field of information retrieval (IR), the existing retrieval technique based on multikeyword is mainly for the plaintext data and cannot be directly applied to the encrypted data.It is obviously unrealistic and wasteful to download all encrypted data from clouds to the local for decryption.In addition, due to hardware/software failures, storage device failure, and so forth, the search results may contain corrupted or incorrect data.If users cannot verify the completeness and correctness of search results, the upper-level decisions based on the search results may be misleading.Therefore, it is a challenge to research and give a searchable encryption scheme that supports verifiable privacy-preserving multikeyword text search over encrypted cloud data, which has become one of the hot issues in cloud computing recently [3][4][5][6][7][8].
In order to deal with the above problems, some encrypted data search methods [9][10][11][12] are proposed which utilize kinds of cryptographic techniques, such as homomorphic encryption and public-key cryptography.They are proved secure in text searching, but they usually need massive mathematical operations and cause high computational overhead.Hence, these methods do not adapt to the cloud computing scenario where data storage is very large and online data processing is the basic requirement.Besides, the relationship between documents is not taken into account during the search process, such as the category which describes the classification relationship of documents.In the design of encrypted documents search scheme, if we consider this type of documents relationship, it could improve the search efficiency and accuracy.However, the category relationship has been concealed by the blind encryption in the traditional methods.Therefore, it is desirable and helpful to maintain and utilize the category relationship to perform efficient and accurate text search for encrypted outsourced documents.
In this paper, we propose an efficient and accurate verifiable privacy-preserving multikeyword text search over encrypted cloud data (MUSE), which is based on hierarchical agglomerative clustering tree index (HAC-tree).We use the TF-IDF model and vector space model to represent every document and the interested keywords of queries as vectors, which means that each of them is considered as a point in a high-dimensional space.The secure inner product computation, adapted in the secure kNN [13], is used for measuring the relevance score between two documents or a document and a query.Based on the hierarchical agglomerative clustering, we propose a novel index structure, HAC-tree, which is constructed from the bottom leaf nodes to the upper root node.Firstly, documents are initialed as leaf nodes.Then the internal nodes are generated by clustering pairs of child nodes according to the sequence of their relevance scores level by level until the root node.Each internal node is with a pruning vector which is the extract maximum vector of its child nodes.On the basis of HAC-tree, we propose a noncandidate pruning depth-first search algorithm (NCP-DFS) for searching top-k relevance score documents recursively.The pruning vectors of internal nodes are used to filter the noncandidate subtrees which impossibly contain search results; thus the search space is narrowed and the search process speeds up while the result accuracy is not reduced.To verify the completeness and correctness of the search results, digests are generated for documents which are stored in leaf nodes of the HAC-tree.Upon returning the search results, a verification object (VO) is constructed for the result documents which is returned to data user (DU) along with the results.When DU receives the returned data from CS, a VO reconstruction procedure is performed to verify the completeness and correctness of the search results.
Our contributions of this paper are summarized as follows: (1) According to the basic idea of hierarchical agglomerative clustering, we propose a novel index structure HAC-tree and the corresponding bottom-up construction algorithm.The HAC-tree is the lowest binary tree if the leaf nodes number is fixed.Moreover, documents with higher relevance score between them (which means those documents may belong to similar categories) are always clustered.
(2) On the basis of HAC-tree, we propose a noncandidate pruning depth-first search algorithm (NCP-DFS) for multikeyword text search.It can prune the subtrees that surely do not contain any search results; thus the search efficiency will be improved.
(3) We define digests for documents and propose a result verification algorithm to check whether the search results are complete and correct.This algorithm can detect the damaged or incomplete results caused by hardware of software faults.
(4) Based on the above methods, we propose two verifiable privacy-preserving multikeyword text search schemes over encrypted cloud data, named BMUSE and EMUSE, respectively.BMUSE is a basic scheme which can resist known ciphertext threat, while EMUSE is an enhanced scheme which can resist known background threat.
(5) We compare the proposed method to DMRS [7] and MRSE-HCI [8] on a real dataset.Our results show that (i) the proposed method outperforms DMRS in terms of efficiency without losing accuracy; and (ii) the proposed method consistently gives more accurate results than MRSE-HCI while it may be less efficient than MRSE-HCI, depending on the choices of parameter values in MRSE-HCI.
The paper is organized as follows.Section 2 describes the related work.Section 3 gives the main notations and necessary preliminaries.Section 4 gives model and problem statement.Section 5 presents the index structure HAC-tree and its construction algorithm.On the basis of HAC-tree, the plaintext multikeyword search algorithm and search results verification are given.Section 6 presents the basic and enhanced schemes of the secure multikeyword text search.In Section 7, we analyze the security of the schemes.Section 8 carries on the experiment, comparing with the existing schemes in terms of the search results accuracy and the search efficiency.Section 9 concludes this paper.
(i) Single Keyword Searchable Encryption.Song et al. [26] first propose symmetric searchable encryption based on pseudorandom function and symmetric encryption mechanism and proof the security rigorously.Goh [14] gives security definitions for the formalization of the security requirements of searchable symmetric encryption schemes.Subsequently, many improvement and novel methods are proposed [27][28][29].Boneh et al. [10] propose a public-key ciphertext search algorithm named PEKS based on bilinear mapping and IBE encryption.In this algorithm, encrypted data by the public key can be authenticated by the gateway and sent to the corresponding user, but the real content will not be revealed.However, none of the above works ranks the results.The cloud server has to return all the results to meet the query request, resulting in unnecessary bandwidth and processing power overhead.In order to return only the most relevant search results, Wang et al. [30] present keyword ranked search over encrypted data based on TF × IDF and Order-Preserving Symmetric Encryption (OPSE).However, the above works only focus on single keyword search and cannot be applied to the scenario of multikeyword search which is the main concern of this paper.
(ii) Multikeyword Searchable Encryption.In the typical case of search over encrypted data in cloud computing, a single keyword search cannot express the data user search intention sufficiently.The cloud server will inevitably return an excessive number of matches, where most will probably be irrelevant for the user.Multikeyword search allows data users to characterize their own requests from multiple perspectives, ensuring that the search results are the most relevant documents with the query.Bilinear pairing-based solutions are presented in [31][32][33].The results in bilinear pairing-based solutions are free from false positives and false negatives caused by hashing.However, computation costs of pairing-based solutions are prohibitively high both on the server and on the user side.Pang et al. [34] propose a secure search scheme based on vector space model.Without the security analysis for frequency information and practical search performance, it is unclear whether there is keyword privacy disclosure or not.Besides, the practical search performance is absent from the demonstration of their experiment.Cao et al. [3] define and solve the challenging problem of privacy-preserving multikeyword ranked search over encrypted cloud data (MRSE), which adopt the similarity measure of "coordinate matching" to capture the relevance of data documents to the search query and ignore frequency information, leading to low accuracy in results.Meanwhile, MRSE needs a huge computational overhead.Sun et al. [4] use MDB-tree as index structure to improve the efficiency of MRSE, named MTS.Each level of the MDB-tree represents a subvector instead of an attribute domain in the database scenario, which leads to a decrease in accuracy.The index vector clustering further degrades the retrieval accuracy.Xia et al. [7] present a secure and dynamic multikeyword ranked search scheme (DMRS), which conduct a tree-based index structure to ensure accurate relevance score calculation between the encrypted document vector and the query vector.DMRS is significantly superior to MTS in accuracy.However, owing to neglecting the relationship of documents, this scheme still brings a lot of calculation cost.Therefore, there is still much room for improvement in search efficiency.In this paper, we focus on how to utilize the relationship of documents to improve the search efficiency.
Obviously, when constructing the index, if the similarity of documents can be classified and make their access paths as close as possible, it will make a huge contribution to the search efficiency through multikeyword of interest.Chen et al. [8] use the idea to propose a multikeyword ranked search over encrypted data based on hierarchical clustering index (MRSE-HCI).This method uses -means to cluster the documents based on the minimum relevance threshold and the maximum size of subcluster.Searching in the most relevant subcluster can achieve a linear computational complexity against an exponential size increase of document collection.Nevertheless, the significant improvement in efficiency is at the expense of accuracy, which reduces the accuracy and cannot fulfill user expectation well.In this paper, we not only improve the top- search efficiency but also ensure the accuracy of search results.Although the search efficiency of our work may be lower than MRSE-HCI in some cases depending on the choices of parameter values in MRSE-HCI, the search result accuracy is much higher than MRSE-HCI.
In addition, Sun et al. [35] use Merkle hash tree and cryptographic signature to build a verifiable MDB-tree.Chen et al. [8] design a minimum hash subtree as a verifiable structure.However, their works need to transmit a lot of verification object.Wan and Deng [36] propose the adapted homomorphic MAC technique and random challenge technique with ordering for verifying top- search results.However, the method requires a linear search of all documents and have poor efficiency performance.Hence, a proper mechanism should be adopted which really reduces the transmission cost of the verification object.

Preliminaries
Vector Space Model.Vector space model along with TF-IDF algorithm is very popular in the information retrieval area, which is also widely used in secure multikeyword search [3,4,7,8,16,18].We adopt the classic definitions of term frequency (TF) and inverse document frequency (IDF).The former refers to the number of times a given keyword or term exists in documents while the latter is calculated through dividing the total number of documents in the collection by the number of documents having the keyword.(2).Obviously, the lengths of the document vector and query vector are equal to the capacity of the keywords dictionary and each element of them is nonnegative.
where TF Relevance Score Measurement.In this paper, we use the same measurements in [8] to quantify the relevance score between a pair of documents and between a document and a query (which is represented by the interested keywords of DU).It is also used to quantify the relevance score between a pair of cluster centers and between a cluster center and a query.The calculation of the above relevance score can be unified as the inner product of two vectors which is shown in Secure Inner Product Operation.To achieve the goal of privacy preserving, we adopt the secure inner product operation which is proposed in [13].The operation is able to calculate the inner product of two vectors without knowing their plaintext value.Its basic idea is as follows.Assume that  and  are two -dimensional vectors and  is a random  ×  invertible matrix that is treated as a secure key.The encrypted forms of  and  are denoted as p and q, respectively, where p =   , q =  −1 .The inner product of p and q is calculated as (4) which indicates that p ⋅ q =  ⋅ .Note that we can get the inner product of two vectors without knowing the plaintext.
The space vector model, inner product of two vectors, and secure inner product operation are widely used in the existing works [3,4,7,8,17,18,35].In this paper, we will use them to design the secure multikeyword text search schemes over encrypted cloud data.

Model and Problem Statement
4.1.System Model.The system considered in this paper is the same as [3-5, 7, 8, 16-18] which consists of three entities: the data owner (DO), the data user (DU), and the cloud server (CS).As shown in Figure 1, their collaboration is as follows.
(1) DO Owns the Sensitive Data.To protect the privacy of the data, DO encrypts FS into ciphertext .In order to make  searchable in CS, DO constructs a secure index Ĩ on FS.Then DO outsources  and Ĩ into CS.Besides, DO can grant the authorizations of accessing their outsourced data to DU.
(2) CS Provides Services, Such as Data Storage and Management.On one hand, CS is in charge of storing the outsourced data from DOs.On the other hand, upon receiving a search trapdoor   from DU, CS performs the secure search by Ĩ and returns the qualified encrypted documents as search results to DU.
(3) DU Is the User Authorized by DO Who Can Legally Search the Data Stored in CS.When DU starts a multikeyword search, its interested keywords are transformed into a corresponding trapdoor   according to the authorization from DO. Then   is submitted to CS as the search command.After DU receives search results from CS, it decrypts the ciphertext and verifies the completeness and correctness of the result.

Search Model.
Given a set of  interested keywords   = { 1 ,  2 , . . .,   }, a multikeyword text search is the process of calculating relevance scores between all documents in FS and   and returning the top- results from the highest score.Formally, we define the multikeyword search as follows: where  is the quantity of requested documents and  ≪  generally.
Assuming that   is represented by a query vector , each document   ∈ FS is represented by a document vector   , and the relevance score between   and  is calculated according to (3).Then, the search results  of Query = (FS,   , ) satisfies

Threat Model.
We consider the same threat model as [3-5, 7, 8, 16-18], which assumes that DO and DU are trusted, but CS is considered as "honest-but-curious" model.That is to say, CS honestly implements protocols and returns search results, but it is curious about inferring and analyzing the plaintext outsourced data, searchable index, and messages that are received during protocol executions.We assume that CS knows not only ciphertext but also the encryption and decryption algorithms, but it has no idea of keys.According to the background information that CS knows, we adopted two threat models as follows.
Known Ciphertext Threat Model.In this model, CS not only observes the ciphertext, including the encrypted documents , the secure index Ĩ, and the search trapdoor   , but also knows a small part of plaintext index   and the corresponding encrypted index Ĩ .The threat model here is more aggressive than the known ciphertext model adopted in [3,7,8] and is the same as Level 3 adopted in [18].
Known Background Threat Model.In this stronger model, CS is supposed to possess more knowledge than the known ciphertext model, such as the term frequency statistics of document collection.This statistical information records how many documents are there for each term frequency of a specific keyword in FS, which can be utilized by CS to apply TF statistical attacks and hence deduce or even identify certain keywords through analyzing histogram and value range of the corresponding frequency distributions [7].
In addition, in some unexpected events, such as software bugs and internal/external attacks, CS may behave beyond the "honest-but-curious" model and return incorrect or incomplete search results.Thus, enabling search results verification mechanism that can detect useless results is also significant in terms of practicality and worth further investigation.

Problem Statement.
Given a search request Query = (FS,   , ) and  is the corresponding search results returned from CS.We focus on the multikeyword text search schemes over encrypted cloud data, whose goal is to achieve the following functions and security guarantees: (i) Multikeyword search: CS is able to return  which includes  most relevant documents with the query.
(ii) Search efficiency and accuracy: the comprehensive performance of search efficiency and accuracy should be ensured which means that the search process should be efficient while the accuracy of search results should not be reduced.
(iii) Privacy preserving: in the search process, the schemes should not reveal the confidentiality of documents, index, query keywords, and trapdoor unlinkability [3,7,8].
(iv) Completeness verification: DO can verify the completeness and correctness of search results to make sure that the result documents are right and useful.

MUSE Framework.
To achieve the multikeyword search scheme over encrypted cloud data (MUSE), we give the framework of MUSE shown in Figure 2. It consists of five modules, such as GenKey, BuildIndex, GenTrapdoor, DoSearch, and Verify, whose functions are described briefly as follows.
GenKey.DO generates the secure key SK.
BuildIndex.DO constructs the plaintext index tree  and then uses SK to encrypt  into the encrypted form Ĩ.
GenTrapdoor.DU transforms its interested query keywords   into the trapdoor   .Then DU sends   to CS.
DoSearch.CS performs the secure multikeyword text search to obtain search results and then return them to DU.
Verify.DU receives the search results from CS and verifies their completeness and correctness. . . . . . .

Index Design and Algorithms
According to the existing works, we know that the index is the key to realizing the secure multikeyword text search over encrypted cloud data, and its structural characteristics directly determine the search efficiency and accuracy of search schemes.It is reasonable that if documents in the same category are settled in the nearby access paths in an index, the search efficiency could be improved.Therefore, we adopt a hierarchical agglomerative clustering technology to design a novel index structure HAC-tree and give the construction algorithm.On the basis of HAC-tree, an efficient and accurate multikeyword search algorithm and search results verification algorithm are proposed.It is noticeable that the index in this section is unencrypted.

Hierarchical Agglomerative Clustering.
The hierarchical agglomerative clustering (HAC) is one of the important and commonly used clustering methods in data mining area.In the hierarchical agglomerative clustering procedure, the lower-level cluster pairs are merged into higher-level clusters according to the similarity of clusters until the root cluster is formed.By this way, a binary clustering tree is constructed from the bottom leaf nodes to the upper root node.A leaf node of a tree represents an initial cluster containing only one item while an internal node represents a cluster consisting of the items stored in the leaf nodes of the subtree whose root is the internal node.
Here, we use the relevance score to represent the similarity between a pair of clusters.The relevance score of a pair of clusters   and   is score(  ,   ) where   and   are the cluster center vectors of   and   , respectively.The cluster center vector computation is shown in the following definition.
Definition 1.One assumes that   = { 1 ,  2 , . . .,   } is a cluster where   ∈   is an item and its corresponding vector is   .The cluster center vector of   is denoted as  0 , then we have To describe the HAC method clearly, we give an example as shown in Figure 3 where items are documents in the collection FS = { 1 ,  2 , . . .,  8 }.The relevance scores between these documents are mapped to point-to-point distances in the 2-dimensional space where a point represents a document as shown in Figure 3(a).If the distance of a pair of documents is closer than the other pair, the relevance score of the former is higher than the latter.Figure 3(b) shows the clustering procedure and the corresponding generated clustering tree.At the beginning, documents in FS are initialized to 8 clusters that are corresponded to 8 leaf nodes in the clustering tree, that is, { 1 }, { 2 }, . . ., { 8 }.Then, round 1 of clustering starts which takes the 8 initial clusters as the current operated clusters.Their nearest pairs are merged into 4 new clusters which are corresponded to the parent nodes of the leaf nodes in the clustering tree, that is, { 4 ,  8 }, { 3 ,  7 }, { 2 ,  6 }, { 1 ,  5 }.After that, the new generated clusters are taken as the current operated clusters and round 2 of clustering starts similarly, so does round 3. The clustering process ends when the new cluster containing all documents of FS is generated which means that the root of the clustering tree is generated.It is noticeable that if the number of current operated clusters is odd, then there must be one of the current operated clusters left for the next round clustering.According to the above example as shown in Figure 3, a binary tree is constructed which has the following lemma.
Lemma 2. Let the tree constructed by the HAC method be  and one assumes there are  leaf nodes in .Thus, we have the following: (1) The height of  is ⌈log 2 ⌉ + 1.
(2) The height of any binary tree with  leaf nodes is not lower than ⌈log 2 ⌉ + 1.
Proof.(1) According to the HAC procedure, there are , ⌈/2⌉, ⌈/2 2 ⌉, . . ., ⌈/2 −1 ⌉ current operated nodes (clusters) in round 1, round 2, round 3, . .., and round  processing, respectively.The construction of  is finished when there is only one current operated node left which is indeed the root node; that is, ⌈/2 +1 ⌉ = 1.Thus, we have The relevance scores between documents described by the distances between them in the 2-dimensional space  = ⌈log 2 ⌉ + 1. Obviously, the height of  is equal to the number of clustering rounds.Therefore, we have that the height of  is ⌈log 2 ⌉ + 1 and Lemma 2(1) holds.
(2) Then, we give the proof by contradiction for Lemma 2(2).We assume that there is a binary tree   , having the same number of leaf nodes with , the height of which is smaller than ⌈log 2 ⌉+1.Let us assume that the height of   is ⌈log 2 ⌉.We all know that a binary tree, whose height is ℎ, has at most 2 ℎ − 1 nodes and the 0-degree nodes (which are leaf nodes) are 1 more than the 2-degree nodes (which are internal nodes having both left and right child nodes).Hence, we can get two observations: (1)   has at most 2 ⌈log 2 ⌉ − 1 nodes.Because It means that   has less than 2 − 1 nodes.(2) According to the given assumption, we known that   has  0-degree nodes, thus it has −1 2-degree nodes.A binary tree only has three kinds of nodes, 2-degree, 1-degree, and 0-degree nodes.Because the number of 1-degree nodes is at least 0,   has no less than 2 − 1 nodes.Obviously, a contradiction appears between the above two observations.Therefore, the height of any binary tree with  leaf nodes is not lower than ⌈log 2 ⌉+1.Lemma 2(2) holds.
Lemma 2 indicates that the tree constructed by the HAC method is the lowest of the binary trees which have the same number of leaf nodes.Thus, we can have that the time complexity from the root to any leaf node is no more than ⌈log 2 ⌉ + 1.

Index Construction Based on HAC.
In this section, we propose an index structure which is based on the hierarchical agglomerative clustering method.We name this index as HAC-tree.The construction algorithm of HAC-tree is also given.
The extract maximum vector of VS is denoted as  max =   → max{ 1 ,  2 , . . .,   } and Definition 4 (HAC-tree node structure).Node  of HAC-tree is a five-element tuple ⟨VM, PL, PR, FD, sig⟩, where .VM is the pruning vector, .PL and .PR, respectively, point to the left and right child nodes of , .FD stores the unique identifier of a document, and .sig stores a digest of the .FD document.
In addition, .VC is the cluster center vector of cluster   whose items are the documents represented by the leaf nodes of the subtree with node  as its root, and . is the number of documents of   .It is noticeable that .VC and . are only used for HAC-tree construction but not needed to store in HAC-tree.
According to the types of node , we give the detailed description of the HAC-tree node as follows: (1) If  is a leaf node, .PL = .PR = 0, .FD stores an identifier of a document, .VM and .VC both store the vector of .FD document, . = 1, and .sig stores a digest of .FD document which is used for search results verification and illustrated in Section 5.4. ( Security and Communication Networks Definition 5 (pruning vector).For node  in HAC-tree, we name .VM as the pruning vector of  which will be used to filter the noncandidate subtree and hence improve the efficiency of the multikeyword text search.
According to Definition 4, each node  in HAC-tree has two kinds of vectors.One is the pruning vector .VM, which is stored in HAC-tree.It can be utilized to filter the noncandidate subtree for search efficiency improving.The other is the cluster center vector .VC, which is not actually stored in HAC-tree but only used for HAC-tree construction.
Based on the definition of HAC-tree node structure and the hierarchical agglomerative clustering, we give the HACtree construction algorithm as in Algorithm 1.
According to the construction process of HAC-tree, we have the following lemmas.

Lemma 6. Assume that 𝑢 ∈ 𝐼 is an internal node where 𝑢.VM[𝑖] ̸
= 0 where  ∈ {1, 2, . . ., } and the subtree with the root  is denoted as   .Then one has the following: (1) There exists at least one leaf node V ∈   satisfying that the keyword   appears in the corresponding document of V.
( Proof.For any node V ∈   , there must be a leaf node  ∈   satisfying that V is in the path from  to .According to Lemma 7, we have score(V.VM, ) ≤ score(.VM, ).If score(.VM, ) <  is given, then score(V.VM, ) <  must hold.Therefore, Lemma 8 is true.
For a given query whose vector is , the goal of the multikeyword text search is to find top- documents which are most relevant to the query.We assume that  is a list storing candidate top- result documents and  is the minimum relevance score between the documents of  and the query .For any node  ∈  which is the root of subtree   , if score(.VM, ) < , then we have ∀V ∈   , score(V.VM, ) <  according to Lemma 8.It means that all the leaf nodes of   must not be the candidate result documents.Obviously, it is unnecessary to search   when we perform the query  on it.Hence,   can be directly pruned without further searching which improves the search efficiency.Based on this idea, we propose the noncandidate pruning depth-first search algorithm (NCP-DFS) as in Algorithm 2.
In Algorithm 2, (, ) is to obtain the minimum relevance score between the documents corresponding to the leaf nodes in  and the query .When Algorithm 2 is finished, the corresponding documents of leaf nodes in  are the final search result documents.
According to the relevance score definition in Section 3, the time complexity of a relevance score computation between a document and a query is ().Lemma 2 shows that the height of the HAC-tree is ⌈log 2 ⌉ + 1.The number of leaf nodes is denoted as , whose corresponding documents contain one or more interested keywords in the query.Then we have that the time complexity of NCP-DFS is ( ⋅  ⋅ log ).It is noticeable that the real search time is less than  ⋅  ⋅ log  distinctly.The reasons are as follows: (1) due to subtree pruning in advance, many leaf nodes will not be accessed even though one or more keywords are contained.(2) The accessing paths of some different leaf nodes in the same subtree share the mutual traversed parts.
(3) The documents with higher relevance scores are clustered Input:  is the root of the current searched sub-tree,  is the vector of a query and k is the number of requested documents.

Variables:
,   -a local variable for tree node; -a global variable for storing a relevance score threshold which is initialized that  = 0; -a list variable for storing the leaf node containing candidate result documents which are initialized that  = 0.
When NCP-DFS is finished,  contains the leaf nodes which stores the top- result documents.Procedures: (1)  together and more subtrees will be filtered which reduces the accessed leaf nodes furtherly.In addition, we can easily have that NCP-DFS is able to find the accurate search results.

Search Result Verification.
The network instability or potential hardware/software errors may result in tampering or even loss of documents, which could make the search results incomplete and/or incorrect.Therefore, in each leaf node of HAC-tree, we embed a digest for the corresponding document.Once a query is performed in CS, digests of the result documents are used to construct a verification object (VO) for DU.When DU receives the returned search result data from CS, DU obtains the result documents and verifies their completeness and correctness by using the verification object.
Definition 9 (document digest).One assumes that  is a leaf node of HAC-tree, whose corresponding document is   .The attribute .sig is a document digest of   and where (⋅) is a Hash Message Authentication Coding (HMAC), for example, HMAC-SHA1,   is a key only shared by DO and DU, and ‖ denotes concatenation.
Definition 10 (verification object).One assumes that the set of digests of the documents in a search result is denoted as .The verification object (VO) of the search results is constructed as follows: The completeness and correctness of the search results are verified by Algorithm 3.
Algorithm 3 indicates that, under the honest-but-curious model, if a search result is useless because one or more documents of the search result are dropped or falsified for some reasons, DU will detect it through the search result verifications.
The aforementioned works discuss the HAC-tree construction and the corresponding multikeyword text search algorithm NCP-DFS.However, these are based on plaintext.In the next section, we will use secure inner product operation to encrypt HAC-tree and give two kinds of multikeyword text search schemes over encrypted cloud data.

Secure Multikeyword Text Search Schemes
In this section, two secure search schemes, based on the framework of MUSE (as shown in Section 4.5), are given to resist the known ciphertext threat model and known background threat model, respectively.One is the basic multikeyword text search over encrypted (BMUSE) cloud data and Input:  is a set of the result documents, VO is the verification object from CS and  is the number requested documents Output: Whether  is correct and complete Variables:  is a set variable for storing document digests Procedures: (1) RETURN FALSE; // is incorrect (9) END IF (10) END IF (11) RETURN TRUE; // is complete and correct Algorithm 3: (, VO, ).
the other is enhanced secure multikeyword text search over encrypted (EMUSE) cloud data.
6.1.BMUSE Scheme.We use the BuildHACTree algorithm, NCP-DFS algorithm, VerifyResult algorithm, and secure inner product operation to construct the BMUSE scheme whose details are shown as follows.
(1 () ).In this initialization phase, DO generates the secure key SK, which is a four-element tuple as (,  1 ,  2 ,   ) where  is an -bit random vector;  1 and  2 are two random × invertible matrices;   is a key for symmetric encryption.SK is only shared between DO and DU while CS has no idea of it.
(, ).According to BuildHACTree algorithm, for each vector .VM stored in the node u, the splitting procedure is applied, and it is split into two random vectors as {   ,    } with  as a splitting indicator.The split rule is shown as The encrypted vectors of .VM are generated as {  1    ,    2    }.For each node in HAC-tree, u.VM is replaced with {  1    ,   2    }; then encrypted tree index Ĩ is generated.
(  , , ).Assume that   = { 1 ,  2 , . . .,   } is a set of interested keywords.The corresponding vector of   is generated as , each dimension of which is computed as (14).And then  is normalized.
Subsequently,  is split into two random vectors as {  ,   } with the splitting indicator , and the split rule is shown as (15).Then the encrypted vectors of  are generated as { −1 1   ,  −1 2   }.After that, DU transmits the trapdoor   to CS, which consists of { −1 1   ,  −1 2   } and .
ℎ( Ĩ,   , ).CS executes NCP-DFS algorithm to obtain the search result  and construct a verification object VO.Then CS returns  together with VO to DU.The relevance score between the encrypted vector of .VM and the encrypted vector of  is calculated by (16).Note that the relevance score calculated from encrypted vectors is equal to that of unencrypted vectors as follows: (, , ).DU utilizes the key   to decrypt the search result and verifies it through the VerifyResult algorithm.

EMUSE Scheme.
According to [3,7], BMUSE may leak the privacy of keywords.The cloud server is able to link the same search requests according to the same visited path and the same relevance scores.So CS may find some highfrequency keywords and deduce whether a certain keyword exists in a document by analyzing the correlation of trapdoors.In consideration of this, it is a practical and effective method to extend dimension by adding some phantom terms into document vectors and query vectors to break such equality, and then the keyword privacy and trapdoor unlinkability will be protected.Following the above idea, we propose the enhanced MUSE which is named EMUSE.
(, ).Through adding the phantom terms, we extend .VM from  dimension to ( + ) dimension.Each element of phantom terms is filled by a random value   ,   ∼ (  − ,   + ).Then we adopt the same vector encryption method as BMUSE to encrypt .VM for every vector in the HAC-tree.
(  , , ).We extend  from  dimension to ( + ) dimension to generate the query vector.A part of elements in the extended dimensions of  are set to 1 and the left are set to 0.
According to the detailed steps of the above schemes, we discuss the deployment of our schemes.Obviously, DO is the manager for deploying the schemes.DO generated security information SK firstly which is utilized to encrypt the documents and generate the corresponding secure index.After that the encrypted documents and index are outsourced in CS.To process secure multikeyword search, the search algorithm on the basis of the outsourced data should also be deployed in CS by DO.Meanwhile, the trapdoor generating algorithm designed by DO is granted to the authorized DU together with SK.When the authorized DU performs a search, a trapdoor is firstly generated and submitted to CS.After that, CS performs the secure multikeyword search algorithm designed by DO and returns the corresponding encrypted documents and verification object to DU.And DU applies the decryption to get the plaintext search result.

Privacy-Preserving Analysis.
In this section, we analyze the privacy-preserving property of BMUSE and EMUSE, which depends on the security of secure inner production firstly proposed in the secure -nearest neighbor (kNN) technique [13].All vectors in HAC-tree are encrypted by random invertible matrices and all documents in FS are encrypted by symmetric encryption.According to [3], CS is impossible to deduce the concrete random matrices on the basis of grasping existing data, including the encrypted documents, the encrypted HAC-tree, and the trapdoor of queries.That is to say, the keys for HAC-tree encryption are secure.Besides, the other security information in SK, such as the symmetric encryption key and splitting indicator, is only shared between DO and DU but private to CS.Thus, BMUSE can protect the confidentiality of documents, index, and queries.Therefore, BMUSE can resist the attack of known ciphertext threats.However, if the given interested multikeyword are the same, the corresponding generated trapdoors and the access paths are the same too.BMUSE cannot resist the attack of known background threats, which means that BMUSE may leak the trapdoor unlinkability [7] and keyword privacy.
In EMUSE, we add phantom terms on the basis of BMUSE, which are equivalent to adding random values to the relevance scores between documents and queries.The security is enhanced.Therefore, EMUSE is able to withstand known ciphertext threats.That is to say, EMUSE satisfies all the privacy-preserving requirements described in Section 4. 4.
Because of adding a random value  to the relevance score between a document and a query, the accuracy of the search result slightly decreases but the privacy preservation is enhanced.Therefore, we need to control  to get the balance of the accuracy and security.Practically, we set  = 0 and balance the accuracy and security by adjusting the standard deviation .

Completeness Analysis.
In the HAC-tree leaf node, .sig is document digest of document   represented by .FD constructed by the key of   and the context of   based on HMAC.Both the avalanche effect [37] and one-way irreversibility guarantee the cryptographic strength of the HMAC, thus ensuring the safety of .sig.
Assume that  = { 1 ,  2 , . . .,   } and vo are returned from CS where  is the result document set and vo is the verification object.After receiving the search result, DU recalculates the document digests as sig 1 , sig 2 , . . ., sig  and reconstructs the verification object, denoted as vo  .If the content of   is tampered into    , then vo  ̸ = vo will happen.Besides, for each document digest, it is constructed by the key of   and the plaintext of   .Both of them are private to CS.Hence, it is impossible for CS to falsify a correct document digest successfully without being detected.And MUSE is capable of supporting the search result completeness verification.

Performance Evaluation
In this section, we evaluate the performance of our proposed scheme MUSE and compare it with the schemes presented in [7,8] which are denoted as DMRS and MRSE-HCI, respectively.We implement MUSE, DMRS, and MRSE-HCI and perform evaluations on the precision (Δ  ) and search time (Δ  ) on the real data set of NSF Research Award Abstracts provided by UCI [38].We use BMUSE, BDMRS, and BMRSE-HCI to represent the basic schemes of MUSE, DMRS, and BMRSE-HCI, respectively.Meanwhile, we use EMUSE, EDMRS, and EMRSE-HCI to represent the enhanced schemes of MUSE, DMRS, and BMRSE-HCI, respectively.
The real dataset includes about 129000 abstracts, from which we random choose 20000 abstracts as our experimental data and extract about 10000 keywords.The experimental hardware environment is Intel Xeon E3-1225v, 3.3 GHz 4 core CPU, 16 G memory, and 1 T hard disk; and  software environment is CentOS6.4server operating system and Eclipse development platform.Default parameters are summarized in Table 1.Parameter  denotes the number of keywords of interest.TH donates the maximum number of cluster size in MRSE-HCI.

Search Precision Evaluation.
Due to adding phantom terms in EMUSE, the relevance score between a document and a query is affected by the random value ,  ∼ (0,  2 ).In order to evaluate the impact of  on the accuracy of the search results, we adopt the definition of precision in [3] which is defined as Δ  =   /, where   is the number of the real top- result documents that are returned by CS.
Figure 5 shows that the precisions fluctuate within a small scope when  = 0.02,  = 0.03, and  = 0.05, respectively.Note that the smaller the standard deviation  is, the higher the precision becomes.In the experiment, when  is 0.02, 0.03, and 0.05, the value of Δ  is 93.37%, 89.39%, and 81.62%, respectively.The reason is that when  grows, the impact of  on relevance score between a document and a query becomes greater.In our proposed scheme, DU can balance the precision and security by adjusting the standard deviation .The default value of  is 0.02 in the experiment.
As shown in Figures 6(a) and 6(b), for MUSE, DMRS, and MRSE-HCI, Δ  does not change significantly, when  increases from 20 to 200.However, no matter with or without adding phantom terms, the MUSE's Δ  is significantly larger than MRSE-HCI and approximately equal to DMRS, which means that MUSE outperforms MRSE-HCI in terms of accuracy.Δ  of BMUSE is 100% which is the same as BDMRS because both BMUSE and BDMRS perform the exact calculation of document vectors and query vectors.The precision of BMRSE-HCI is about 81.71% which is obviously lower than the others.The reason is that BMRSE-HCI clusters the relevant documents into the same cluster by dynamic -means during index construction, and then the search involves only the most relevant subclusters, which will cause some more relevant documents to be ignored, and inaccurate search results are generated.Due to adding phantom terms, the EMUSE's Δ  is no longer 100% and about 93.37%.Compared with EMRSE-HCI, the EMUSE's Δ  is still 16.52% higher.The reason is the same as the basic scheme.In conclusion, the experimental result demonstrates that MUSE can achieve higher accuracy than MRSE-HCI and the similar accuracy with DMRS both in the basic and in enhanced scheme.

Search Time Evaluation.
For the convenience of description, we use MRSE-HCI-100 and MRSE-HCI-300 to represent the maximum numbers of cluster size TH = 100 and TH = 300 in MRSE-HCI, respectively.In this section, we evaluate and analysis the impact of , ,  on Δ  in the basic and enhanced scheme.
(1) The Impact of .As shown in Figures 7(a) and 7(b), the search time Δ  of MUSE is obviously lower than DMRS in both basic and enhanced scheme.Figures 7(a) and 7(b) both reveal that Δ  in MUSE and DMRS increase with  growing up.In addition, Δ  of MUSE is about half of DMRS for the same dataset.The reason is that the documents with high relevance are clustered in HAC-tree.HAC-tree is helpful to maintain and utilize the category relationship to perform efficient and accurate text search on encrypted documents.This feature can ensure the efficiency of NCP-DFS algorithm.However, DMRS does not use the relationship between documents, so it is obviously slower than MUSE.
Besides, as shown in Figures 7(a) and 7(b), Δ  of MRSE-HCI grows very slowly as  grows.For the basic and enhanced schemes of MRSE-HCI, the value of TH has great influence on search efficiency.When TH is 100, Δ  of MRSE-HCI is lower than others.And when TH is 300, Δ  of MUSE is firstly lower than MRSE-HCI but the former becomes larger than the later as  grows.The improvement of efficiency in MRSE-HCI is at the expense of sacrificing the search result precision as shown in Figures 6(a In addition, we have that the plot of EMUSE is very likely to the plot of BMUSE, and the search time of EMUSE is slightly higher than BMUSE, about 5.67%.It indicates that   there is just a little impact on the search time by adding phantom terms into vectors. (2) The Impact of .As shown in Figures 8(a) and 8(b), no matter before or after adding phantom terms, Δ  of all schemes increase as  grows.The reason is that more result documents will be determined as  grows, which consumes more time.The increments of Δ  in MUSE and DMRS are more obvious than MRSE-HCI.The search times of BMUSE and EMUSE are about 47.71% and 40.30% less than BDMRS and EDMRS, respectively.While Δ  of MRSE-HCI is less than the other schemes.But the better performance   in search efficiency of MRSE-HCI sacrifices the search result accuracy, and the reduction of accuracy may not satisfy users' expectations.
(3) The Impact of .According to Figure 9, Δ  of MUSE, DMRS, and MRSE-HCI have little changes no matter with or without adding phantom terms.This means that the search times of MUSE, DMRS, and MRSE-HCI are affected slightly by the number of keywords of interest.As shown in Figure 9(a), in terms of efficiency, BMUSE is better than BDMRS and saves about 51.12% in Δ  .It also reveals that BMUSE is superior to BMRSE-HCI-300 and inferior to Security and Communication Networks BMRSE-HCI-100.BMUSE puts accuracy as a prerequisite to obtaining efficiency as much as possible.In contrast, BMRSE-HCI-100 improves the efficiency of result retrieval by the cost of reducing accuracy.In Figure 9(b), Δ  of EMUSE is almost a half of EDMRS but during between EMRSE-HCI-100 and EMRSE-HCI-300.The reason is the same as the basic scheme.
According to (1), (2), and (3), the experimental results demonstrate that MUSE can achieve higher search efficiency than DMRS.While MUSE may be less efficient than MRSE-HCI depending on the setting of the maximum number of cluster size parameter in MRSE-HCI.
Compared to DMRS and MRSE-HCI on a real dataset, the experimental results show that (i) MUSE outperforms DMRS in terms of efficiency and achieves the similar accuracy with DMRS; and (ii) MUSE consistently gives more accurate results than MRSE-HCI while it may be less efficient than MRSE-HCI, depending on the choices of parameter values in MRSE-HCI.

Conclusion
The multikeyword text search over encrypted data is a common problem in cloud computing services.There are urgent requirements in such important fields as e-mail, electronic medical records, and bank transaction information, which is a hot issue in cloud computing service.It is challenge to ensure the efficiency of the search under the premise of ensuring the accuracy of the search results.In this paper, we propose an efficient and accurate verifiable privacypreserving multikeyword text search over encrypted cloud data based on hierarchical agglomerative clustering called MUSE.According to hierarchical agglomerative clustering, we construct a binary tree structure as the index, which is named HAC-tree.Based on HAC-tree, we provide a noncandidate pruning depth-first search algorithm to improve search efficiency.Besides, we utilize secure inner product algorithm against two threat models.The experimental results show that the scheme proposed in this paper has better comprehensive performance in terms of efficiency and accuracy compared with the existing methods.

Figure 2 :
Figure 2: Framework of the secure multikeyword search.

Figure 5 :
Figure 5: The impact of  on precision by choosing different standard deviation .
Top- search with adding phantom terms

Figure 6 :
Figure 6: The impact of  on the precision of search results.

Figure 7 :
Figure 7: The impact of  on search efficiency.

Figure 8 :
Figure 8: The impact of  on search efficiency.

Figure 9 :
Figure 9: The impact of  on search efficiency.
(1)er the vector space model, each document   is denoted as a -dimensional vector   .Any element  []in   stores the normalized TF value of the keyword   whose calculation formula is shown in(1).Similar to documents, the interested keywords   are also denoted as a -dimensional vector, named the query vector , whose element [] stores the normalized IDF value of the keyword   in   .The calculation formula of [] is shown in ,  is the TF value of   in   , TF ,  = (1+ln  ,  )/ |  |, IDF   is IDF value of   in FS, IDF   = ln(1 + /   ),  ,  is the frequency of   that appears in   , and    is the number of documents containing   in FS.