A Multibranch Search Tree-Based Multi-Keyword Ranked Search Scheme over Encrypted Cloud Data

Out of privacy concerns, cloud service users choose to encrypt their personal data before outsourcing it to the cloud. However, it is difficult to achieve efficient search over encrypted cloud data, so designing an efficient and accurate search scheme over large-scale encrypted cloud data remains a challenge. In this paper, we integrate the bisecting k-means algorithm with a multibranch tree structure and propose the α-filtering tree search scheme based on bisecting k-means clusters. The novel index tree is built bottom-up, and a greedy depth-first algorithm filters out nonrelevant document clusters by calculating the relevance score between the filtering vector and the query vector. The α-filtering tree improves efficiency without loss of search accuracy. Experiments on a real-world dataset demonstrate the effectiveness of our scheme.


Introduction
Cloud computing is a new model in the IT enterprise that offers high-quality computation, storage, and application capacity. Cloud customers choose to outsource their local data and computation to the cloud server to minimize the data maintenance cost. Thus, protecting users' privacy while achieving efficient and precise data retrieval from the cloud server has become the focus of recent works. The traditional way to protect data privacy is to encrypt the original data. However, encryption makes data utilization very challenging. The search schemes based on ciphertext [1][2][3][4][5][6][7] can guarantee data privacy, but their search algorithms have high time and space complexity, which makes them unsuitable for cloud data retrieval. To solve this problem, researchers proposed a series of searchable encryption schemes [8][9][10][11][12][13] based on the theory of cryptography. These encryption schemes either do not produce high-accuracy retrieval results [8][9][10][12] or incur a large time and space overhead [8][11]. Therefore, it is necessary to design an efficient and usable search scheme.
In this paper, we propose an α-filtering tree search scheme based on bisecting k-means clusters, which achieves efficient multi-keyword ranked search over encrypted cloud data. We use the vector space model and the TF-IDF model to build the keyword dictionary and transform the documents and keywords into "points" in a multidimensional space that can be described by vectors, and then we use the secure inner product to encrypt the document vectors and query vectors. The relevance scores between the document vectors and query vectors are used to obtain the top-k most relevant documents.
Our paper's main contributions are summarized as follows:
(i) We integrate the bisecting k-means algorithm and a multibranch tree structure, where the bisecting k-means algorithm is used to improve the cluster accuracy, and we propose an α-filtering tree search scheme based on the bisecting k-means clusters.
(ii) We propose a greedy depth-first algorithm to search the α-filtering tree, which improves the multi-keyword search efficiency. By adopting the secure inner product encryption scheme, we achieve privacy-preserving ranked search on the encrypted α-filtering index tree.
(iii) We perform experiments on a real-world dataset and compare our scheme with existing schemes in terms of retrieval efficiency and index storage usage. The results show that our scheme is superior in search efficiency and storage usage.
The rest of the paper is organized as follows: Section 2 introduces the related work, Section 3 introduces the main background knowledge, and Section 4 gives a brief introduction to our system model, threat model, and design goals. The constructions of the α-filtering tree and the search algorithm are presented in Section 5. Sections 6 and 7 give the experiment results and their analysis. Finally, the conclusion is given in Section 8.

Related Work
Searchable encryption schemes implement keyword searches over encrypted outsourced data, which allows users to store their personal data on the cloud server without privacy concerns. Recently, an increasing number of scholars have conducted research in this area. We discuss the related work in terms of the performance and functionality of searchable encryption schemes.

Single-Keyword Searchable Encryption.
Song et al. [14] first proposed a symmetric encryption search scheme in which each keyword in the document set is encrypted separately and the entire dataset is searched by sequential scanning. Thus, the search time of the scheme is linear in the overall size of the document set. Goh [15] proposed a searchable encryption scheme based on the Bloom filter. The scheme achieves search efficiency in the sense that the calculation overhead is not related to the number of documents in the dataset. However, because the Bloom filter has a nonzero false positive probability, the cloud server may return documents that do not contain the search keywords. The scheme of Chang and Mitzenmacher [16] used two indexes. The first index stores and manages a dictionary premade by the user. The second requires twice the interactions between the user and the cloud server, which affects the user experience, but it can achieve the same search efficiency as Goh [15]. Curtmola et al. [17] adopted two novel search schemes, SSE-1 and SSE-2. SSE-1 is used to prevent chosen-keyword attacks (CKA1), and SSE-2 defends against adaptive chosen-keyword attacks (CKA2). The search time cost of their schemes is proportional to the number of keywords retrieved. Boneh et al. [18] adopted a searchable encryption structure that allows anyone to store data using the public key, but their scheme requires a large amount of calculation.

Multi-Keyword Search Schemes.
Multi-keyword searchable encryption allows the user to submit multiple search keywords to retrieve the most relevant documents. Such schemes can be further classified into ranked search and traditional search. In traditional search, most schemes are either conjunctive keyword search, which returns all the documents containing the search keywords, or conjunctive-subset keyword search, which returns the documents containing a subset of the keywords. However, traditional search does not support ranking of the results. Cao et al. [8] first achieved a privacy-preserving multi-keyword ranked search scheme. In their scheme, documents and search keywords are described by dictionary-scale vectors. The scheme uses coordinate matching to rank the documents. Since the weights of different keywords in documents are not considered, the retrieval result obtained by the scheme lacks accuracy, and the search time of the scheme is linear in the scale of the dataset. Sun et al. [9] proposed a novel multi-keyword ranked scheme; they used the TF-IDF vector space model and cosine distance measurement to build an index tree structure. The experiment shows that their scheme is more efficient than linear search but lacks accuracy. Orencik et al. [10] adopted Locality-Sensitive Hashing to cluster similar documents, but their ranked search result is also not accurate. Xia et al. [11] adopted the vector space model and the KBB tree to build a dynamic multi-keyword ranked search scheme, which obtains the ranked result more precisely. However, as the scale of the document set increases, the index tree space cost becomes large, and the pruning effect of the search algorithm is also reduced, resulting in a decrease in search efficiency.
To enhance the usability and functionality of searchable encryption, many schemes that support fuzzy keyword search [19][20][21][22][23], conjunctive keyword search [3][24][25][26], and similarity search [27][28][29][30] have also been presented. A dynamic scheme can support updates on the dataset, which largely enhances the usability of searchable encryption. The first dynamic searchable encryption scheme was proposed by van Liesdonk et al. [31], and it supports a limited number of updates. After their work, many dynamic searchable encryption schemes were proposed [32][33][34][35]. A verifiable scheme can check the integrity of search results when the cloud server is not honest. Verifiable searches are studied in [26,36,37]. Some research works also extend searchable encryption to other data types such as multimedia data [38,39].
In Xia et al.'s work [11], an efficient search index tree is presented to obtain the search result. However, as the scale of the document set increases, the index tree space cost becomes large and the pruning effect of the search algorithm is also reduced, resulting in a decrease in search efficiency. Thus, we propose a multibranch index tree to overcome this problem. By applying a clustering algorithm to the document set, we can further increase the search efficiency. Moreover, the multibranch tree also reduces the space cost of the index tree.

Vector Space Model.
Among many information retrieval models, the vector space model is the most popular method of relevance measurement, and we adopt the TF-IDF model for feature extraction. It is widely used in plaintext multi-keyword retrieval. TF (term frequency) refers to the word frequency, that is, the number of occurrences of the keyword w in the document f divided by the total number of words |f| contained in the document f. IDF (inverse document frequency) is derived from the number of documents divided by the number of documents containing the keyword. The keyword dictionary W is first generated by filtering the stop words from all the words contained in the document set D. Then, the document vector $F_V$ and the query vector $V_Q$ are generated according to the keyword dictionary W. The dimension of $F_V$ and $V_Q$ is equal to the scale of the keyword dictionary; each dimension represents a corresponding keyword $w_i$. The value of each dimension is the normalized TF value in $F_V$ and the normalized IDF value in $V_Q$. The TF value and the IDF value of the keyword $w_i$ are obtained by normalizing the raw values (equation (1)), where $IDF'_w = \ln(1 + N/N_w)$ and $TF'_{f,w} = N_{f,w}/|f|$. Here, N is the number of documents, $N_w$ is the number of documents containing the keyword w, and $N_{f,w}$ is the number of occurrences of w in f.
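The vector construction described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function names are hypothetical, and the L2 normalization is an assumption, since the text only states that the TF and IDF values are normalized. Documents are assumed to be non-empty token lists.

```python
import math
from collections import Counter


def l2_normalize(values):
    """Scale a vector to unit length (assumed normalization; zero vectors stay zero)."""
    norm = math.sqrt(sum(v * v for v in values))
    return [v / norm if norm else 0.0 for v in values]


def raw_idf(docs, dictionary):
    """IDF'_w = ln(1 + N / N_w), with N_w the number of documents containing w."""
    n_docs = len(docs)
    idf = {}
    for w in dictionary:
        n_w = sum(1 for d in docs if w in d)
        idf[w] = math.log(1 + n_docs / n_w) if n_w else 0.0
    return idf


def document_vector(doc, dictionary):
    """Normalized TF vector, with raw TF'_{f,w} = N_{f,w} / |f|."""
    counts = Counter(doc)
    return l2_normalize([counts[w] / len(doc) for w in dictionary])


def query_vector(keywords, dictionary, idf):
    """Normalized IDF vector; dimensions of keywords outside the query are 0."""
    return l2_normalize([idf[w] if w in keywords else 0.0 for w in dictionary])


def score(fv, vq):
    """Inner product used as the relevance score."""
    return sum(a * b for a, b in zip(fv, vq))


# Toy document set (illustrative data only).
docs = [["cloud", "search", "cloud"], ["tree", "search"], ["cloud", "tree", "index"]]
dictionary = sorted({w for d in docs for w in d})
idf = raw_idf(docs, dictionary)
vq = query_vector({"cloud"}, dictionary, idf)
ranking = sorted(
    range(len(docs)),
    key=lambda i: score(document_vector(docs[i], dictionary), vq),
    reverse=True,
)
```

In this toy run, the document in which "cloud" makes up the largest share of the words is ranked first, and the document that does not contain it is ranked last.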

Relevance Measurement.
The inner product is computed between two equal-length vectors, and the relevance between the two vectors is quantified by the inner product score. The larger the score, the higher the relevance between the two vectors. The relevance score is calculated as follows:
$$\mathrm{Score}(F_V, V_Q) = F_V \cdot V_Q = \sum_{i=1}^{n} F_V[i] \times V_Q[i]. \quad (2)$$
We make the following remarks about equation (2): (i) If $F_V$ is a document vector and $V_Q$ is a search vector, $\mathrm{Score}(F_V, V_Q)$ is the relevance score between the document and the search keywords. (ii) If $F_V$ is a filtering vector of an index tree node and $V_Q$ is a search vector, $\mathrm{Score}(F_V, V_Q)$ is the relevance score between the upper bound vector of the documents stored in this node and the search keywords.

Bisecting k-Means Cluster.
In data mining, the bisecting k-means algorithm is a cluster analysis algorithm. In each bisecting step, 2 initial centroids are selected, each point is assigned to the nearest centroid in turn, and the points that are assigned to the same centroid form a cluster. The centroid of each cluster is then updated from the points assigned to that cluster. Assignments and updates are repeated until the clusters no longer change, at which point the clustering step is complete. We use the cosine distance to measure the distance from a point to a centroid, which is defined through the cosine of the angle between the two vectors:
$$\cos(\vec{x}, \vec{y}) = \frac{\vec{x} \cdot \vec{y}}{\|\vec{x}\| \, \|\vec{y}\|}, \quad (3)$$
where $\vec{x}$ is the point's vector, $\vec{y}$ is the centroid's vector, and $\|\vec{x}\|$ and $\|\vec{y}\|$ are the norms of $\vec{x}$ and $\vec{y}$.
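A single bisecting step can be sketched as a 2-means loop driven by cosine similarity. The sketch below is illustrative only: the deterministic seeding (the first point and the point least similar to it) and the fallback that halves the input when one side becomes empty are assumptions made to keep the example reproducible, and the input is assumed to contain at least two vectors.

```python
import math


def cosine(x, y):
    """Cosine of the angle between x and y (0.0 if either vector is zero)."""
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    if nx == 0 or ny == 0:
        return 0.0
    return sum(a * b for a, b in zip(x, y)) / (nx * ny)


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]


def bisect(points, max_iter=100):
    """Split points into two clusters; a point joins the more similar centroid."""
    c1 = points[0]
    c2 = min(points[1:], key=lambda p: cosine(p, c1))  # assumed seeding rule
    assignment = []
    for _ in range(max_iter):
        new_assignment = [0 if cosine(p, c1) >= cosine(p, c2) else 1 for p in points]
        if new_assignment == assignment:
            break  # clusters no longer change
        assignment = new_assignment
        left = [p for p, a in zip(points, assignment) if a == 0]
        right = [p for p, a in zip(points, assignment) if a == 1]
        if not left or not right:
            break
        c1, c2 = centroid(left), centroid(right)
    left = [p for p, a in zip(points, assignment) if a == 0]
    right = [p for p, a in zip(points, assignment) if a == 1]
    if not left or not right:
        mid = len(points) // 2  # fallback for degenerate input (assumption)
        return points[:mid], points[mid:]
    return left, right


points = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
left, right = bisect(points)
```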

Secure Inner Product Operation.
The special matrix encryption proposed in [8] can achieve a privacy-preserving vector inner product. Assuming that p and q are two n-dimensional vectors, the user encrypts them to $\hat{p}$ and $\hat{q}$ by calculating $\hat{p} = M^{T} p$ and $\hat{q} = M^{-1} q$, where M is a random n × n invertible matrix. Therefore, we can get the inner product of the original vectors from the inner product of their encrypted forms alone:
$$\hat{p} \cdot \hat{q} = (M^{T} p)^{T} (M^{-1} q) = p^{T} M M^{-1} q = p \cdot q. \quad (4)$$
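The identity behind equation (4) can be checked numerically. The following sketch uses exact rational arithmetic so that the comparison is not affected by rounding; the helper names and the small integer range used to draw the random matrix are assumptions chosen for illustration, not parameters of the scheme.

```python
import random
from fractions import Fraction


def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def invert(m):
    """Gauss-Jordan inverse over Fractions; raises ValueError if m is singular."""
    n = len(m)
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pv = aug[col][col]
        aug[col] = [x / pv for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def random_invertible(n, rng):
    """Draw random integer matrices until one is invertible."""
    while True:
        m = [[rng.randint(-5, 5) for _ in range(n)] for _ in range(n)]
        try:
            return m, invert(m)
        except ValueError:
            continue


rng = random.Random(7)
M, M_inv = random_invertible(3, rng)
p = [Fraction(1, 2), Fraction(1, 4), Fraction(0)]
q = [Fraction(1), Fraction(0), Fraction(3)]
p_enc = mat_vec(transpose(M), p)   # M^T p
q_enc = mat_vec(M_inv, q)          # M^{-1} q
```

The inner product of `p_enc` and `q_enc` equals that of `p` and `q`, although neither encrypted vector reveals the original one without knowledge of M.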

System Model.
In this paper, there are 3 entities in our system model: the data owner, the data user, and the cloud server, as shown in Figure 1. These three entities collaborate as follows.
The data owner has the local dataset D and wants to outsource it in a secure form to the cloud server while still providing the search service for users. In our scheme, it first generates the searchable index tree I according to D. Then, it uses the secure key to encrypt both D and I into their encrypted forms. After that, it shares the secure key with the data user through access control and outsources the encrypted D and I to the cloud server.
The cloud server provides both a storage service and a search service. It stores the encrypted index tree I and the encrypted document set D. After it receives the search trapdoor $T_Q$ from the data user, it performs the secure search by using the encrypted index tree and returns the search result to the data user. The data user is the one authorized to access the document set. It generates the search trapdoor $T_Q$ from the search keywords Q through the proposed search scheme and sends $T_Q$ as the search request to the cloud server. After receiving the search result, it uses the secure key to decrypt the encrypted documents and obtain the plaintext documents.

Threat Model.
We adopt the same "honest-but-curious" threat model as the current work [8,9,11,40,42]. That is to say, the cloud server follows the user's instructions honestly and precisely, but it could curiously analyze the received data to obtain additional information about the dataset. Two threat models were proposed by Cao et al. [8] and are adopted in our work as follows:
Known Ciphertext Model: the cloud server can access the ciphertext dataset, the encrypted index tree, and the search trapdoor, and thus the cloud server can conduct a ciphertext-only attack.
Known Background Model: in this stronger model, the cloud server has more dataset-related information than in the known ciphertext model. The cloud server can have statistical information about the relation between the search trapdoor and the search result. Then, it could infer or recognize some of the search keywords in the trapdoor by using this additional information.

Design Goals.
To ensure privacy, efficiency, and accuracy in the multi-keyword ranked search over encrypted cloud data, our system design should meet the following requirements:
Search Efficiency: the proposed search scheme should be more efficient than other multi-keyword search schemes.
Search Accuracy: the proposed search scheme should guarantee the accuracy of the search result.
Privacy Preserving: the proposed scheme should ensure document privacy, index privacy, trapdoor privacy, trapdoor unlinkability, and keyword privacy in the search process.

Index and Search Algorithm
In this section, we mainly discuss the index construction method and the search method based on the index tree, and then we give the corresponding algorithms. We first construct a document atom cluster list by using the bisecting k-means algorithm. Then, based on the generated atom cluster list, we build the α-filtering tree and propose a corresponding greedy depth-first search algorithm for multi-keyword ranked search.

Atom Cluster List Generation Algorithm.
Considering the document set D as the input raw cluster, we use the bisecting k-means algorithm to perform top-down bisecting clustering until all the generated subclusters contain no more than μ documents in Algorithms 1 and 2, and thus a binary clustering tree is built as shown in Algorithm 2. Here, μ is the given threshold for clustering. Then, we traverse the leaf clusters in the generated binary clustering tree, and the atom cluster list L is constructed in Algorithm 3, which is used for building the α-filtering index tree.
Definition 1. Atom Cluster. The leaf clusters in the binary tree generated by Algorithm 1 are the atom clusters, where the number of documents in each atom cluster is no more than μ.
We denote the list of the atom clusters generated by Algorithm 1 as $L = \{C_1, C_2, \ldots, C_t\}$. We illustrate the generation process of the atom cluster list L in Algorithms 1-3 by an example. We assume that the document set is $D = \{d_1, d_2, \ldots, d_{15}\}$ and μ = 3. The first round of bisecting clustering is performed on D, and two subclusters are generated as shown in Figure 2. With the same process, the subclusters in the second and third layers are each divided into two clusters in turn, and a subcluster stops clustering when the number of documents it contains is less than or equal to 3. Finally, a binary clustering tree is formed, whose leaf nodes are shown in Figure 2. Then, the algorithm traverses the leaf nodes of the binary clustering tree in in-order, and the atom cluster list $L = \{C_1, C_2, C_3, C_4, C_5, C_6\}$ is generated.
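The top-down generation of the atom cluster list can be sketched as a short recursion. The function `atom_cluster_list` and the stand-in splitter `halve` are hypothetical names; `halve` simply cuts a list in the middle so that the example is deterministic, whereas the scheme itself splits with bisecting k-means. Because the split differs, the sketch produces a different list from the one in Figure 2.

```python
def atom_cluster_list(docs, mu, bisect):
    """Bisect recursively until each cluster has at most mu documents.

    Returns the leaf clusters from left to right, which matches an in-order
    traversal of the leaves of the binary clustering tree.
    Assumes mu >= 1 and that bisect returns two non-empty parts.
    """
    if len(docs) <= mu:
        return [docs]
    left, right = bisect(docs)
    return atom_cluster_list(left, mu, bisect) + atom_cluster_list(right, mu, bisect)


def halve(docs):
    """Stand-in for one bisecting k-means step (assumption for illustration)."""
    mid = len(docs) // 2
    return docs[:mid], docs[mid:]


D = [f"d{i}" for i in range(1, 16)]   # d1 ... d15, as in the example above
L = atom_cluster_list(D, 3, halve)
```

Every cluster in `L` holds at most μ = 3 documents, and concatenating the clusters gives back D, so the atom clusters partition the document set.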

α-Filtering Tree
Definition 3. α-Filtering Tree. A node u in the α-filtering tree is a triple, which is denoted as u = ⟨FV, PL, DC⟩, where u·FV is an n-dimensional filtering vector, u·PL is a child node pointer list which has at most α pointers, and u·DC stores documents when u is a leaf node.
(1) If u is a leaf node, then u·PL = ∅. We give the construction procedures of the α-filtering tree in Algorithm 4.
Algorithm 4 builds the α-filtering tree from the atom cluster list. Tree nodes are created during each round of processing in steps 8-21. The original atom cluster list is treated as the first child node list (CNL). In each round, α nodes are fetched from CNL at a time, and a parent node is created for these nodes and added to the parent node list (PNL). After all the nodes in CNL have been fetched, we have the complete parent node list (PNL) of this round. If there is more than 1 node in PNL, then we move all nodes in PNL to CNL and start the next round. Otherwise, the only node in PNL is the root of the generated index tree.
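The bottom-up construction can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 4: the filtering vector is taken to be the element-wise maximum of the vectors below a node (consistent with its description as an upper bound vector), each atom cluster is represented as a list of (document id, vector) pairs, and α is assumed to be at least 2.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    FV: list                                  # filtering vector
    PL: list = field(default_factory=list)    # child pointers, at most alpha
    DC: list = field(default_factory=list)    # documents, leaf nodes only


def upper_bound(vectors):
    """Element-wise maximum of equal-length vectors (assumed FV rule)."""
    return [max(col) for col in zip(*vectors)]


def build_tree(atom_clusters, alpha):
    """Group alpha nodes at a time under a new parent until one root remains."""
    cnl = [Node(FV=upper_bound([v for _, v in c]), DC=list(c)) for c in atom_clusters]
    while len(cnl) > 1:
        pnl = []
        for i in range(0, len(cnl), alpha):
            children = cnl[i:i + alpha]
            pnl.append(Node(FV=upper_bound([ch.FV for ch in children]), PL=children))
        cnl = pnl   # the parents become the child node list of the next round
    return cnl[0]


def height(node):
    return 1 + max((height(ch) for ch in node.PL), default=0)


# Six single-document atom clusters with toy 2-dimensional vectors.
clusters = [[(f"d{i}", [i, 6 - i])] for i in range(1, 7)]
root = build_tree(clusters, alpha=3)
```

With t = 6 leaves and α = 3, the first round creates 2 parents and the second round creates the root, so the tree has 3 levels.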

Theorem 2.
The height of an α-filtering tree with t leaf nodes is $\lceil \log_{\alpha} t \rceil + 1$.
Proof. We assume that the length of the atom cluster list L is t, that is, the number of leaf nodes of the α-filtering tree is t. According to Algorithm 4, after the 1st, 2nd, ..., x-th rounds of processing, the number of newly generated parent nodes becomes $\lceil t/\alpha \rceil, \lceil t/\alpha^{2} \rceil, \ldots, \lceil t/\alpha^{x} \rceil$. When the number of newly generated parent nodes is 1, the construction of the α-filtering tree is finished, so $\lceil t/\alpha^{x} \rceil = 1$. Then, we deduce $x = \lceil \log_{\alpha} t \rceil$. Since the height of the tree increases by 1 for each round of merging and the initial height of the tree is 1, the height of the α-filtering tree with t leaf nodes is $\lceil \log_{\alpha} t \rceil + 1$.

Definition 4. For a query Q whose vector is $V_Q$ and two nodes u and u', if $\mathrm{Score}(V_Q, u \cdot FV) \geq \mathrm{Score}(V_Q, u' \cdot FV)$, then u has a relevance score with Q that is higher than or equal to that of u', which is denoted as $u \triangleright u'$.

Theorem 3. We assume that u = ⟨FV, PL, DC⟩ is a nonleaf node in the α-filtering tree and u·PL stores g child nodes, i.e., u·PL = {u·PL[1], u·PL[2], ..., u·PL[g]} and 1 ≤ g ≤ α. For a query Q, we have $\forall u' \in u \cdot PL \rightarrow u \triangleright u'$.

Proof. To prove $\forall u' \in u \cdot PL \rightarrow u \triangleright u'$, it suffices to prove $\mathrm{Score}(V_Q, u \cdot FV) \geq \max\{\mathrm{Score}(V_Q, u \cdot PL[1] \cdot FV), \mathrm{Score}(V_Q, u \cdot PL[2] \cdot FV), \ldots, \mathrm{Score}(V_Q, u \cdot PL[g] \cdot FV)\}$. Every element of the n-dimensional filtering vector u·FV is generated by the following equation:
$$u \cdot FV[i] = \max\{u \cdot PL[1] \cdot FV[i], \ldots, u \cdot PL[g] \cdot FV[i]\}, \quad i = 1, \ldots, n.$$
Since the elements of $V_Q$ are nonnegative, $\mathrm{Score}(V_Q, u \cdot FV)$ is not less than the relevance score between any child node's filtering vector and the query vector. Then we have $\forall u' \in u \cdot PL \rightarrow u \triangleright u'$.

Input: The document set, D; the threshold of the maximum number of documents in an atom cluster, μ;
Output: An atom cluster list, L;
(1) Create a root cluster node r which has all documents of D;
(2) GenBiSectingTree (r, μ);
ALGORITHM 1
During the search process, for a given Q, if the relevance score between a subtree's root node filtering vector and the corresponding query vector is not higher than the threshold of the candidate result list, then all its child nodes are noncandidates according to Theorem 3. Thus, we can directly ignore this subtree, which improves the search efficiency. This is the pruning criterion of the greedy depth-first search algorithm. Adopting this idea, we propose the greedy depth-first search algorithm shown in Algorithm 5.
In Figure 3, we construct a 3-filtering index tree example to further illustrate the multi-keyword ranked search algorithm. The index tree is built after the leaf nodes are generated from the atom cluster list. The intermediate nodes are generated based on the leaf nodes. We assume that the query vector is $V_Q = (0.5, 0.5, 0, 0)$ and that the top-3 ranked documents are requested. When the search starts, the algorithm first visits $u_{11}$, $u_{21}$, and $u_{31}$ recursively along the leftmost path and finds that $u_{31}$ is a leaf node which has 3 documents. The algorithm puts all the documents into the result list RL = {$d_1$, $d_2$, $d_3$}, where the relevance scores are 0.3, 0.35, and 0.3, respectively. Then $u_{32}$ is accessed, and the relevance score between its filtering vector and the query vector is 0.2, which is less than 0.3; therefore, RL remains unchanged. After that, $u_{33}$ is accessed with the relevance score 0.35, so $d_9$ and $d_{10}$ are added to RL, replacing $d_1$ and $d_3$. Finally, the algorithm examines the subtree rooted at $u_{22}$ and finds no need to search the remaining subtree. The search algorithm is finished.
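The pruning idea can be sketched with a small depth-first search that keeps the current top-k documents in a min-heap. This is one possible reading rather than a transcription of Algorithm 5: the threshold λ is taken as the k-th best score found so far, the filtering-vector test is applied to leaf nodes as well (as in the worked example above), children are visited in stored order, and k ≥ 1 is assumed. The tree, vectors, and node names below are invented for illustration and do not reproduce Figure 3.

```python
import heapq


class Node:
    def __init__(self, name, FV, PL=None, DC=None):
        self.name = name        # label used only for tracing visited leaves
        self.FV = FV            # filtering vector
        self.PL = PL or []      # child nodes
        self.DC = DC or []      # (doc_id, vector) pairs, leaf nodes only


def score(a, b):
    return sum(x * y for x, y in zip(a, b))


def upper_bound(vectors):
    return [max(col) for col in zip(*vectors)]


def leaf(name, docs):
    return Node(name, upper_bound([v for _, v in docs]), DC=docs)


def search(u, vq, k, heap, trace):
    """Depth-first search; heap is a min-heap of (score, doc_id) of size <= k."""
    lam = heap[0][0] if len(heap) >= k else float("-inf")
    if score(vq, u.FV) <= lam:
        return                  # prune: nothing below u can beat the threshold
    if not u.PL:
        trace.append(u.name)
        for doc_id, vec in u.DC:
            s = score(vq, vec)
            if len(heap) < k:
                heapq.heappush(heap, (s, doc_id))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, doc_id))
    else:
        for child in u.PL:
            search(child, vq, k, heap, trace)


a = leaf("A", [("d1", [3, 0, 0, 1]), ("d2", [2, 2, 0, 0]), ("d3", [0, 3, 1, 0])])
b = leaf("B", [("d4", [1, 0, 5, 0]), ("d5", [0, 2, 0, 4])])
c = leaf("C", [("d6", [4, 1, 0, 0]), ("d7", [3, 3, 0, 0])])
root = Node("root", upper_bound([a.FV, b.FV, c.FV]), PL=[a, b, c])

heap, trace = [], []
search(root, [1, 1, 0, 0], 3, heap, trace)
top = [doc for _, doc in sorted(heap, reverse=True)]
```

In this run, leaf B is skipped because the score of its filtering vector does not exceed the threshold set by the documents already collected from leaf A.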

Effective and Secured Multi-Keyword Ranked Search Scheme
In this section, we construct the secure search scheme by using the secure kNN algorithm [41]. The data owner constructs the index tree from the document set and then uses the secure keys to encrypt the document set and the index tree, respectively. The data user submits a search request to the cloud server by using query keywords. The cloud server performs the search algorithm on the index tree and returns the resulting documents.

Security and Communication Networks
The secure key SK, which is used to encrypt the documents and the index tree, consists of g, S, $M_1$, and $M_2$. Here, g is the secure symmetric encryption key for document encryption, which is shared only with the data user and is kept secret from the cloud server. S is a bit vector for vector splitting; each dimension of S is randomly chosen to be 0 or 1, and the numbers of 0s and 1s should be nearly equal. $M_1$ and $M_2$ are both randomly generated n × n invertible matrices.

BuildIndex (D, SK).
The data owner first performs the index tree construction algorithms discussed in Sections 5.1 and 5.2 to generate the plaintext index tree I on the documents in D. Then, the data owner encrypts the index tree into its encrypted form. Specifically, each document vector and each node's filtering vector is split into two vectors according to the bit vector S. For simplicity, we use V to represent one of these vectors. The data owner then encrypts the two split vectors with the secure matrices $M_1$ and $M_2$. After that, the data owner encrypts the documents in each leaf node's atom cluster with the secure key g, and the encrypted index tree is generated. Finally, the data owner outsources the encrypted index tree to the cloud server.

Input: The root node of an α-filtering tree, r; the query vector of Q, $V_Q$; the number of requested documents, k; the minimum of the relevance scores between documents in RL and Q, λ; the list for storing top-k ranked documents, RL;
Output: RL;
(1) u = r;
(2) if u is a leaf node then
(3)     Add all the documents of u·DC to RL;
(4)     if |RL| > k then
(5)         Set the threshold λ to the minimum of the relevance scores between the candidate documents in RL and $V_Q$;
(6)         Remove from RL the documents whose relevance scores with $V_Q$ are smaller than λ;
(7)     end if
(8) else
(9)     if Score ($V_Q$, u·FV) > λ then
(10)        for each u' in u·PL do
(11)            SearchIndex (u', $V_Q$, k, λ, RL);
(12)        end for
(13)    end if
(14) end if
ALGORITHM 5: SearchIndex (r, $V_Q$, k, λ, RL).

GenTrapdoor (Q, SK).
The data user generates the query vector $V_Q$ according to the query keywords in Q. Then the secure key SK is used to generate the corresponding trapdoor $T_Q$. The generation of $T_Q$ is similar to the encryption procedure of the document vectors. First, $V_Q$ is split into two vectors according to the bit vector S. Then, the data user encrypts the split vectors into the trapdoor $T_Q$. Finally, $T_Q$ is submitted to the cloud server as the search command.
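The splitting and encryption steps of BuildIndex and GenTrapdoor can be sketched as below. The splitting convention is an assumption taken from the commonly used secure kNN construction, because the exact rule is not spelled out in the text: for index vectors the two shares are random where S[i] = 1 and equal where S[i] = 0, and for query vectors the roles are swapped. The fixed 3 × 3 matrices stand in for the randomly generated $M_1$ and $M_2$, and all function names are hypothetical.

```python
import random
from fractions import Fraction


def transpose(m):
    return [list(row) for row in zip(*m)]


def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split(v, S, rng, random_where):
    """Split v into two shares.

    Where S[i] == random_where the shares are random with v1[i] + v2[i] == v[i];
    elsewhere both shares equal v[i].
    """
    v1, v2 = [], []
    for x, bit in zip(v, S):
        if bit == random_where:
            r = Fraction(rng.randint(-100, 100), 10)
            v1.append(r)
            v2.append(x - r)
        else:
            v1.append(x)
            v2.append(x)
    return v1, v2


# Fixed invertible matrices and their inverses (illustrative stand-ins only).
M1 = [[2, 1, 0], [1, 1, 0], [0, 0, 1]]
M1_INV = [[1, -1, 0], [-1, 2, 0], [0, 0, 1]]
M2 = [[1, 0, 1], [0, 1, 0], [1, 0, 2]]
M2_INV = [[2, 0, -1], [0, 1, 0], [-1, 0, 1]]
S = [0, 1, 1]


def encrypt_index_vector(v, rng):
    v1, v2 = split(v, S, rng, random_where=1)
    return mat_vec(transpose(M1), v1), mat_vec(transpose(M2), v2)


def gen_trapdoor(q, rng):
    q1, q2 = split(q, S, rng, random_where=0)
    return mat_vec(M1_INV, q1), mat_vec(M2_INV, q2)


def secure_score(enc_vector, trapdoor):
    return dot(enc_vector[0], trapdoor[0]) + dot(enc_vector[1], trapdoor[1])


rng = random.Random(3)
v = [Fraction(3, 10), Fraction(1, 2), Fraction(0)]
q = [Fraction(1, 2), Fraction(1, 2), Fraction(0)]
enc_v = encrypt_index_vector(v, rng)
t_q = gen_trapdoor(q, rng)
```

Under this convention, the score computed by the server on the encrypted vector and the trapdoor equals the plaintext inner product of `v` and `q`, whatever random shares are drawn.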

SearchIndex (I, T Q , k).
The cloud server receives the trapdoor $T_Q$ and performs the search algorithm on the secure index tree. Then, the cloud server returns the encrypted top-k document result list RL to the data user, who decrypts the encrypted documents, and the search process is finished. As in equation (4), the special matrix encryption makes it possible to obtain the inner product of two vectors from the inner product of their encrypted forms alone. To protect trapdoor unlinkability and keyword privacy under the known background model, we should prevent the server from calculating the exact value of the relevance score between $T_Q$ and $F_V$, which can leak TF distribution information.
Thus, we add some phantom terms [11] to the vectors generated in our scheme to disturb the relevance score calculation, at the cost of a decrease in search accuracy.
In the enhanced scheme, we generate (n + n') × (n + n')-dimensional secure matrices, and the document vectors are extended to n + n' dimensions. Each extended element $F_V[n + i]$ is set to a random number $\beta_i$. Similarly, the query vector is extended to an (n + n')-dimensional vector, and the extended elements are randomly set to 1 or 0. Thus, the relevance score between the query trapdoor and the document vector is equal to $F_V \cdot V_Q + \sum \beta_i$, where the sum runs over the indices i with $V_Q[n + i] = 1$. The randomness of $\beta_i$ can ensure privacy under the known background model.
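The effect of the phantom terms on the relevance score can be illustrated on plaintext vectors. In this sketch the range of the random values $\beta_i$ and the function names are assumptions made for illustration; the matrix encryption is omitted because it preserves the inner product.

```python
import random
from fractions import Fraction


def score(a, b):
    return sum(x * y for x, y in zip(a, b))


def extend_document(fv, n_extra, rng):
    """Append n_extra phantom elements, each a small random value beta_i."""
    betas = [Fraction(rng.randint(-10, 10), 1000) for _ in range(n_extra)]
    return fv + betas


def extend_query(vq, n_extra, rng):
    """Append n_extra phantom elements, each randomly set to 0 or 1."""
    return vq + [rng.randint(0, 1) for _ in range(n_extra)]


rng = random.Random(1)
fv = [Fraction(3, 10), Fraction(1, 2)]
vq = [Fraction(1, 2), Fraction(1, 2)]
fv_ext = extend_document(fv, 4, rng)
vq_ext = extend_query(vq, 4, rng)

# Sum of the beta_i selected by the query's phantom bits.
noise = sum(beta for beta, bit in zip(fv_ext[2:], vq_ext[2:]) if bit == 1)
```

The score of the extended vectors equals the true score plus `noise`, so the server observes a perturbed value, and larger $\beta_i$ values trade ranking accuracy for privacy.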

Security Analysis.
In this paper, we construct the tree-based secure search scheme in the same way as [11,42] to achieve searchable encryption, so the security of our scheme is expected to be the same as that of [11,42]. We give the proof briefly as follows: (i) Document privacy: we use traditional symmetric encryption on documents before outsourcing them to the cloud server. As long as the secure key is protected from the adversary, document privacy is protected in our scheme.
(ii) Index and trapdoor privacy: the document vectors and query vectors store the TF and IDF values of the corresponding keywords and, after being randomly split, are encrypted with the secure matrices generated by secure kNN. The secure matrices are both randomly generated invertible matrices. The adversary cannot calculate the secure matrices from the encrypted vectors alone. Therefore, index and trapdoor privacy is protected in our scheme.
(iii) Trapdoor unlinkability: the query vector is randomly split according to the split vector S for each search, so the trapdoors differ even for the same search request. Thus, trapdoor unlinkability is guaranteed at the level of the trapdoor itself. However, the cloud server can still link identical search requests by inferring the access pattern and the ranked result of the searches. To solve this problem, we can expand the vectors used in our secure scheme by adding phantom dimensions that interfere with the relevance score. With phantom terms, the search results of identical requests could be different. However, the search accuracy can decrease, and the balance between privacy and accuracy is discussed in [11].
(iv) Keyword privacy: index and trapdoor privacy is protected in our scheme, which means that keyword privacy is also protected in the known ciphertext model. In the known background model, the relevance score between the documents and the query vector can leak TF information about the query keywords. If a search request has only one search keyword, or one of the search keywords has a high TF value, the cloud server can easily infer this keyword from its statistical information about the TF distribution of keywords. Similarly, to solve this problem, we add phantom terms to obfuscate the relevance score between the query trapdoor and the document vector. That is to say, the observed TF-IDF value varies across search requests. Thus, the cloud server cannot link the keywords with their TF distribution, and keyword privacy is enhanced.

Performance Analysis
We evaluate the performance of our α-filtering index tree scheme in this section and compare it with Xia et al.'s index tree scheme [11] and Zhu et al.'s HAC-tree [42] under different settings. We use a real-world dataset which has 120000 documents in total and implement our scheme using Java on Windows 10 with an Intel Core i5-6200U @ 2.30 GHz CPU. The default parameter setting is shown in Table 1. k, μ, |Q|, α, and m are the number of requested documents, the document threshold of each atom cluster, the number of search keywords, the maximum number of child nodes per tree node, and the number of documents, respectively. In the enhanced scheme, we add phantom terms to enhance the security of our scheme. The search accuracy and efficiency of the original and enhanced schemes are the same without the phantom terms. We only perform the evaluation on the original scheme for simplicity. The influence of phantom terms is discussed in [11].

Space Usage Evaluation.
In this section, we analyze the space usage of the different schemes from the aspect of the index tree. We only discuss the index tree space usage; therefore, the search parameters are not changed. The space usage of Xia's scheme is the same as that of Zhu's scheme because both are binary trees with the same number of nodes.

Space Usage versus μ.
We change the document threshold μ in each cluster to compare the space usage of the three schemes. Figures 4(a) and 4(b) show the index tree space cost when the number of documents is 20000 and 120000, respectively. The result shows that as the scale of the document set increases, the space usage of the index tree increases significantly. The reason is that more tree nodes are added to the index tree to store more documents. The result also shows that a larger threshold reduces the space usage of the index tree because it reduces the number of nodes in the α-filtering tree.

Space Usage versus α.
We change α of the α-filtering tree to compare the space usage of the three schemes. The result shows that an appropriate setting of α can largely reduce the space usage of the α-filtering tree. When α becomes large, more nodes share the same parent node, so the space usage keeps decreasing but tends to stabilize.

Index Building Time Cost Evaluation.
In this section, we evaluate the time cost of index building. We measure the time cost of the BuildIndex algorithm of our scheme, which is shown in Table 2, given m = 20000. The BuildIndex algorithm in our scheme takes hours, while in Xia's scheme it takes seconds. It should be noted that the keyword extraction and the TF-IDF calculation are the same in all three schemes. The tree construction algorithm also has almost the same time cost because the basic structure of the tree is the same. The main difference in time cost is that our scheme uses the clustering algorithm to further improve the efficiency of the search algorithm. The clustering algorithm can consume a lot of time, which leads to a worse index building time cost.
However, it can be improved by adopting more efficient clustering algorithms such as a distributed clustering algorithm. The longer time cost of index building is affordable because it only needs to be performed once while providing more efficient searches.

Search Time Cost Evaluation.
In this section, we evaluate the search time cost of the different schemes. Each data point in the figures is the result of at least 10 runs.
(1) Time cost versus μ. Figure 6 indicates that our scheme is better than the existing schemes in the search process. The α-filtering tree can improve the search process by accelerating the process of finding the leaf nodes, shortening the height of the tree, and reaching more nodes from an intermediate node. The k-means clustering gathers similar documents closely in leaf nodes, which fills the candidate result list with good candidates early. However, when μ increases, the time cost of our scheme tends to increase as well; the reason is that the number of documents in a leaf node increases, which slows down the relevance calculation in the leaf node. The time cost of Xia's and Zhu's trees increases considerably when the scale of the document set increases; the reason is that both schemes are binary trees, whose height grows faster than that of the α-filtering tree in our scheme.
(2) Time cost versus α. Figure 7 indicates that the time cost of our scheme is lower than that of the existing schemes. As mentioned above, an appropriate setting of α can improve the performance of our scheme. However, when α is too large, the pruning function in an intermediate node requires more calculation, and the pruning effect could be worse because there are fewer subtrees to be pruned.
(3) Time cost versus |Q|. Figure 8 shows that a larger number of search keywords slows down the search process of tree-based index schemes. Overall, however, our scheme outperforms the other schemes thanks to the α-filtering tree.
(4) Time cost versus k. Figure 9 shows that under different settings of k, the time cost of our scheme is lower than that of Xia's and Zhu's schemes. When k increases, the time cost of the tree-based schemes increases slightly. The reason is that the pruning function in the tree index reduces the number of calculations between documents and the query vector.

The Setting of α.
The experiment shows that different settings of α result in different improvements in our scheme. However, it is hard to find an appropriate α for a tree with m nodes. The space usage of the α-filtering tree decreases as α increases. However, the search time cost can increase as α increases, and it is worst when α = m. In that case, the search algorithm iterates over every node in the tree, and the filtering vector in the only nonleaf node cannot help to filter out the noncandidate nodes. The best α-filtering tree should balance width and depth. An α-filtering tree should have a depth of at least three for the filtering vectors to work. The B+ tree [43] is a multibranch tree widely adopted for storing indexes of large data, and the degree of a B+ tree is usually set to the block size divided by the key size [43] in real circumstances, where it stores an index for a much larger dataset than the one used in our experiment. These settings can help to define an appropriate α. The search time complexity of the α-filtering tree is $O(\log_{\alpha} m)$, which means that a tree with less depth has better search efficiency. However, the filtering vectors in a shorter tree will filter fewer nodes than those in a taller tree with more filtering vectors, which results in worse search efficiency. Thus, it is hard to define the best setting of α for different m, and this needs further discussion.

Conclusion
To address the efficiency problem of privacy-preserving multi-keyword ranked search, we propose an α-filtering tree index search scheme based on bisecting k-means clusters. The scheme utilizes the characteristics of a multibranch tree, which greatly reduces the space complexity of the index tree. At the same time, clustering is used to store related documents close to each other in the index tree, which greatly improves the effect of pruning on the index tree and thus improves the search efficiency. On the other hand, since the index tree nodes are stored in the form of clusters and bisecting k-means clustering requires a large amount of time, the variability of the index tree could be limited. The experiment results on the real-world dataset show that, to a certain extent, our scheme can greatly improve the search efficiency of privacy-preserving multi-keyword ranked search and at the same time guarantee the accuracy of the search results.

Data Availability
The text data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.