With the rise of the cloud computing paradigm, searchable encryption has become a focal research topic. Existing schemes are mainly semantic extensions of multiple keywords. However, the semantic information carried by keywords is limited and does not reflect the content of a document well. Moreover, when the original scheme constructs the conceptual graph, it ignores the context information of the topic sentence, which leads to errors in the semantic extension. In this paper, we define and construct a semantic search encryption scheme based on context-based conceptual graphs (ESSEC). We connect the central key attributes in the topic sentence with their context and extend their semantic information, so as to improve retrieval accuracy and semantic relevance. Finally, experiments based on real data show that the scheme is effective and feasible.
In 2000, digital storage accounted for only one quarter of the world's data; the other three quarters were stored in newspapers, books, and other media. By 2020, however, digital information is expected to account for four-fifths of global data and to reach 40ZB, equivalent to 5200GB of data generated per person. Storing such volumes locally is too expensive for users, so to save storage costs they usually choose to upload their data to the cloud. However, public clouds are not always trusted, so data is generally encrypted before being uploaded to cloud servers, which invalidates traditional plaintext search schemes. Thus, how to better protect and utilize user privacy in cloud computing has become a major research issue in mobile cloud computing.
Searchable encryption on the cloud server has become an important field of investigation in recent years. One of the most popular traditional approaches is keyword-based search. The data owner first extracts the corresponding keywords for the data documents, builds the corresponding index, and then outsources the encrypted documents and index to the cloud server. When searching the encrypted data, the cloud server matches the trapdoor against the encrypted index and returns the corresponding data documents to the data user. However, such keyword-based schemes have deficiencies: they cannot reflect the user's search intention or the semantic information of the document.
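The keyword-based workflow described above can be sketched as follows. This is a minimal illustration, not any cited scheme: keywords are mapped to deterministic tokens with a keyed hash, so the cloud can match trapdoors against the index without learning the plaintext keywords. The key name and index layout are assumptions for illustration.

```python
# Minimal keyword-based searchable-encryption sketch (illustrative only):
# a keyed hash turns each keyword into an opaque token; the cloud matches
# trapdoors against an inverted index of tokens.
import hmac
import hashlib

KEY = b"data-owner-secret"  # hypothetical key shared with authorized users

def token(keyword: str) -> str:
    # Deterministic keyword token; the cloud sees only this value.
    return hmac.new(KEY, keyword.lower().encode(), hashlib.sha256).hexdigest()

def build_index(docs: dict) -> dict:
    # docs: {doc_id: [keyword, ...]} -> inverted index over tokens
    index = {}
    for doc_id, keywords in docs.items():
        for kw in keywords:
            index.setdefault(token(kw), []).append(doc_id)
    return index

def search(index: dict, trapdoor: str) -> list:
    # Cloud-side matching: exact token lookup, no semantics involved.
    return index.get(trapdoor, [])

docs = {"d1": ["cloud", "encryption"], "d2": ["graph", "semantics"]}
idx = build_index(docs)
print(search(idx, token("encryption")))  # -> ['d1']
```

The exact-match lookup makes the limitation concrete: a query for "cipher" would return nothing, since no semantic relation between tokens survives encryption.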
In keyword-based encrypted search schemes, the data owner summarizes a document's content into a few keywords, which makes search matching efficient and simple. However, keywords cannot represent the contents of a document well, since they ignore its semantic information. Thus, the search results returned by the cloud server do not always match the requirements of the user's query. Although the keywords-based schemes [
Therefore, to protect the security of user privacy in the cloud environment while improving the relevance of the document information obtained by encrypted search, we propose a searchable encryption scheme that combines local features with context similarity.
In this paper, we propose a semantic search encryption scheme based on conceptual graphs of context (ESSEC). We extract the central content of the whole document, rather than keywords, as the index and then construct the corresponding weighted conceptual graph [ We extend the context-based semantics of the central concept attributes, so that the generated conceptual graph contains the content information of the document and forms a semantic network, which helps the search results satisfy users' retrieval needs as much as possible. Experiments based on real datasets have been implemented, and the experimental comparisons make clear that the two schemes put forward in this paper are effective and feasible.
Searchable encryption [
Cao et al. [
Since then, scholars have put forward many excellent schemes based on semantic searchable encryption [
The system model considered in this paper is shown in Figure
System model.
The data users need to obtain authorization from the data owner. Then they need to generate the request trapdoor (conceptual graph)
In our scheme, we assume the cloud server is “honest but curious”: it complies with the protocols, but it still hopes to obtain more sensitive information through learning and guessing. In this paper, we focus only on how the cloud can perform similarity search over the encrypted data, which is the same model as adopted in previous literature [
Conceptual graph: Joe buys a necktie from Hal for $10.
Text summarization: Text summarization aims to determine the central content of documents. Methods for automatic text summarization are mainly divided into two categories: extractive and abstractive. Extractive summarization is based on the assumption that the core idea of a document can be summarized by one sentence or a few sentences from the document. In this paper, we first preprocess the document and split it into clauses. Then the words and sentences are expressed as vectors (word2vec) that the computer can understand, and the sentences are ranked by the following models. Bag of Words [ Word Embedding [
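The extractive assumption above can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: real word2vec embeddings are replaced by bag-of-words vectors, and the top-ranked sentence (by cosine similarity to the whole document) stands in for the topic sentence.

```python
# Extractive-summarization sketch: score each sentence by cosine
# similarity to the whole document and return the best-scoring one.
# Bag-of-words vectors are used here in place of word2vec embeddings.
from collections import Counter
import math

def bow(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def topic_sentence(sentences):
    # The sentence most similar to the document as a whole is taken
    # as the extractive summary (the "topic sentence").
    doc_vec = bow(" ".join(sentences))
    return max(sentences, key=lambda s: cosine(bow(s), doc_vec))
```

For example, given three sentences of which two share the document's dominant vocabulary, `topic_sentence` selects one of those two rather than the off-topic sentence.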
Taking into account the above system model, and to address the model's neglect of context semantics, the following design goals will be achieved. Data privacy: our privacy goal is to prevent the cloud from learning private information from the outsourced data, the corresponding index, the user queries, and the search results. Concept attribute access privacy: the cloud cannot know which concept attribute is preferentially queried and extended. Context semantic search: the goal of our scheme is to take context semantic information into consideration when building the conceptual graph, so as to achieve more accurate search.
The searchable encryption scheme [
In our scheme, considering the efficiency of the contextual semantic extension, we only extend the semantic information of the most important topics, construct a semantic network based on the document's conceptual graph, and then establish a corresponding encrypted index.
In this section, we will detail our index construction scheme.
We first introduce the weighted conceptual graph [
Weighted Conceptual graph.
In our idea, the initial importance of each concept should be equal. We then define importance as follows.
The more times a concept type appears in a document, or the more grammatical relations there are between that concept type and other key attributes, the more important it is.
So after we have extracted the central sentence and constructed the corresponding conceptual graph, we obtain all of its concept attribute values (rectangles) and calculate the term frequency (each sentence is treated as a document) and the document frequency of each concept attribute value within its sentence. We use the TF-IDF algorithm to obtain the weight, and we label each concept value in the conceptual graph with its corresponding weight.
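The weighting step described above can be sketched as follows. The exact formula is an assumption; the paper only states that term frequency and document frequency are combined, so a standard smoothed TF-IDF is used here as a stand-in.

```python
# Concept-attribute weighting sketch: each sentence is treated as a
# "document", and an attribute value is weighted by TF-IDF within its
# sentence. The smoothed-IDF form below is an assumed concrete choice.
import math

def attribute_weight(attr, sentence, sentences):
    words = sentence.lower().split()
    tf = words.count(attr.lower()) / len(words)          # term frequency
    df = sum(1 for s in sentences if attr.lower() in s.lower().split())
    idf = math.log((1 + len(sentences)) / (1 + df)) + 1  # smoothed IDF
    return tf * idf
```

An attribute absent from the sentence receives weight zero, while a frequent attribute that is rare across other sentences receives the highest weight, matching the importance definition above.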
Thus, we can effectively obtain topic attributes by statistically weighting concepts, which helps us to extend its context-sensitive semantics. Suppose we obtain the subject sentence of the document: “Apple will launch four high-performance and large-memory iPhones in 2018.” Our weighted conceptual graph for the sentence is shown in Figure
Weighted Conceptual graph for the subject sentence.
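One possible in-memory form of the weighted conceptual graph for the example sentence is shown below. The relation labels and the concrete weight values are illustrative assumptions, not values taken from the paper's figure.

```python
# Illustrative weighted conceptual graph for the example sentence
# "Apple will launch four high-performance and large-memory iPhones
# in 2018": concept attribute values carry TF-IDF-style weights, and
# edges record grammatical relations. Labels/weights are hypothetical.
graph = {
    "concepts": {            # attribute value -> weight
        "Apple": 0.42,
        "iPhones": 0.38,
        "four": 0.10,
        "2018": 0.10,
    },
    "relations": [           # (source, grammatical relation, target)
        ("Apple", "agent", "launch"),
        ("launch", "object", "iPhones"),
        ("iPhones", "quantity", "four"),
        ("launch", "time", "2018"),
    ],
}
```

Separating weighted attribute values from relation edges mirrors the two roles the scheme needs: weights select which attributes to extend, and relations preserve the sentence's semantic structure.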
For the topic sentence of Figure
Our context-sensitive semantic expansion scheme is based on the assumption that attributes which frequently co-occur in documents are statistically relevant to the same topic. Therefore, we can capture the connection between attributes by statistically analyzing contextual relationships across the document collection.
We have the following definitions.
The vector of the concept attribute
In our scheme, we define that extended words and key attributes must belong to the same sentence. And it is generally believed that the closer a word is to the key attribute in the document, and the more times the word appears around the key attribute, the stronger their relevance,
Relevance between concept attribute and the word:
After calculating the relevance of all the extended words, we need to calculate the relevance of each extended word to the subject sentence.
The relevance of the word
Q is a set of all the different concept attributes in the key sentence.
When selecting the extended word, we need to calculate the relevance of the word and the key attributes. At the same time, through Definition
Context-based extension conceptual graph.
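The extension step described above can be sketched as follows. The `1/(1 + distance)` decay is an assumed concrete form of the stated intuition (closer and more frequent co-occurrence means stronger relevance); the paper's actual formula may differ.

```python
# Context-based extension sketch: a candidate word's relevance to a key
# attribute accumulates over sentences where both co-occur, weighted by
# an assumed 1/(1+distance) decay so nearer co-occurrences count more.
def relevance(word, attr, sentences):
    score = 0.0
    for s in sentences:
        tokens = s.lower().split()
        if word in tokens and attr in tokens:
            dist = abs(tokens.index(word) - tokens.index(attr))
            score += 1.0 / (1.0 + dist)   # closer co-occurrence counts more
    return score

def extend_attribute(attr, sentences, candidates, top_k=2):
    # Select the top-k candidate words most relevant to the key attribute.
    ranked = sorted(candidates, key=lambda w: relevance(w, attr, sentences),
                    reverse=True)
    return ranked[:top_k]
```

The selected words are then attached to the key attribute in the conceptual graph, yielding the context-based extension shown in the figure.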
Similarly, for the user’s query sentence, we also need to construct a corresponding conceptual graph. And in order to return the search results which best match the user’s search intent, our paper adopts the method of [
After we obtain the context conceptual graph, we need to construct corresponding index structure, which can store all semantic information of conceptual graph. We take Figure
First, we design two vectors for the index. The first vector is mainly used to match the semantic structure of the query request. The second vector stores the weight of each semantic role, so that we can identify the theme of the document. In our scheme, we ignore the concept type information in the conceptual graph because it is dispensable for our semantics. Meanwhile, we construct a hash table to store the corresponding concept attribute values. Extended concept attributes need only be stored in the corresponding vector slot, so that the semantic information of the entire conceptual graph can be completely stored in our index structure.
The construction process is as follows:
For the first vector
The index structure.
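The two-vector-plus-hash-table layout described above can be sketched as follows. The fixed role order and the role names are assumptions made for illustration; the paper's figure defines the actual layout.

```python
# Index-structure sketch: a structure vector marks which semantic roles
# are present, a weight vector holds each role's weight, and a hash
# table maps role positions to the (possibly extended) attribute values.
ROLES = ["agent", "action", "object", "quantity", "time"]  # assumed order

def build_index(roles_to_attrs, weights):
    structure = [1 if r in roles_to_attrs else 0 for r in ROLES]
    weight_vec = [weights.get(r, 0.0) for r in ROLES]
    table = {i: roles_to_attrs[r] for i, r in enumerate(ROLES)
             if r in roles_to_attrs}
    return structure, weight_vec, table

# Extended attributes ("smartphones") sit beside the original value in
# the same slot, so the extension adds no new vector dimensions.
s, w, t = build_index(
    {"agent": ["Apple"], "action": ["launch"],
     "object": ["iPhones", "smartphones"]},
    {"agent": 0.4, "action": 0.2, "object": 0.4},
)
```

A query trapdoor built the same way can then be compared slot by slot: the structure vectors decide whether the semantic shapes match, and the weight vectors score how well the themes align.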
Similarly, we can also generate corresponding conceptual graph for user-entered query sentence and also construct corresponding trapdoor structures. For example, the user enters a query statement: “Apple tipped to launch four iPhones in 2018.” We get its trapdoor structure as shown in Figure
The trapdoor structure.
Then, we give our retrieval scheme. The data user generates the vectors and hash table
Algorithm
We use the MRSE framework [
Then cloud server can compare whether
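The inner-product matching at the heart of the MRSE-style comparison can be illustrated in miniature. This is a pared-down sketch of the secure inner-product idea only: the index vector is encrypted as M^T p and the trapdoor as M^(-1) q, so their dot product equals p . q while the cloud never sees the plaintext vectors. The random vector splitting of full MRSE is omitted, and the 2x2 matrix is a toy example.

```python
# Secure inner-product sketch (MRSE-style, splitting step omitted):
# enc_index . enc_trapdoor = (M^T p) . (M^-1 q) = p^T (M M^-1) q = p . q,
# so ranking by similarity works on encrypted vectors.
M     = [[2, 1], [1, 1]]           # secret invertible matrix (toy example)
M_inv = [[1, -1], [-1, 2]]         # its inverse (determinant = 1)

def mat_vec(A, v):
    return [sum(A[i][j] * v[j] for j in range(len(v))) for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

p = [3, 5]                               # plaintext index vector
q = [2, 7]                               # plaintext query vector
enc_index    = mat_vec(transpose(M), p)  # stored on the cloud
enc_trapdoor = mat_vec(M_inv, q)         # submitted by the data user
assert dot(enc_index, enc_trapdoor) == dot(p, q)  # similarity preserved
```

Because only the inner product is preserved, the cloud can rank documents by similarity without recovering p or q, which is exactly the property the comparison step above relies on.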
In essence, our proposed scheme only adds some further post-processing compared with the method in [
In this section, to assess the feasibility of our scheme, we use Java + Stanford NLP to build our experimental platform. Our implementation platform is a Windows 7 server with a 2.85GHz Core CPU. The dataset is a real-world dataset: CNN set (
Precision means that users can retrieve what they want based on their query sentence. In our scheme, we expand the conceptual graph based on context semantic information. To achieve a balance between security and precision, we use a 2-layer index to store all the semantic information of the conceptual graph together with the extended context semantic information. Thus, the retrieval of our scheme achieves both broader coverage and higher precision.
In our scheme, we need to segment the documents of the dataset and remove stop-words. We obtain topic sentences via word2vec, word embedding, and other NLP methods, but we do not include this time in our measurements because it depends on the corpus. Thus the index construction time consists of two parts: performing syntactic analysis of the subject sentence, and constructing and encrypting the corresponding index.
We can see in Figure
Index Construction overhead for 1000 documents.
Scheme | Index vectors size | Time of index vectors for each file
---|---|---
MRSE [ | 12898KB | 0.9s
USSCG [ | 8394KB | 1.79s
Our | 10738KB | 1.84s
The time cost for generating index vectors in MRSE [
The time cost for generating index vectors.
Figures
The time cost for query in MRSE [
The time cost for query.
In this paper, for the first time, we take into consideration the relationship between the context's semantic information and the conceptual graph, and we design a semantic search encryption scheme for context-based conceptual graphs. By choosing only the central key attributes in the topic sentence, rather than all attributes, our scheme performs a tradeoff between functionality and efficiency. To generate the conceptual graphs, we apply state-of-the-art techniques, i.e., word embedding and Tregex, a tool for simplifying sentences. Also for the literature [
In the future, we will continue to focus our research on semantic search using grammatical relations and other natural language processing techniques. In addition, we are considering modifying the process of converting a conceptual graph into a numerical vector, which can help improve accuracy and efficiency.
Our dataset is a real-world dataset: CNN set (
We declare that there are no conflicts of interest regarding the publication of this paper.
This work is supported by the National Natural Science Foundation of China under grant U183610040, 61772283, U1536206, U1405254, 61602253, 61672294, 61502242; by the National Key R&D Program of China under grant 2018YFB1003205; by China Postdoctoral Science Foundation (2017M610574); by the Jiangsu Basic Research Programs-Natural Science Foundation under grant numbers BK20150925 and BK20151530; by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; by the Major Program of the National Social Science Fund of China (17ZDA092), Qing Lan Project, Meteorology Soft Sciences Project; by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China.