Improved CNN-Based Hashing for Encrypted Image Retrieval

As more and more image data are stored in encrypted form in the cloud computing environment, efficiently retrieving images in the encrypted domain has become an urgent problem. Recently, Convolutional Neural Network (CNN) features have achieved promising performance in the field of image retrieval, but their high dimensionality causes low retrieval efficiency, and they are not suitable for direct application to image retrieval in the encrypted domain. To solve these issues, this paper proposes an improved CNN-based hashing method for encrypted image retrieval. First, the input image size is increased to improve the representation ability of the CNN. Then, a lightweight module is introduced to replace a part of the modules in the CNN to reduce the parameters and computational cost. Finally, a hash layer is added to generate a compact binary hash code. In the retrieval process, the hash code is used for encrypted image retrieval, which greatly improves the retrieval efficiency. The experimental results show that the scheme allows effective and efficient retrieval of encrypted images.


Introduction
With the development of cloud computing, more and more companies and individuals store image data on cloud servers. Therefore, how to efficiently retrieve images in the cloud becomes an urgent problem. Cloud computing [1] is an emerging computing paradigm with efficient image storage, which makes it an attractive choice for image retrieval. Despite these benefits, the privacy of image information becomes the main concern for image retrieval in cloud computing.
In order to protect the image information, it is necessary to encrypt the image before it is submitted to the cloud. Widely used encryption methods include chaotic image encryption [2] and the Arnold transform [3]. However, plaintext-domain image retrieval technology cannot be applied directly to retrieval in the encrypted domain. Therefore, how to protect image information in cloud computing while quickly retrieving the images that users need is an urgent problem in the field of encrypted image retrieval.
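For illustration, the Arnold transform mentioned above can be sketched as a coordinate permutation on a square image. This toy implementation (not the specific cipher of any cited scheme) scrambles pixel positions with the cat map and inverts it exactly:

```python
import numpy as np

def arnold_scramble(img: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Arnold cat map: pixel (x, y) moves to ((x + y) mod n, (x + 2y) mod n)."""
    n = img.shape[0]
    assert img.shape[0] == img.shape[1], "Arnold transform needs a square image"
    out = img
    for _ in range(iterations):
        nxt = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                nxt[(x + y) % n, (x + 2 * y) % n] = out[x, y]
        out = nxt
    return out

def arnold_unscramble(img: np.ndarray, iterations: int = 1) -> np.ndarray:
    """Inverse map: pixel (x, y) returns to ((2x - y) mod n, (y - x) mod n)."""
    n = img.shape[0]
    out = img
    for _ in range(iterations):
        prev = np.empty_like(out)
        for x in range(n):
            for y in range(n):
                prev[(2 * x - y) % n, (y - x) % n] = out[x, y]
        out = prev
    return out
```

Because the map only permutes coordinates, the pixel histogram is unchanged; practical schemes therefore combine it with value encryption such as chaotic diffusion.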
In the field of image retrieval, most previous approaches exploit frequency-domain features [4,5] or SIFT [6]. However, these approaches are based on hand-crafted features, which cannot represent the image content comprehensively and therefore yield low retrieval accuracy.
With the development of deep learning, CNNs [7][8][9][10][11] have shown significant performance improvements on various tasks. However, most CNNs have hundreds of layers, which makes the networks inefficient. State-of-the-art lightweight architectures, such as MobileNet [12] and ShuffleNet [13], are more efficient thanks to their network designs. These networks can run in a timely fashion on computationally limited platforms.
Even though CNN-based representations are an appealing solution for image retrieval in the plaintext domain, it is inefficient to directly compute the similarity between two CNN features, such as the 4096-dimensional vectors of the fully connected layer in AlexNet. Recently, some approaches have used deep architectures for hash learning for image retrieval [14,15]. However, most of them target the plaintext domain, and research on the encrypted domain is lacking.
In order to address the above issues, this paper proposes an improved CNN-based hashing method for encrypted image retrieval (DLHEIR). In our method, we increase the size of the input image of the CNN to obtain better features and replace a part of the structure of the DenseNet network with inverted residual blocks to reduce the computational cost and parameters. The improved CNN is used to generate hash codes for encrypted image retrieval.
Our main contributions are as follows: (1) This paper proposes an improved CNN-based hashing method for encrypted image retrieval (DLHEIR). This network learns image representations to generate binary hash codes for rapid image retrieval.
(2) We use images of larger size as input to the CNN to obtain better features. Moreover, the inverted residual block is introduced into our method, which reduces the computational cost and parameters.
The organization of the remaining part is as follows. Section 2 discusses related work. Section 3 introduces the proposed method. Section 4 shows our experimental results, and Section 5 concludes this paper.

Related Work
Content-based image retrieval (CBIR) refers to retrieving the needed information in large-scale multimedia data according to the content of the image. Recently, image retrieval has been applied in many fields, such as image search [16,17] and image steganography [18]. However, it cannot be applied in cloud computing directly because of the privacy of images. The searchable encryption (SE) method enables users to store encrypted data in the cloud and supports data search in the encrypted domain. Xia et al. [19] proposed an encrypted image retrieval scheme (PSSE) in the cloud environment, which uses MPEG-7 visual descriptors as image features; KNN is used to protect the features, and locality-sensitive hashing is used to improve retrieval efficiency. Qin et al. [20] proposed an encrypted image retrieval approach in the cloud computing environment, which employs the improved Harris algorithm and Locality-Sensitive Hashing (LSH) to retrieve encrypted images. Shen et al. [21] proposed a secure content-based image retrieval method, which uses a secure multiparty computation technique to encrypt image features. Cheng et al. [4] proposed an encrypted JPEG image retrieval scheme based on the Markov process, which encrypts DCT coefficients to protect the confidentiality of the JPEG image content. Xia et al. [22] proposed an outsourced CBIR scheme based on the BOEW model. Ferreira et al. [23] proposed a secure framework for outsourcing privacy-protected storage and retrieval in a large shared image repository. Lu et al. [24] proposed a privacy-protecting image retrieval method for encrypted image collections, which uses a set of visual words to represent images and the Jaccard distance to measure the similarity between images. Xia et al. [25] proposed a privacy-preserving image retrieval method based on Scale-Invariant Feature Transform (SIFT) features and Earth Mover's Distance (EMD). Weng et al.
[26] proposed a privacy-preserving framework for an application called outsourced media search. The framework relies on multimedia hashing and symmetric encryption to protect image information. However, these approaches are based on hand-crafted features, which do not consider the global information of the image, resulting in low accuracy for encrypted image retrieval.
CNNs have recently provided an attractive solution for many vision tasks, which is attributed to the ability of CNNs to learn rich image representations; these representations can be applied to the field of image retrieval [27,28]. However, due to the high computational cost of computing the similarity between two CNN features, some approaches use CNNs to automatically learn binary hash codes [29][30][31]. However, these approaches are applicable only in the plaintext domain, and few approaches focus on CNN-based encrypted image retrieval.
In this paper, CNNs are applied to the field of encrypted image retrieval. With the powerful representation ability of CNN features, the accuracy of encrypted image retrieval is improved. At the same time, the retrieval efficiency is greatly improved by using hash codes.

System Model.
The system model is shown in Figure 1; it mainly consists of three parts: the data owner, the cloud server, and the query user.
Data owner has the image dataset M = {m_1, m_2, ..., m_n}. To preserve the image content, the dataset needs to be encrypted, generating the encrypted dataset E = {e_1, e_2, ..., e_n}, where n is the number of images in the dataset. To achieve rapid image retrieval, the data owner also needs to generate the hash codes corresponding to the image dataset. Both the encrypted images and the hash codes are outsourced to the cloud server. The data owner sends the key to the query user upon receiving a retrieval request.
Cloud server stores the encrypted dataset and hash codes from the data owner. When receiving a retrieval request from the query user, the cloud server calculates the similarity between the hash codes from the data owner and the trapdoor of the query image and returns the top-k retrieval results to the query user.
Query user generates the trapdoors for the query images and uploads them to the cloud server. We define the trapdoor as the hash code of the query image, generated with the same method as the data owner uses. After receiving the resulting images, the query user sends a request to the data owner, obtains the key, and decrypts the encrypted images with it.

Overview of the Proposed Method.
The proposed method mainly includes six functions, which are executed by the data owner, cloud server, and query user. The following function is executed by the data owner: (1) Key Generation. The input of the function is the parameter k, and it returns the key K. After user authorization, the data owner sends the key K to the query user for decrypting the encrypted images. The following functions are executed by the query user: (1) Trapdoor Generation. TrapGen(Q) ⟶ HC_q. The input of this function is the query image Q; it constructs the trapdoor by generating the hash code HC_q of the query image.
(2) Image Decryption. Dec(K, R) ⟶ Img_R. The inputs of this function are the key K and the set of similar encrypted images R returned by the cloud server; it decrypts R to obtain the similar images Img_R. The following function is executed by the cloud server: (1) Search. Search(HC, HC_q) ⟶ R. This function calculates the similarity between the hash code HC_q of the query image and the hash codes HC of the encrypted image dataset and returns the set of similar encrypted images R.
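As a minimal sketch, the interfaces above can be stubbed as follows. The XOR keystream cipher and the thresholded "feature" stand in for the scheme's actual encryption method and CNN; the names keygen, enc, dec, and trap_gen are illustrative, not the paper's specification:

```python
import hashlib
import numpy as np

def keygen(k: int) -> bytes:
    """KeyGen(k) -> K: derive a key from the security parameter (toy)."""
    return hashlib.sha256(str(k).encode()).digest()

def enc(key: bytes, img: np.ndarray) -> np.ndarray:
    """Enc(K, m) -> e: XOR keystream cipher (illustrative stand-in)."""
    rng = np.random.default_rng(int.from_bytes(key[:8], "big"))
    stream = rng.integers(0, 256, size=img.shape, dtype=np.uint8)
    return img ^ stream

def dec(key: bytes, e: np.ndarray) -> np.ndarray:
    """Dec(K, R) -> Img_R: XOR with the same keystream is its own inverse."""
    return enc(key, e)

def trap_gen(feature: np.ndarray, th: float = 0.5) -> np.ndarray:
    """TrapGen(Q) -> HC_q: binarize the query's CNN feature (stubbed here)."""
    return (np.asarray(feature) >= th).astype(np.uint8)
```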

Improved Convolutional Neural Network Hashing.
In this section, we introduce our method, which consists of two main components: image preprocessing and network architecture.

Image Preprocessing.
Before training or testing the network, the input images should be resized to the same size. For example, when training and testing DenseNet, all images should be resized to 224 × 224 before being fed into the network. A large image is typically resized to 224 × 224 or 299 × 299 by cropping or warping.
Cropping may lose important information of the image, while warping may change its aspect ratio, and both affect the features extracted by the CNN.
Consequently, in this paper, we increase the input image size of the CNN. Specifically, we compute the maximum image height and width over the dataset and take the larger of the two as the square input size. For the Corel10K dataset, the maximum height and width are 384 and 256, so the input images are resized to 384 × 384.
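The sizing rule above can be sketched as follows; resize_nearest is a deliberately simple nearest-neighbor stand-in for a library resizer and is not the paper's exact implementation:

```python
import numpy as np

def target_size(shapes) -> int:
    """Return the square input size: the maximum height/width over the dataset."""
    return max(max(h, w) for h, w in shapes)

def resize_nearest(img: np.ndarray, size: int) -> np.ndarray:
    """Nearest-neighbor resize of a 2-D image to size x size."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size   # source row for each output row
    xs = np.arange(size) * w // size   # source column for each output column
    return img[np.ix_(ys, xs)]
```

For Corel10K, where image shapes include (384, 256) and (256, 384), target_size yields 384, matching the 384 × 384 input used here.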

Network Architecture. Inverted Residual Block.
The network architecture of our method is shown in Figure 2. Specifically, the image is resized to 384 × 384 as the input of DenseNet201. Then, the inverted residual block is introduced to replace a part of the architecture in the DenseNet, which greatly reduces the computational cost and parameters.
The inverted residual block consists of depthwise separable convolutions. The computational cost c_d of a depthwise separable convolution is shown in the following equation:

c_d = k × k × C_in × W_out × H_out + C_in × C_out × W_out × H_out. (1)

The parameter count p_d of a depthwise separable convolution is computed in the following equation:

p_d = k × k × C_in + C_in × C_out. (2)

For standard convolutions, the computational cost c_s and parameter count p_s are computed by the following equation:

c_s = k × k × C_in × C_out × W_out × H_out, p_s = k × k × C_in × C_out. (3)

Security and Communication Networks
Suppose the input feature map of the depthwise separable convolution has size W_in × H_in × C_in and the output feature map has size W_out × H_out × C_out, where C_in and C_out are the channel numbers, W_in and W_out the widths, and H_in and H_out the heights of the feature maps, respectively, and k × k denotes the kernel size. The computational cost ratio r_c of depthwise separable convolution to standard convolution is shown in the following equation:

r_c = c_d / c_s = 1/C_out + 1/k². (4)

The parameter ratio r_p of depthwise separable convolution to standard convolution is shown in the following equation:

r_p = p_d / p_s = 1/C_out + 1/k². (5)

Equations (4) and (5) show that depthwise separable convolution uses less computation and fewer parameters than standard convolution.
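The cost comparison can be checked numerically with the standard depthwise-separable cost model (the quantities c_s, p_s, c_d, p_d discussed above); the concrete layer sizes below are arbitrary examples, not layers from the paper's network:

```python
def conv_costs(k: int, c_in: int, c_out: int, w_out: int, h_out: int):
    """FLOP and parameter counts for standard vs. depthwise separable conv."""
    c_s = k * k * c_in * c_out * w_out * h_out           # standard conv FLOPs
    p_s = k * k * c_in * c_out                           # standard conv params
    c_d = (k * k * c_in + c_in * c_out) * w_out * h_out  # depthwise + pointwise FLOPs
    p_d = k * k * c_in + c_in * c_out                    # depthwise + pointwise params
    return c_s, p_s, c_d, p_d
```

For a 3 × 3 kernel, both ratios c_d/c_s and p_d/p_s equal 1/C_out + 1/k², i.e., roughly an 8x to 9x saving once C_out is large.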
Hash Layer. In this section, we describe the hash layer. It consists of four components: a convolutional layer, a batch normalization layer, an activation function, and a global average pooling layer. The convolutional layer is a Conv2D layer with N filters of kernel size 1 × 1. For the activation function, we choose the sigmoid, so that the outputs are mapped into (0, 1).
Suppose the input feature map has size l × r × q, where l, r, and q are the height, width, and channel number of the feature map, respectively. The output feature map of the hash layer has size 1 × 1 × N, and the feature AP = {ap_1, ap_2, ..., ap_N} is obtained.
In feature extraction, all images are first resized to 384 × 384 before being fed into the network; the feature AP is extracted from the global average pooling layer, and the binary code is obtained by binarizing AP with a threshold. The hash function is shown in the following equation:

h(ap_i) = 1 if ap_i ≥ th, and 0 otherwise, (6)

where ap_i is a parameter in AP and th is the threshold of the hash function.
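The binarization step can be sketched directly; the default threshold value 0.5 is an assumption (natural for sigmoid outputs), as the exact value of th is not fixed here:

```python
import numpy as np

def hash_binarize(ap, th: float = 0.5) -> np.ndarray:
    """Map sigmoid activations AP in (0, 1) to a binary hash code."""
    return (np.asarray(ap) >= th).astype(np.uint8)
```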

Experimental Results and Analysis
The experiments were performed on the Corel10K dataset [32]. Corel10K is a benchmark dataset for image retrieval; it includes 100 categories, and each category contains 100 similar images. The experiment code was written in Python and Matlab R2016a under Windows 10, using an Intel(R) Core(TM) i7-9700KF CPU @ 3.60 GHz, 16.00 GB RAM, and an Nvidia GeForce RTX 2080 Ti GPU.
In the experiment, 80 images were randomly selected from each category of the Corel10K dataset as the training set, and the remaining images were used as the test set. DenseNet201 was selected as the backbone network. For fine-tuning, we use a model pretrained on the ImageNet dataset. Stochastic gradient descent (SGD) is used as the network optimizer; the learning rate is set to 0.01, the momentum to 0.9, the batch size to 64, and the number of epochs to 200.

Retrieval Precision.
In our experiments, precision was used as the evaluation metric, defined as P_k = k′/k, where k′ is the number of truly similar images among the k retrieved images. In the experiment, we use the test set as the query images and the training set as the retrieval collection to test the retrieval precision. We compare our method with the other methods [6,17]. The experimental results are reported in Figure 3.
As shown in Figure 3, our method clearly outperforms the conventional methods [6,17]. This is because these methods all utilize hand-crafted features, which limits their performance. In particular, the drop in precision is not obvious as k increases, except at k = 100. Also, note that our method with 48-bit hash codes performs better than with the other bit lengths.
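The metric P_k = k′/k used above amounts to the following; the labels stand in for the Corel10K category identifiers:

```python
def precision_at_k(retrieved_labels, query_label, k: int) -> float:
    """P_k = k'/k: fraction of the top-k results sharing the query's category."""
    top = retrieved_labels[:k]
    return sum(1 for lab in top if lab == query_label) / k
```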
We also evaluate the effect of image size on retrieval precision. The experimental results are shown in Table 1.
It is clear from Table 1 that increasing the image scale consistently improves retrieval precision across different hash bit lengths. This is because using larger images is beneficial for performance. A scale larger than the one used in our method, however, would only increase the GPU memory consumption, computational cost, and parameters.

Comparison of Model Parameters and MFLOPs.
In this section, we compare the parameters and MFLOPs of our method with those of the original CNN combined with the hash layer. The experimental results are reported in Table 2.
Floating-point operations (FLOPs) measure the computation cost of CNN models and are widely used for this purpose, for example, in ShuffleNet [13]. As can be seen from Table 2, our method has fewer parameters and MFLOPs.

Efficiency.
This section compares the time consumption of retrieval, feature extraction, index construction, and trapdoor generation.
Time Consumption of Retrieval. To utilize the powerful computing capability of the cloud server, retrieval is performed on the cloud server, and the k most similar images are returned by calculating the Euclidean distance between hash codes. Table 3 presents the time consumption of retrieval when k = 20 images are retrieved.
As can be seen from Table 3, the retrieval time increases as the retrieval collection grows. It is clear that our method achieves better efficiency than [6,17]. This is because our method utilizes low-dimensional binary hash codes, which makes image retrieval efficient.
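The server-side ranking described above can be sketched as follows. Note that for {0, 1} hash codes the squared Euclidean distance equals the Hamming distance, so the same ranking could also be computed with XOR and popcount:

```python
import numpy as np

def retrieve(db_codes: np.ndarray, query_code: np.ndarray, k: int = 20) -> np.ndarray:
    """Return indices of the k database hash codes nearest to the query."""
    diff = db_codes.astype(np.int32) - query_code.astype(np.int32)
    dists = (diff * diff).sum(axis=1)        # squared Euclidean = Hamming here
    return np.argsort(dists, kind="stable")[:k]
```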

Time Consumption of Feature Extraction.
We also compared the time consumption of feature extraction with that of the CSD and SCD descriptors in the MPEG-7 feature extraction method of [17] and with the SIFT feature extraction in [6]. The experimental results are shown in Figure 4, which reports the feature extraction times for different numbers of images. Compared with [6,17], the time consumption of our method is shorter in most cases. This is because the time consumption of feature extraction in our method mainly consists of two parts: loading the model and hashing. Compared with complex conventional methods, our method is more efficient.
Time Consumption of Index Construction. In our method, the similarity is computed directly between two hash codes without index construction, so our method incurs no index construction time. The comparison of index construction time with the schemes of Xia and Qin is shown in Figure 5.
Time Consumption of Trapdoor Generation. We test the time consumption of trapdoor generation compared with [6,17] in Figure 6. Our method consumes more time than these methods.
This is because, in feature extraction, we need to extract features from the deep layers of DenseNet, so the time consumption is greater than that of [6,17].

Security Analysis
(i) The Privacy of Image Content. In our method, the images stored on the cloud server are encrypted with an encryption method, and the key is generated by the data owner. Thus, the privacy of the image content in our scheme is well protected.
(ii) The Privacy of Hash Codes. The hash code may reveal information about the image content. In our method, the hash codes mapped from the feature vectors are protected by a one-way hash function. Thus, the hash codes are well protected.

Conclusion
This paper proposes an improved CNN-based hashing method for encrypted image retrieval. In our method, we increase the input image size of the CNN to obtain better features, replace part of the DenseNet structure with inverted residual blocks to reduce the computational cost and parameters, and add a hash layer for hash code generation. These hash codes are used for encrypted image retrieval. The experimental results show that the method achieves better performance and greatly improves retrieval efficiency. In the future, we plan to design more efficient methods to reduce the burden on users.

Data Availability
Conflicts of Interest

The authors declare that they have no conflicts of interest.