A Secure and Robust Autoencoder-Based Perceptual Image Hashing for Image Authentication

With the advancement of technology, it has become easy to modify or tamper with digital data. In recent times, image hashing algorithms have gained popularity for image authentication applications. In this paper, a convolutional stacked denoising autoencoder (CSDAE) is utilized for producing hash codes that are robust against different content preserving operations (CPOs). The CSDAE maps high-dimensional input data into hash codes while maintaining their semantic similarities, so images having similar content should have similar hash codes. To demonstrate the effectiveness of the model, the correlation between hash codes of semantically similar images has been evaluated. Subsequently, tamper localization is done by comparing the decoder output of the manipulated image with the output decoded from the hash of the original image. Then, the localization ability of the model is measured by computing the F1 scores between the predicted region and the actual tampered region. Based on the comparative performance and the receiver operating characteristic (ROC) curve, we conclude that the proposed hashing algorithm provides improved performance compared to various state-of-the-art techniques.


Introduction
Recent developments in sophisticated image editing tools have made it very convenient for an impostor to tamper with or forge image contents. These tools allow content to be added to or removed from an image very easily. The identification of manipulation therefore becomes very important for establishing image validity [1,2]. In general, perceptual image hashing strategies address this problem. These techniques extract the most important features from an image for calculating a hash. Hash codes may also be produced by traditional hashing algorithms [3][4][5] such as MD5 or SHA-256. However, these techniques are sensitive to the data, i.e., a change in any single bit results in a different hash code. This behavior is undesirable because digital images are constantly subjected to unintentional modifications such as compression and enhancement. The objective of perceptual hash algorithms is to generate hash codes that respond only to changes in the perceptually significant content of the image. Besides image hashing algorithms, other techniques such as watermarking [6], cryptography [7], and image encryption [8][9][10] have been developed to transmit data through a secured channel or to hide it. However, with such methodologies, it is difficult to detect and localize the tampered region. The effectiveness of a perceptual hash can be measured by its robustness against various content preserving operations and its sensitivity to malicious operations that remove or add content [11][12][13][14].
Autoencoders have been found to be a very effective technique for unsupervised learning of image hash functions due to their ability to discover essential features from unlabeled data. Most of these are fully connected feed-forward networks with a regularized code layer, which forces the model to learn the structure of the data rather than simply copying the input to the output. This motivated us to use a convolutional stacked autoencoder to tackle the problem of creating a perceptual image hash that is robust against enhancement and compression of the image while being sensitive to content removing operations.
Recently, machine learning and deep learning techniques have been employed in various fields of image processing [15,16]. Similarly, the autoencoder utilized in this article is based on an artificial neural network (ANN). The network is trained hierarchically to represent input images in a latent space having 1024 dimensions and to map them back to the dimensions of the input image. We rely on the ability of the autoencoder to learn the underlying features of the dataset without the need for labeled data. L2 regularization is applied to the hash code layer to help the model generalize beyond the training dataset and to prevent it from learning an identity function. The chosen L2 regularization has two effects on the network. First, it suppresses unnecessary components of the hash code, so that the learning problem is solved with the smallest combination of components. Second, it reduces the impact of static noise, which improves the generalization of the model. Tampering detection is done by comparing the correlation between the hash code of the original image and the hash code of the manipulated image with a typical threshold of 0.98. Both of these hash codes are generated by the proposed model. Any image having a correlation less than the threshold is considered tampered. The decoder outputs generated from the respective hash codes are then compared to estimate the probable tampered region in the image.
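The detection rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper does not specify which correlation measure is used, so the Pearson correlation coefficient is assumed here, and the function names `hash_correlation` and `is_tampered` are hypothetical.

```python
import math


def hash_correlation(h1, h2):
    """Pearson correlation between two equal-length hash vectors (assumed measure)."""
    if len(h1) != len(h2):
        raise ValueError("hash codes must have the same length")
    n = len(h1)
    m1 = sum(h1) / n
    m2 = sum(h2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(h1, h2))
    var1 = sum((a - m1) ** 2 for a in h1)
    var2 = sum((b - m2) ** 2 for b in h2)
    if var1 == 0 or var2 == 0:
        # Correlation is undefined for a constant vector; treating it as
        # uncorrelated is an assumption made for this sketch.
        return 0.0
    return cov / math.sqrt(var1 * var2)


def is_tampered(h_original, h_test, threshold=0.98):
    """Flag the test image as tampered when the correlation falls below the threshold."""
    return hash_correlation(h_original, h_test) < threshold
```

With this rule, an image whose hash is identical to the original hash has a correlation of 1 and is accepted, while a strongly altered hash falls below 0.98 and is flagged.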
The robustness of the hash codes is evaluated by measuring the true-positive and false-positive rates for different CPOs on the image. The localization ability of the proposed model is evaluated by computing the F1 scores between the raw difference of the original and the tampered image and the raw difference of the decoder outputs of the original image and the tampered image.

Local and Global Feature-Based Hashing.
In the last few decades, several research works have been carried out in the field of image hashing. In 2005, Monga and Evans [1] used visually significant features together with probabilistic quantization to construct a robust image hash; the disadvantage of this method is that it does not allow the exploration of alternative image recognition and representation based on pseudorandom signals. In 2006, Swaminathan et al. [2] proposed an algorithm based on the Fourier transform and supervised randomization for the generation of an image hash. This technique suffers from the limitation of randomized quantization. Later, in 2007, Monga and Mihçak [3] utilized the nonnegative matrix factorization (NMF) method for generating image hashes. However, the primary disadvantage of this approach is its computational time: hash extraction took 2.03 seconds when the hash length was 300 bits.
In 2009, Wu et al. [4] built a print-scan-resistant image hashing algorithm based on the Radon and wavelet transforms. In the same year, a virtual watermark detection-based image hashing technique was proposed by Khelifi and Jiang [5]. However, these techniques failed to address various geometric attacks such as shearing and were unable to provide adequate performance. In 2010, Ahmed et al. [17] added a hidden key for modulating pixels dynamically, leading to a transformed feature space. The image hash is then computed in this key-dependent transformed space. A 4-bit quantization scheme to shorten the hash is also proposed; however, the drawback of this approach is that it is not robust to brightness changes, contrast enhancement, and tampering that involves smooth changes in gray-level values.

Transform-Based Hashing.
In 2011, Lei et al. [11] incorporated the Radon transform (RT) along with the DFT for constructing a robust image hash. In 2011, Tang et al. and Choi and Park [12,13] suggested a method for creating image hashes based on a lexicographically structured architecture. The scheme has two parts, dictionary development and upkeep, and hash generation; its shortcoming is that it does not address issues such as image rotation, color features, more complex dictionary creation, and maintenance mechanisms. In 2012, Li et al. and Lv and Jane Wang [14,18] presented robust hash functions based on dithered lattice vector quantization and random Gabor filtering. In 2013, Tang et al. [19] discussed a method that produces a hash by transforming the original image into a normalized version, separating the image into rings, and obtaining the entropies of the rings. Afterward, in 2014, Tang et al. [20][21][22][23][24] developed another effective image hash using a ring partition and NMF. The drawback of this technique is that it is not rotation-invariant. Moreover, it cannot address issues such as detection of tampering in small areas, localization of tampering, and effective extraction of color attributes.
In 2015, Sebastian et al. [25] proposed an image hashing technique that uses Haralick and modified local binary pattern features, as well as luminance and chrominance channels. In the same year, Ouyang et al. [26] utilized the log-polar and quadrature DFT transforms for the generation of an image hash, but both these algorithms are sensitive to geometric operations. In 2016, various image hashing methods were developed based on invariant vector distance and ring partition [27], adaptive and local feature extraction [28], quaternion Fourier-Mellin transforms (QFMT) [29], block truncation coding (BTC) [30], local linear embedding (LLE) with DCT [31], the Canny operator with color vector angle [32], center-symmetric local binary patterns [33], and projected gradient nonnegative matrix factorization (PGNMF) with ring partition [34]. However, all these techniques are unable to provide adequate performance in terms of tamper localization and content recovery. In 2017, Tang et al. [35] introduced multidimensional scaling (MDS) for producing a robust image hash for data analysis and object retrieval. In the same year, Karsh et al. [36] used the singular value decomposition (SVD) method to find a low-rank matrix, followed by the discrete wavelet transform (DWT), to generate a robust image hash; however, this method fails to detect color forgery and is more sensitive to translation.
Wireless Communications and Mobile Computing

Statistical Feature-Based Hashing.
Later, in 2018, several new hashing techniques based on image features were developed, which include extraction of structural features from color images [37], dual-cross pattern-based textural features [38], progressive feature point selection [39], and a geometric correction-based technique using local and global features to counter rotation, scaling, and translation (RST) attacks [40]. However, these algorithms cannot authenticate all types of images. Similarly, in 2019, Tang et al. [41] developed a new methodology on the basis of tensor decomposition.
In the same year, Qin et al. [42] integrated local texture and color angle characteristics to generate a robust image hash. However, these methods cannot be applied to video hashing. Recently, to improve the performance of image hashing, various researchers have proposed different techniques such as Binary Multi-View Perceptual Hashing (BMVPH) [43], gray-level co-occurrence matrix-based hashing [44], a Fourier-Mellin transform and fractal coding-based technique to create a fingerprint image [45], fractal image coding and ring partition-based hashing [46], a quadtree structure and color opponent component (COC) based technique for forgery detection and tampering localization [47], and a Laplacian pyramid-based hashing technique [48]. However, these algorithms fail to provide adequate performance against some attacks such as rotation. Additionally, these techniques are unable to provide satisfactory performance in tamper localization. Apart from the above categories, some other hashing algorithms have also been proposed. These are based on deep ordinal hashing [49], deep-network-based hashing [50], deep transfer networks (DTNs) [51,52], image fusion [53,54], etc. Some of the hashing algorithms [49,50] have utilized the t-Distributed Stochastic Neighbor Embedding (t-SNE) technique for visualization of the learned hash features. t-SNE is a dimensionality reduction technique that is particularly well suited for exploring and visualizing high-dimensional data in a low-dimensional space, and it finds patterns in the data based on the similarity of data points. Most of the above-stated works are concerned with making hash values more robust, while others concentrate on localizing the tampered areas. Our objective is to train a single-layered convolutional autoencoder to produce hash values resistant to various geometric attacks while detecting and localizing tampered regions.

The Following Are the Contributions of the Proposed Algorithm
(1) Existing literature suggests that although most methods are robust to content preserving operations, they are very sensitive to geometric operations. In this work, an autoencoder-based image hashing algorithm has been developed to overcome this problem.
(2) The proposed image hashing algorithm is capable of detecting and localizing minor tampered portions in images, unlike most of the existing algorithms.
(3) Experimental results suggest that the presented algorithm is capable of providing improved performance irrespective of the types of images from various databases. For instance, experiments were performed on the CASIA Tampered image detection evaluation database [55], the NITS Image hashing database [56], the USC-SIPI Image database [57], and the Ground Truth Database [58] for tampering detection and localization to check the ability of the proposed algorithm.
(4) A comparative analysis with different state-of-the-art techniques suggests the competitiveness of the proposed algorithm. Performance parameters such as the area under the ROC curve (AUC), true-positive rate (TPR), and false-positive rate (FPR) [59][60][61][62][63] are utilized to evaluate the algorithms.

The remaining part of the paper is structured as follows. Section 2 discusses the relevant literature. Section 3 presents the model architecture, the proposed approach, and implementation details. The experimental results and discussions are presented in Section 4. The comparison with existing works is discussed in Section 5, and the work is concluded in Section 6.

Proposed Model Architecture
The model is built on stacked convolutional autoencoders. In this model, the encoder network maps input images into a latent space, and the decoder reconstructs the original image from this latent representation. Our encoder network includes five convolutional autoencoders followed by a fully connected network whose activations are subjected to L2 regularization. This is the layer that generates the hash code. Each convolutional autoencoder comprises conv-ReLU-batch norm-max pooling in the encoder and upsampling-batch norm-conv in the decoder part. The architecture of the proposed model is shown in Figure 1, whereas the visualization of the learned hashing features of the original host image with the Grad-CAM and Grad-CAM++ techniques is presented in Figure 2. Similarly, the visualization of the learned hashing features for a tampered image is shown in Figure 3. It is observed that the generated heatmap and the Grad-CAM technique gave identical results. Hence, the effectiveness of the proposed method is demonstrated by the visualization of feature activations given by Grad-CAM.
3.1. Convolutional Autoencoder.
The network is made up of M convolutional layers, with a fully connected layer in the middle providing the hash code. The encoder output is obtained by applying convolution, the activation function, and max pooling in each convolutional unit, and the decoder applies upsampling followed by convolution. Here, upsampling refers to bilinear interpolation, after which convolution is done in the decoder part. (W, b) and (W', b') refer to the weights and biases of the convolution layers in the encoder and the decoder network, respectively. The activation function used in our case is ReLU (rectified linear unit). It is used throughout the network except in the fully connected middle layer, where the sigmoid activation function is used. Max-pool refers to the max pooling operation with a stride of 2. The parameters of the model are learned using the Adagrad optimizer. The model learns by minimizing the mean square error between the input $x_0$ and the output $y_0$:

$$L_{\mathrm{MSE}} = \frac{1}{N}\sum_{j=1}^{N}\left\| x_0^{(j)} - y_0^{(j)} \right\|_2^2, \tag{1}$$

where N is the total number of samples in the dataset.
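The layer-wise mappings described in this subsection can be written out as a hedged sketch. The recursion below is an assumption assembled from the stated unit structure (conv-ReLU-batch norm-max pooling in the encoder, upsampling-batch norm-conv in the decoder); the intermediate symbols $h_m$ and $g_m$, the indexing, and the abbreviation BN for batch normalization are introduced here for illustration only and are not taken from the original equations.

```latex
% Encoder (assumed form): h_0 = x_0 is the input image
h_m = \operatorname{maxpool}\Big(\operatorname{BN}\big(\operatorname{ReLU}(W_m * h_{m-1} + b_m)\big)\Big),
\qquad m = 1, \dots, M
% Decoder (assumed form): g_M is the feature map recovered from the hash code
g_{m-1} = W'_m * \operatorname{BN}\big(\operatorname{upsample}(g_m)\big) + b'_m,
\qquad m = M, \dots, 1, \qquad y_0 = g_0
```

Here $*$ denotes convolution, and the fully connected layers that produce the hash code sit between $h_M$ and $g_M$.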

Fully Connected Autoencoder.
Between the encoder and the decoder network, a fully connected autoencoder is placed to reduce the size of the hash code further. It consists of three fully connected layers. The coding for the images is provided by the hidden layer. We test both a fully convolutional stacked autoencoder and one that includes the fully connected autoencoder. The fully convolutional encoder is simpler to train and refine. Because of weight sharing in the convolutional layers, however, it is less capable of approximation. The same MSE loss of Equation (1) is used to train both the fully convolutional model and the model with convolutional plus fully connected layers.

The input images have dimensions (128, 128, 3). The model is trained on the USC-SIPI dataset [57] using 40000 input images. Noisy images are given to the model as input, and the corresponding clean images are given as targets to train the model. The noisy images in this context are original images that have undergone any one of the following operations: changes in brightness or contrast, gamma correction, addition of Gaussian, salt and pepper, or speckle noise, image scaling, rotation, and compression. The input images are passed through 5 convolutional units, which make up the encoder network. Each of these convolutional units comprises a convolutional layer having 16 filters and a batch-normalization layer, followed by a max-pooling layer with a filter size of (3, 3) and a stride of 2. The result is passed to a fully connected autoencoder, which generates the hash code of (1, 48) dimensions. The activations of the middle layer of this fully connected autoencoder are subjected to L2 regularization. The hash code is then passed to a decoder network that attempts to reconstruct the image. The decoder has five convolutional units, each of which comprises an upsampling layer [60] followed by batch normalization and a convolutional layer. With the L2 penalty on the hash code layer, the training loss becomes

$$L = \frac{1}{N}\sum_{j=1}^{N}\left( \left\| x_0^{(j)} - y_0^{(j)} \right\|_2^2 + \alpha \left\| h_j \right\|_2^2 \right), \tag{2}$$

where α is the regularization coefficient and $\| h_j \|_2^2$ is the squared L2 norm of $h_j$, the activations of the hash code layer for the jth image.
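As an illustration of the regularized objective, the following minimal sketch computes a reconstruction error plus an L2 penalty on the hash activations. It assumes that the penalty is simply added to the per-image squared error and averaged over the dataset; images and hash codes are represented as flat lists of floats, and the function name `regularized_loss` is hypothetical. The default `alpha=0.01` is the regularization coefficient reported in the training section.

```python
def regularized_loss(inputs, outputs, hashes, alpha=0.01):
    """Mean over images of squared reconstruction error plus alpha * ||h_j||^2.

    inputs, outputs: lists of flattened images (lists of floats).
    hashes: list of hash-layer activations, one vector per image.
    The exact weighting used by the authors is not known; this form is an assumption.
    """
    n = len(inputs)
    total = 0.0
    for x, y, h in zip(inputs, outputs, hashes):
        reconstruction = sum((a - b) ** 2 for a, b in zip(x, y))
        penalty = alpha * sum(v * v for v in h)
        total += reconstruction + penalty
    return total / n
```

Setting `alpha=0` recovers the plain reconstruction error, which makes the effect of the penalty easy to inspect in isolation.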
For our experiments, we take the hash code layer to have 48 dimensions.
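To make the dimensions concrete, the sketch below traces the spatial size of the feature maps through the five encoder units. It assumes "same" padding, so that each convolution preserves the spatial size and each stride-2 max pooling halves it (rounding up); the paper does not state the padding mode, and the function name `encoder_shapes` is hypothetical.

```python
import math


def encoder_shapes(input_size=128, in_channels=3, filters=16, units=5, stride=2):
    """Return the (height, width, channels) shape after each encoder unit.

    Assumes 'same' padding: convolution keeps the spatial size and
    max pooling with the given stride divides it, rounding up.
    """
    shapes = [(input_size, input_size, in_channels)]
    size = input_size
    for _ in range(units):
        size = math.ceil(size / stride)
        shapes.append((size, size, filters))
    return shapes


shapes = encoder_shapes()
height, width, channels = shapes[-1]
flattened = height * width * channels  # features fed to the fully connected part
```

Under these assumptions, the 128 × 128 × 3 input shrinks to 4 × 4 × 16, i.e., 256 features, which the fully connected autoencoder then compresses to the 48-dimensional hash code.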

Implementation Details.
To construct the model, we used the Keras library with a TensorFlow backend. The model is trained on the online Google Colab deep learning platform using a Tesla K80 GPU. The weights of each convolutional layer are drawn from a Glorot uniform distribution, which is the default setting in Keras. The bias for each of these layers is set to 0. The aforementioned dataset [58] comprises the following folders: "scenery_bmp," "animals," "aerials," "Indonesia," "Japan," "Italy," "operations Italy," "operations animals," "operations Indonesia," "operations scenery." The folders having the name "operations" as their prefix contain the results of the different content preserving operations done on the original images. For example, the "animals" folder comprises the original images, while the "operations animals" folder contains the images produced by the different content preserving operations for each image in the "animals" folder. The first three folders, "animals," "scenery_bmp," and "aerials," and their corresponding "operations_" folders are used for training the model, while the rest of the dataset is used for testing the robustness of the model. The model is tested on 18334 images from "operations Italy," "operations Japan," and "operations Indonesia." The robustness of the hash codes produced is tested on the basis of two metrics, the true-positive rate (TPR) and the false-positive rate (FPR), which are defined below. We compute the TPR and FPR scores for different operations on the images for the three "operations" folders, i.e., "Indonesia," "Italy," and "Japan." A typical value of 0.98 is taken as the threshold for hash correlation. During the testing phase, both the original image and the tampered image are passed through the model, which produces their respective hash codes.
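If the correlation between the hash codes is less than the threshold, the tampered image is classified as tampered. The TPR and FPR are defined as

$$\mathrm{TPR} = \frac{n_{\mathrm{same}}}{N_{\mathrm{similar}}}, \tag{3}$$

$$\mathrm{FPR} = \frac{n_{\mathrm{different}}}{N_{\mathrm{different}}}. \tag{4}$$

In Equations (3) and (4), $n_{\mathrm{same}}$ is the number of pairs of similar images that are correctly detected as the same, and $n_{\mathrm{different}}$ is the number of pairs of different images that are mistakenly detected as the same; $N_{\mathrm{similar}}$ and $N_{\mathrm{different}}$ are the total numbers of pairs of similar images and of different images, respectively. The model is then tested for its tampering localization ability.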
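The two rates can be computed from lists of hash correlations as in the following minimal sketch. It assumes that a pair is declared "same" when its correlation is not less than the threshold, which mirrors the rule that a correlation below the threshold means tampered; the function name `tpr_fpr` and its argument names are hypothetical.

```python
def tpr_fpr(similar_corrs, different_corrs, threshold=0.98):
    """Compute (TPR, FPR) from hash correlations of image pairs.

    similar_corrs: correlations of pairs that are perceptually similar.
    different_corrs: correlations of pairs that are perceptually different.
    A pair counts as 'same' when its correlation is >= threshold (assumption).
    """
    n_same = sum(1 for c in similar_corrs if c >= threshold)
    n_different = sum(1 for c in different_corrs if c >= threshold)
    return n_same / len(similar_corrs), n_different / len(different_corrs)
```

For example, if three of four similar pairs and one of five different pairs reach the threshold, the function returns a TPR of 0.75 and an FPR of 0.2.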
During the tampering-localization test, the model receives the tampered image and the hash code of the original image as input. This hash code was generated when the corresponding original image was passed through the encoder network of the model. The tampered image is passed through the encoder and then the decoder network to give the reconstructed tampered image $n_1$. The hash code of the original image is passed through the decoder to produce $n_2$. Then, the predicted tampered region is obtained from the difference between the two decoder outputs:

$$R = \left| n_1 - n_2 \right|.$$

The predicted tampered region R is compared with the actual tampered region T using the F1 score:

$$F_1 = \frac{2\, n(R \cap T)}{n(R) + n(T)}.$$

Here, $n(R \cap T)$ refers to the total number of tampered pixels that are detected by the model, $n(R)$ refers to the total number of pixels predicted as tampered, and $n(T)$ refers to the total number of tampered pixels in the image. T is the set containing the indexes of all the tampered pixels corresponding to that image. The entire process is represented in Figure 4.
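The localization step can be sketched on small single-channel arrays as follows. This is an illustration under assumptions: the decoder outputs are given as nested lists, a pixel is marked as tampered when the absolute difference exceeds a hypothetical tolerance `tol` (the paper does not state how the raw difference is binarized), and the F1 score is computed in its standard precision-recall form. The names `difference_mask` and `localization_f1` are hypothetical.

```python
def difference_mask(n1, n2, tol=0.1):
    """Pixel indexes where two decoder outputs differ by more than tol (assumed rule)."""
    return {
        (i, j)
        for i, row in enumerate(n1)
        for j, value in enumerate(row)
        if abs(value - n2[i][j]) > tol
    }


def localization_f1(predicted, actual):
    """F1 score between predicted and actual sets of tampered pixel indexes."""
    predicted, actual = set(predicted), set(actual)
    overlap = len(predicted & actual)
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(actual)
    return 2 * precision * recall / (precision + recall)


# Toy 2x2 example: n1 is decoded from the tampered image, n2 from the original hash.
n1 = [[0.0, 0.9], [0.8, 0.0]]
n2 = [[0.0, 0.1], [0.1, 0.0]]
R = difference_mask(n1, n2)
T = {(0, 1), (1, 1)}
score = localization_f1(R, T)
```

In this toy case, one of the two predicted pixels is truly tampered and one of the two tampered pixels is found, so precision and recall are both 0.5.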

Experimental Results
In the following subsections, the training procedure and results are presented. We also assess the localization ability of the proposed model by comparing its F1 scores against those of state-of-the-art methods.

4.1. Training.
The autoencoder is trained in a layer-wise fashion. Here, we freeze the weights of all convolutional units except a single one and train the model for 50 epochs. This process is repeated for all the convolutional units in both the encoder and the decoder network, after which the model is fine-tuned as a whole. During fine-tuning, the model is trained for 100 epochs with a batch size of 400 images, each of size 128 × 128 × 3. The L2 regularization coefficient α is kept at 0.01 for the entire duration of the training.
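The layer-wise procedure amounts to a simple schedule of training phases, which can be sketched as follows. The unit names (`enc_1`, `dec_1`, and so on) and the order in which the units are visited are hypothetical; only the epoch counts and the freeze-all-but-one rule come from the description above.

```python
def layerwise_schedule(encoder_units=5, decoder_units=5,
                       epochs_per_unit=50, finetune_epochs=100):
    """Build the list of training phases for layer-wise training plus fine-tuning.

    Each phase lists which units are trainable, which are frozen, and for how
    many epochs the phase runs. The visiting order is an assumption.
    """
    names = [f"enc_{i + 1}" for i in range(encoder_units)]
    names += [f"dec_{i + 1}" for i in range(decoder_units)]
    phases = []
    for name in names:
        phases.append({
            "trainable": [name],
            "frozen": [other for other in names if other != name],
            "epochs": epochs_per_unit,
        })
    # Final phase: every unit is unfrozen and the whole model is fine-tuned.
    phases.append({"trainable": list(names), "frozen": [], "epochs": finetune_epochs})
    return phases


phases = layerwise_schedule()
total_epochs = sum(phase["epochs"] for phase in phases)
```

With five encoder and five decoder units, this gives ten single-unit phases of 50 epochs followed by one fine-tuning phase of 100 epochs.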

Hash Robustness Test.
We test the robustness of the hash codes produced by the encoder on the "operations Indonesia," "operations Italy," and "operations Japan" folders of our custom dataset [58]. The true-positive and false-positive rates are shown in Table 1. The true-positive rate refers to the fraction of the total testing images that are untampered and are classified as untampered by our model. Similarly, the false-positive rate refers to the fraction of untampered images that are classified as tampered by our model. In this test, the hash correlation threshold is taken as 0.98. Content preserving operations for various parameters are shown in Table 2. A few standard images used for the robustness test are shown in Figure 5.
From Table 2, we can conclude that the proposed model is robust to all the operations listed except rotation, which shows a lower true-positive rate in comparison to all the other operations. The hash correlation for different operations corresponding to the three testing folders is shown in Figure 6. The robustness test based on standard images is shown in Figure 7.

Localization Capability Test.
The tampering localization ability is tested on 763 tampered images [55]. These images are categorized into 3 folders, "large-tampered," "medium-tampered," and "small-tampered," depending on the degree of tampering that the image has undergone. There are 365 large-tampered images, 198 medium-tampered images, and 200 small-tampered images. The tampering localization test process is summarized in the flowchart shown in Figure 8. Each sample comprises three images, the first being the original image, the second being a tampered version of that image [...]. Figure 9 shows some samples where our model has successfully detected all the tampered regions in an image.

Discernibility Test.
Here, the proposed model is tested for its ability to distinguish between two semantically dissimilar images. We took 200 different images [56] and formed all combinations of 2 images, thereby making $\binom{200}{2} = 19900$ samples. We tested the discerning capability of the model by computing the true-positive rate and the false-positive rate. The TPR shows the percentage of the total samples that the model classifies as different. If the correlation between the hash codes produced by the encoder network for the two images of a sample is greater than 0.98, the model classifies the two images as semantically the same; otherwise, it classifies them as different. The histograms of the hash correlation of different image samples are shown in Figure 10. Hash correlation between different tampered image pairs is shown in Figure 11.

Comparison with Existing Work
This section compares the results of the proposed algorithm against classical perceptual hashing algorithms. In Tables 3, 4, 5, and 6, the proposed approach is compared with some of the existing perceptual image hashing algorithms. It is observed from these tables that the hash codes produced by the proposed model are robust against scaling, rotation, watermarking, JPEG compression, gamma correction, and Gaussian low-pass filtering. Table 1 shows true-positive rates for different degrees of rotation. The true-positive rates here refer to the fraction of untampered, rotated images that the model classifies as the same (i.e., the correlation between the hashes of the original and the rotated image is not less than the threshold). From Table 1, it becomes evident that the robustness of the hash codes produced by the proposed approach is competitive with those produced by the state-of-the-art approaches. Despite the fact that we do not use any previously annotated data during training, the performance of the proposed model is better than or comparable to that of the other conventional perceptual hashing algorithms in execution time, area under the ROC curve (AUC), and average time. The ROC curve in Figure 12 consists of several points (TPR, FPR) obtained with different thresholds and is used to assess the trade-off between robustness and discrimination.

Techniques used by the compared methods:

Method                    Technique
…                         Multidimensional scaling (MDS)
L. Chen et al. [41]       Tensor decomposition (TD)
R. K. Karsh et al. [40]   Geometric correction
Z. Tang et al. [27]       Ring partition and invariant vector distance (RP-IVD)
C. Qin et al. [30]        Block truncation coding (BTC)
Q. Shen et al. [64]       Color opponent component and quadtree structure
C. Qin et al. [42]        Weber local binary pattern and color angle representation
R. K. Karsh et al. [36]   DWT-SVD and spectral residual
X. Zhang et al. [19]      Three-dimensional color structure features and luminance gradient
H. Hamid et al. [48]      Laplacian pyramids

Hash lengths of the compared methods:

Method                    Hash length
… [39]                    262 digits
C. P. Yan et al. [28]     302 bits
Z. Tang et al. [20]       206 bits
F. Khelaifi et al. [46]   100 digits
H. Lao et al. [31]        316 bits
Q. Shen et al. [64]       452 digits
L. Du et al. [43]         512 bits
Proposed method           64 bits
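The construction of an ROC curve by sweeping the correlation threshold, and the resulting AUC, can be sketched as follows. This is an illustration under assumptions rather than the evaluation code used in the paper: a pair is declared "same" when its correlation is not less than the threshold, the AUC is computed with the trapezoidal rule, and the endpoints (0, 0) and (1, 1) are added to close the curve. The function names `roc_points` and `auc` are hypothetical.

```python
def roc_points(similar_corrs, different_corrs, thresholds):
    """Return sorted (FPR, TPR) points, one per threshold."""
    points = []
    for t in thresholds:
        tpr = sum(c >= t for c in similar_corrs) / len(similar_corrs)
        fpr = sum(c >= t for c in different_corrs) / len(different_corrs)
        points.append((fpr, tpr))
    return sorted(points)


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    # Add the trivial endpoints so the curve spans FPR from 0 to 1.
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy example in which similar and different pairs are perfectly separable.
similar = [0.99, 1.0]
different = [0.1, 0.5]
curve = roc_points(similar, different, [0.0, 0.3, 0.7, 0.995, 1.01])
area = auc(curve)
```

An AUC close to 1 indicates that some threshold separates similar pairs from different pairs almost perfectly, which is the sense in which the ROC curve summarizes the balance between robustness and discrimination.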

Conclusions and Future Scope
In this paper, a stacked convolutional autoencoder with L2 regularization has been proposed to produce hash values. The hash values are robust not only against enhancement and geometric attacks but also against various content preserving operations such as image compression, the addition of Gaussian and speckle noise, scaling, and watermarking. The convolutional units help the model learn high-level semantic information from the data manifolds. Experiments on a large number of image pairs indicate that the system can detect and locate minor tampering in images. Moreover, it offers a more favorable trade-off between FPR and TPR. After training for only 100 epochs, the proposed model shows performance competitive with that of the state-of-the-art approaches in both the hash robustness test and the tampering localization test. It can even localize tampering when tampering and image rotation take place at the same time, a case that is a significant drawback of current approaches.
The future scope of this work consists of pretraining the network's weight matrices with stochastic, generative neural networks such as Boltzmann machines to achieve quicker convergence, which helps to reduce the mean square error loss. Robustness against CPOs can also be examined with DenseNets. Such a method will be devised in order to shorten the hash code while maintaining model performance.

Data Availability
The processed data are available upon request from the corresponding author.

Conflicts of Interest
The authors declare that they have no conflicts of interest.