Multimedia hashing is a useful technology for multimedia management, e.g., multimedia search and multimedia security. This paper proposes a robust multimedia hashing scheme for processing videos. The proposed video hashing constructs a high-dimensional matrix via gradient features in the discrete wavelet transform (DWT) domain of the preprocessed video, learns low-dimensional features from the high-dimensional matrix via multidimensional scaling, and calculates the video hash by ordinal measures of the learned low-dimensional features. Extensive experiments on 8300 videos are performed to examine the proposed video hashing. Performance comparisons reveal that the proposed scheme is better than several state-of-the-art schemes in balancing the performances of robustness and discrimination.
1. Introduction
In the digital era, multimedia (e.g., image and video) is easily captured via smart devices, such as smartphones and iPads. Many people like to use multimedia to record their lives and share it with friends on the Internet. Consequently, a large amount of multimedia data is stored on cloud servers, and efficient technologies of multimedia management, e.g., multimedia search and multimedia security [1–3], are thus in demand. To address these issues, robust multimedia hashing schemes, such as audio hashing [4, 5], image hashing [6, 7], and video hashing [8, 9], have been proposed in recent years to process different kinds of multimedia data. Robust multimedia hashing maps multimedia data to a content-based short sequence called a hash and finds many applications [10–14], such as copy detection, authentication, tampering detection, forensics, and retrieval. In this work, we propose a novel multimedia hashing based on multidimensional scaling (MDS) and ordinal measures for processing videos.
Generally, multimedia hashing for videos should identify visually similar videos, which are generated by manipulating videos with normal digital operations such as compression and filtering. This property of video hashing is called robustness. As there are many different videos in practical applications, video hashing should also meet the property called discrimination, which ensures that video hashing can efficiently distinguish different videos among massive videos. Note that discrimination and robustness are the two basic properties of video hashing. For specific applications, video hashing should satisfy additional properties; for example, it should be key-dependent for video authentication and forensics.
Many scholars have designed diverse video hashing schemes in the past years. As the discrete cosine transform (DCT) has been widely used in compression techniques, such as JPEG compression and MPEG-2 compression, it is extensively investigated in video hashing design. A well-known robust video hashing was introduced by De Roover et al. [15]. They computed key frames from the video sequence and derived a hash vector from every key frame via DCT-compressed radial luminance projections. This key frame-based scheme can resist slight geometric deformation and temporal subsampling, but it is time-consuming due to the high computational cost of the radial luminance projection. Coskun et al. [16] investigated the use of DCT in video hashing and presented two effective hashing schemes using the classical basis set and a randomized basis set; both schemes can withstand blurring and MPEG-4 compression. In another study, Li [17] calculated random pixel cubes, applied 3D DCT to these cubes, and exploited energy relationships of DCT coefficients to generate the video hash. This scheme is secure and robust to MPEG-2 compression. Mao et al. [18] jointly exploited 3D DCT and the classical locality-sensitive hashing (LSH) to design video hashing for copy detection. This DCT-LSH scheme has a high precision rate. Esmaeili et al. [19] generated a temporally informative representative image (TIRI) from video frames by calculating a weighted sum of frames, applied 2D DCT to overlapping blocks of every TIRI, and selected the first vertical and the first horizontal coefficients to construct the hash. The TIRI-DCT scheme is robust to noise and frame loss. Setyawan and Timotius [20] calculated the video hash by using the edge orientation histogram (EOH) and DCT. The EOH-DCT scheme is robust against luminance modification and MPEG compression.
Besides DCT, other useful techniques are also used in video hashing. For example, Mucedero et al. [21] calculated a standardized video by filtering and resampling the input video and extracted a robust hash by computing the minimum block values from the matrix of block-based pixel variances. This scheme can identify videos compressed with MPEG-2 or MPEG-4. Xiang et al. [22] exploited Gaussian filtering and the luminance histogram to design video hashing that can resist geometric attacks. Sun et al. [23] jointly used the TIRI [19] and a visual attention model to build a novel video hashing scheme with weighted matching, which demonstrates good performances in terms of recall and precision rates. Li and Monga [24] viewed videos as tensors and exploited subspace projections of tensors, i.e., low-rank tensor approximations (LRTA), to calculate the video hash. The LRTA hashing is resilient to blurring, compression, and rotation. In another work, Li and Monga [25] proposed to represent videos by graphs and used structural graphical models to derive the video hash. This scheme can generate a compact hash without losing detection performance. As TIRI-based video hashing schemes have received much attention, Liu et al. [26] exploited dynamic and static attention models to develop a novel temporal weighting method for TIRI generation, which helps to improve hash performance.
Recently, motivated by the ring partition reported in [27], Nie et al. [28] exploited a spherical torus (ST) to conduct video partition and used nonnegative matrix factorization (NMF) to extract the hash. The ST-NMF hashing is robust to noise and blurring. Sun et al. [29] extracted attention features via a visual attention model and combined them with visual-appearance features via a deep belief network to generate the hash. This hashing scheme can resist Gaussian noise, Gaussian blurring, and median filtering. Rameshnath and Bora [30] utilized the temporal wavelet transform (TWT) to generate TIRIs and conducted random projection with Achlioptas's random matrix (ARM). The TWT-ARM hashing shows good robustness against MPEG-4 compression, watermark insertion, and Gaussian blurring. In another work, Tang et al. [31] used DCT to construct feature matrices and learned the video hash from the matrices via NMF. This hashing is resistant to frame scaling and frame rotation with a small angle. Chen et al. [32] used low-rank and sparse decomposition (LRSD) to calculate the feature matrix of each frame and exploited 2D DWT to perform feature compression. The LRSD-DWT hashing can resist MPEG-4 compression, Gaussian low-pass filtering, and frame rotation with a small angle.
From the above survey, it can be found that many reported video hashing schemes achieve good robustness against some digital operations, but they do not yet reach a desirable balance between robustness and discrimination. To address this issue, we jointly exploit DWT, gradient information, MDS, and ordinal measures to develop a novel video hashing that makes a good balance between the two performances. Compared with the existing video hashing schemes, the main contributions of the proposed video hashing are as follows.
A high-dimensional matrix is constructed by using gradient features in the DWT domain of the preprocessed video. Since gradient information can measure structural image features, which remain almost unchanged after digital operations, gradient features effectively capture the visual content of a video frame. Therefore, the gradient features based high-dimensional matrix can guarantee good robustness against digital operations and distinguish videos with different contents.
Low-dimensional features are learned from the high-dimensional matrix via MDS. The MDS is an efficient technique of data dimensionality reduction. It can effectively learn discriminative low-dimensional features from the high-dimensional data by preserving original relationship of the data. So the learned low-dimensional features can make discriminative and compact video hash.
Video hash is generated by using ordinal measures of the low-dimensional features. The ordinal measures are robust and discriminative features, and their elements are all integers. Therefore, the use of the ordinal measures can contribute to a robust and discriminative video hash with short length.
Extensive experiments are performed to test the proposed video hashing. The results reveal that the proposed video hashing has a good robustness and desirable discrimination. Comparisons with several state-of-the-art hashing schemes demonstrate that the proposed video hashing is better than the compared schemes in balancing the performances of robustness and discrimination. The rest of the paper is structured as follows. The proposed video hashing is explained in Section 2. The experimental results and comparisons are discussed in Section 3 and Section 4, respectively. Conclusions are presented in Section 5.
2. Proposed Video Hashing
The proposed video hashing can be decomposed into four components. Figure 1 shows the four components of the proposed scheme. The first component is the preprocessing which is used to make a normalized video. The second component is the high-dimensional matrix construction by using gradient features in the DWT domain. The third component is to learn low-dimensional features from the high-dimensional matrix via MDS. The final component is to calculate a compact hash by using ordinal measures of the learned low-dimensional features. The details of these components are introduced below.
Components of the proposed scheme.
2.1. Preprocessing
Temporal-spatial resampling is applied to the input video. Specifically, temporal resampling is first used to map different videos to the same frame number: the pixels at the same position across all frames are orderly picked to form a pixel tube, and every pixel tube is mapped to a fixed length M by linear interpolation. After this operation, a video with M frames is generated. Next, spatial resampling converts the frame resolution to a standard size N × N by bicubic interpolation. Consequently, a video with M frames sized N × N is generated after the temporal-spatial resampling.
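As a concrete illustration, the temporal resampling step can be sketched in Python/NumPy (a hypothetical implementation, not the authors' MATLAB code; the video is modeled as a (T, H, W) array, and the spatial bicubic resizing to N × N is omitted here):

```python
import numpy as np

def temporal_resample(video, M=256):
    """Resample a video (T, H, W) to exactly M frames by linearly
    interpolating each pixel tube along the time axis."""
    T = video.shape[0]
    src = np.arange(T)                      # original frame indices
    dst = np.linspace(0, T - 1, M)          # M evenly spaced time points
    out = np.empty((M,) + video.shape[1:], dtype=np.float64)
    for i in range(video.shape[1]):
        for j in range(video.shape[2]):
            # one pixel tube: the same (i, j) position across all frames
            out[:, i, j] = np.interp(dst, src, video[:, i, j])
    return out
```

The per-tube loop mirrors the description in the text; a vectorized interpolation along axis 0 would be faster in practice.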
If the input video is an RGB color video, the resized video is converted to the well-known HSI color space, and the intensity component "I" in the HSI space is selected to represent the resized video. The HSI space is generally described as a conical color space. Conversion from the RGB space to the HSI space can be implemented by the following rules:

\[ I = \frac{1}{3}(R + G + B), \tag{1} \]

\[ S = 1 - \frac{3\min(R, G, B)}{R + G + B}, \tag{2} \]

\[ H = \begin{cases} \theta, & \text{if } B \le G, \\ 360^\circ - \theta, & \text{if } B > G, \end{cases} \tag{3} \]

\[ \theta = \cos^{-1}\frac{[(R - G) + (R - B)]/2}{[(R - G)^2 + (R - B)(G - B)]^{1/2}}, \tag{4} \]

in which R, G, and B are the red, green, and blue channels of a pixel, and I, S, and H are the intensity, saturation, and hue components, respectively.
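Since only the intensity component is carried forward by the hash, the conversion actually needed reduces to equation (1). A minimal NumPy sketch (a hypothetical helper, assuming an (H, W, 3) RGB frame):

```python
import numpy as np

def rgb_to_intensity(frame):
    """Intensity component of the HSI space: I = (R + G + B) / 3."""
    R, G, B = frame[..., 0], frame[..., 1], frame[..., 2]
    return (R + G + B) / 3.0
```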
2.2. High-Dimensional Matrix Construction
Structural features are important image features: they effectively describe the visual content of a video frame and remain almost unchanged after digital operations. Since the image gradient [33] can measure structural features, gradient features are used to construct a high-dimensional matrix. To extract local video features, the intensity component of the resized video is divided into m groups of video frames, and a secondary frame is first calculated for each group. For simplicity, let M be an integer multiple of m, so that every video group contains b = M/m frames. Let Q_l(i, j, k) be the pixel in the i-th row and the j-th column of the k-th frame in the l-th video group, and let F_l be the secondary frame of the l-th video group, whose element in the i-th row and the j-th column is denoted by F_l(i, j). The secondary frame F_l is then determined by

\[ F_l(i, j) = \frac{1}{b}\sum_{k=1}^{b} Q_l(i, j, k). \tag{5} \]
After these operations, there are m secondary frames in total.
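The grouping and averaging of equation (5) can be expressed compactly; this sketch (an assumed implementation) takes the intensity video as an (M, N, N) NumPy array and assumes M is an integer multiple of m, as in the text:

```python
import numpy as np

def secondary_frames(video, m):
    """Average each group of b = M/m consecutive frames into one
    secondary frame, as in equation (5)."""
    M = video.shape[0]
    b = M // m                                # frames per group
    return video.reshape(m, b, *video.shape[1:]).mean(axis=1)
```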
Next, one-level 2D DWT is applied to each secondary frame, yielding four sub-bands: the LL, LH, HL, and HH sub-bands. As the DWT coefficients in the LL sub-band contain the approximation information of the secondary frame, the LL sub-band is used to construct the high-dimensional matrix. Moreover, the DWT coefficients in the LL sub-band are only slightly influenced by compression and noise. Consequently, features extracted from the LL sub-band can make a robust high-dimensional matrix. Suppose that the size of the LL sub-band of the one-level 2D DWT is s × s, where s = ⌈N/2⌉ and ⌈·⌉ is the upward rounding operation.
Let f(x, y, l) be the element at coordinates (x, y) of the LL sub-band of the secondary frame of the l-th video group. Its gradient is the vector

\[ \nabla f = \begin{bmatrix} G_x \\ G_y \end{bmatrix} = \begin{bmatrix} \partial f/\partial x \\ \partial f/\partial y \end{bmatrix}, \tag{6} \]

where G_x and G_y are the partial derivatives, approximately determined as

\[ G_x = f(x + 1, y, l) - f(x - 1, y, l), \tag{7} \]

\[ G_y = f(x, y + 1, l) - f(x, y - 1, l). \tag{8} \]

In general, a gradient feature can be represented by its magnitude r(x, y, l) or its orientation φ(x, y, l), defined in equations (9) and (10), respectively:

\[ r(x, y, l) = \sqrt{G_x^2 + G_y^2}, \tag{9} \]

\[ \phi(x, y, l) = \tan^{-1}\frac{G_y}{G_x}. \tag{10} \]
As the orientation of image structure is changed after rotation, the gradient magnitude is selected as the feature instead of the orientation.
After the calculation of gradient magnitudes, a gradient feature matrix sized s × s is obtained. To extract local gradient features, the matrix of gradient magnitudes is divided into nonoverlapping blocks sized t × t. For simplicity, let s be an integer multiple of t, so that there are n = s/t blocks in both the horizontal and the vertical direction. Let B_{i,j,l} be a block of the gradient feature matrix of the l-th secondary frame, where 1 ≤ i ≤ n and 1 ≤ j ≤ n. The mean of the block B_{i,j,l} is calculated by

\[ c(i, j, l) = \frac{1}{t^2}\sum_{k=1}^{t^2} B_{i,j,l}(k), \tag{11} \]

in which B_{i,j,l}(k) is the k-th element of B_{i,j,l}. A gradient feature sequence of the secondary frame F_l is then generated by arranging these means as follows:

\[ \mathbf{c}_l = [c(1, 1, l), c(1, 2, l), \ldots, c(1, n, l), c(2, 1, l), c(2, 2, l), \ldots, c(2, n, l), \ldots, c(n, 1, l), c(n, 2, l), \ldots, c(n, n, l)]. \tag{12} \]
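Equations (7)-(12) can be sketched as follows (a hypothetical implementation: border pixels, where the central difference is undefined, are simply set to zero here, and x is taken as the row index):

```python
import numpy as np

def gradient_block_means(ll, t=16):
    """Central-difference gradient magnitude of an (s, s) LL sub-band,
    followed by means of non-overlapping t x t blocks."""
    Gx = np.zeros_like(ll, dtype=np.float64)
    Gy = np.zeros_like(ll, dtype=np.float64)
    Gx[1:-1, :] = ll[2:, :] - ll[:-2, :]      # f(x+1, y) - f(x-1, y)
    Gy[:, 1:-1] = ll[:, 2:] - ll[:, :-2]      # f(x, y+1) - f(x, y-1)
    r = np.sqrt(Gx ** 2 + Gy ** 2)            # gradient magnitude, eq. (9)
    n = r.shape[0] // t                       # n = s/t blocks per direction
    blocks = r[:n * t, :n * t].reshape(n, t, n, t)
    return blocks.mean(axis=(1, 3)).ravel()   # row-major scan, as in eq. (12)
```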
Finally, a high-dimensional matrix can be constructed by stacking these m feature sequences as follows:

\[ \mathbf{C} = \begin{bmatrix} \mathbf{c}_1 \\ \mathbf{c}_2 \\ \vdots \\ \mathbf{c}_m \end{bmatrix}. \tag{13} \]
Note that the size of the high-dimensional feature matrix is m × q, where q = n2.
2.3. Multidimensional Scaling
To find low-dimensional data from the high-dimensional feature matrix C, a well-known technique of data dimensionality reduction called MDS is exploited in this work. The reasons for selecting MDS to learn low-dimensional data are as follows. (1) MDS has shown good performances in many applications, including object retrieval [34], localization [35], data visualization [36], and image hashing [37]. (2) MDS can effectively learn discriminative low-dimensional features from high-dimensional data by preserving the original relationships of the data. In general, the classical MDS consists of three steps [38], namely, distance matrix computation, inner product matrix calculation, and low-dimensional matrix extraction, which are described as follows.
(1) Distance Matrix Computation. For each row c_i (1 ≤ i ≤ m), the Euclidean distance d_{i,j} between c_i and c_j (1 ≤ j ≤ m) is computed, giving the distance matrix D = [d_{i,j}]_{m×m}:

\[ \mathbf{D} = \begin{bmatrix} d_{1,1} & d_{1,2} & \cdots & d_{1,m} \\ d_{2,1} & d_{2,2} & \cdots & d_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ d_{m,1} & d_{m,2} & \cdots & d_{m,m} \end{bmatrix}, \tag{14} \]

where d_{i,j} is calculated as follows:

\[ d_{i,j} = \sqrt{\sum_{k=1}^{q} [c_i(k) - c_j(k)]^2}, \tag{15} \]

in which c_i(k) and c_j(k) are the k-th elements of c_i and c_j (1 ≤ k ≤ q), respectively.
(2) Inner Product Matrix Calculation. The distance matrix D is transformed into the inner product matrix T by the following equation:

\[ \mathbf{T} = -\frac{1}{2}\mathbf{P}\mathbf{D}\mathbf{P}, \tag{16} \]

in which P is the centering matrix determined by

\[ \mathbf{P} = \mathbf{E} - \frac{1}{m}\mathbf{e}\mathbf{e}^{\mathrm{T}}, \tag{17} \]

where E is the m × m identity matrix and e denotes the m × 1 all-ones vector. Thus, Pe = 0 and P^T = P.
(3) Low-Dimensional Matrix Extraction. Since T is symmetric and positive semidefinite, it can be decomposed as follows:

\[ \mathbf{T} = \mathbf{R}\mathbf{A}\mathbf{R}^{\mathrm{T}}, \tag{18} \]

in which R is an orthogonal matrix of eigenvectors and A is the diagonal matrix of eigenvalues of T. Let

\[ \mathbf{X} = \mathbf{R}\mathbf{A}^{1/2}. \tag{19} \]

A low-dimensional matrix U can then be obtained by selecting the first p columns of X.
The size of the low-dimensional matrix U is m × p (p<q), and p is the selected dimension of MDS.
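The three steps above can be sketched with the textbook formulation of classical MDS (a hypothetical implementation; note that the standard derivation double-centers the matrix of squared pairwise distances before the eigendecomposition):

```python
import numpy as np

def classical_mds(C, p):
    """Classical MDS: embed the rows of C into p dimensions via the
    eigendecomposition of the double-centered squared-distance matrix."""
    m = C.shape[0]
    # pairwise squared Euclidean distances between rows of C
    sq = np.sum((C[:, None, :] - C[None, :, :]) ** 2, axis=2)
    P = np.eye(m) - np.ones((m, m)) / m       # centering matrix, eq. (17)
    T = -0.5 * P @ sq @ P                     # inner-product matrix, eq. (16)
    w, R = np.linalg.eigh(T)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:p]             # keep the p largest eigenpairs
    return R[:, idx] * np.sqrt(np.clip(w[idx], 0, None))  # X = R A^{1/2}
```

For collinear points the one-dimensional embedding preserves the pairwise distances exactly, which is a convenient sanity check.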
To generate a short and discriminative video feature sequence, the variance of each row of the low-dimensional matrix U is calculated. Suppose that u_i = [u_i(1), u_i(2), …, u_i(p)] is the i-th row of the matrix U (1 ≤ i ≤ m). Its variance v_i can be determined by

\[ v_i = \frac{1}{p - 1}\sum_{j=1}^{p} [u_i(j) - \mu_i]^2, \tag{20} \]

where μ_i is the mean of u_i defined by

\[ \mu_i = \frac{1}{p}\sum_{j=1}^{p} u_i(j). \tag{21} \]
After the variance calculation, a video feature sequence is obtained as follows:

\[ \mathbf{v} = [v_1, v_2, \ldots, v_m]. \tag{22} \]
Note that all elements of the feature sequence are floating-point numbers.
2.4. Ordinal Measures
In practice, storing a floating-point number requires many bits in a computer system; for example, 32 bits are needed for a single-precision number according to the IEEE standard [39]. Therefore, the length of the feature sequence is 32m bits, which is large when m is big. To reduce the storage cost of the feature sequence, a technique called ordinal measures (OM) [40] is used to conduct quantization. The OM technique can extract short and robust feature codes and has been used in many applications of image and video processing [41–45]. Generally, the feature codes of the OM technique are calculated by sorting a data sequence in ascending order and taking each element's position in the sorted sequence. To better understand the OM technique, an example is presented in Table 1, where eight numbers of an original data sequence are listed in the second row, their sorted version in ascending order is presented in the third row, and the corresponding OM feature codes are shown in the fourth row. For example, the 1st element of the original data sequence is 20, which is ranked at the 6th position in the sorted sequence, so its OM feature code is 6. The OM codes of the other elements are determined by similar calculation.
Table 1. An example of the OM technique.

Position            1    2    3    4    5    6    7    8
Original sequence   20   18   22   10   7    15   25   12
Sorted sequence     7    10   12   15   18   20   22   25
Ordinal measures    6    5    7    2    1    4    8    3
Here, the OM codes of the elements of the feature sequence v are selected as the elements of our video hash. Suppose that h_i is the OM code of v_i (1 ≤ i ≤ m). The video hash h is then

\[ \mathbf{h} = [h_1, h_2, \ldots, h_m]. \tag{23} \]

Therefore, the video hash consists of m integers. Since each integer needs at most ⌈log₂ m⌉ bits, the hash length is m⌈log₂ m⌉ bits, where ⌈·⌉ is the upward rounding operation.
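Following Table 1, the OM codes are the 1-based ranks of the elements in the ascending sort; a small NumPy sketch (assumed implementation):

```python
import numpy as np

def ordinal_measures(v):
    """OM codes: the 1-based rank of each element of v in the
    ascending-sorted sequence, as in Table 1."""
    order = np.argsort(v, kind="stable")      # indices in sorted order
    ranks = np.empty(len(v), dtype=int)
    ranks[order] = np.arange(1, len(v) + 1)   # invert the permutation
    return ranks
```

Applied to the original sequence of Table 1, this reproduces the fourth row of the table.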
3. Experimental Results
To measure hash similarity, the L2 norm is taken as the distance metric in the experiments. Let h^{(1)} = [h_1^{(1)}, h_2^{(1)}, …, h_m^{(1)}] and h^{(2)} = [h_1^{(2)}, h_2^{(2)}, …, h_m^{(2)}] be two video hashes. The L2 norm is calculated by

\[ D(\mathbf{h}^{(1)}, \mathbf{h}^{(2)}) = \sqrt{\sum_{i=1}^{m} \left[h_i^{(1)} - h_i^{(2)}\right]^2}, \tag{24} \]

in which h_i^{(1)} and h_i^{(2)} are the i-th elements of h^{(1)} and h^{(2)}, respectively. In general, similar videos are expected to have similar hashes, so their L2 norm should be small; conversely, different videos should have different hashes, and the corresponding L2 norm should be large. Therefore, two videos are considered similar if the L2 norm of their hashes is smaller than a threshold; otherwise, they are identified as different videos.

In the following experiments, the parameters of the proposed hashing scheme are set as follows. The input video is converted to 256 × 256 × 256, the number of video groups is 32, the block size is 16 × 16, and the selected dimension of MDS is 30. In other words, M = 256, N = 256, m = 32, s = N/2 = 128, t = 16, and p = 30. Therefore, the hash length is 32 × ⌈log₂ 32⌉ = 160 bits. The robustness analysis, discriminative performance, dimension discussion, and group number selection are presented in the sections below.
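The similarity decision described above can be sketched as follows (hypothetical helpers; the threshold value is application-dependent, as the later experiments show):

```python
import numpy as np

def hash_distance(h1, h2):
    """L2 norm between two video hashes, equation (24)."""
    d = np.asarray(h1, dtype=np.float64) - np.asarray(h2, dtype=np.float64)
    return np.sqrt(np.sum(d ** 2))

def is_similar(h1, h2, threshold):
    """Two videos are judged similar if their hash distance is below the threshold."""
    return hash_distance(h1, h2) < threshold
```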
3.1. Robustness Analysis
To examine the robustness of the proposed hashing scheme, 100 different videos are selected from an open video database [46]. These videos cover different topics, including “algae,” “anemones,” “ascidians,” “bioerosion,” “black coral,” “bryozoans,” “caves,” “cleaning station,” “crustaceans,” “hurricane impacts,” “jellyfish,” “plankton,” and “seagrass.” Their frame resolutions range from 360 × 288 to 512 × 288. Some typical samples are shown in Figure 2. To produce similar videos of these 100 videos, eleven video operations are utilized to perform robustness attacks, each with several different parameters. The used operations include brightness adjustment (8 parameters), contrast adjustment (4 parameters), 3 × 3 Gaussian low-pass filtering (10 parameters), salt and pepper noise (10 parameters), additive white Gaussian noise (AWGN) (6 parameters), MPEG-2 compression (10 parameters), MPEG-4 compression (10 parameters), random frame dropping (6 parameters), random frame insertion (6 parameters), frame scaling (8 parameters), and frame rotation (4 parameters). Table 2 presents the settings of the eleven operations. After the robustness attacks, every original video has 82 similar videos. Therefore, the number of similar videos reaches 100 × 82 = 8200, and the total number of videos used in the experiment is 8200 + 100 = 8300.
Some sample videos.
Table 2. Settings of eleven operations.

Operation                          Parameter               Value                              Number
Brightness adjustment              Photoshop's scale       −20, −15, −10, −5, 5, 10, 15, 20   8
Contrast adjustment                Photoshop's scale       −20, −10, 10, 20                   4
3 × 3 Gaussian low-pass filtering  Standard deviation      0.1, 0.2, …, 1                     10
Salt and pepper noise              Density                 0.001, 0.002, …, 0.01              10
AWGN                               Signal-to-noise ratio   1, 2, 3, 4, 5, 6                   6
MPEG-2 compression                 Kilobits per second     100, 200, …, 1000                  10
MPEG-4 compression                 Compression quality     10, 20, …, 100                     10
Random frame dropping              Frame number            1, 2, 5, 10, 15, 20                6
Random frame insertion             Frame number            1, 2, 5, 10, 15, 20                6
Frame scaling                      Ratio                   0.8, 0.85, 0.9, …, 1.2             8
Frame rotation                     Angle (degree)          −1, −0.5, 0.5, 1                   4
Total                                                                                         82
Hash distances under different kinds of operations are calculated. Figure 3 illustrates the mean L2 norm of different operations under specific parameter settings, where the x-axis represents the parameter value of the corresponding operation and the y-axis is the mean L2 norm. Table 3 presents the statistical results of hash distances. From these results, it can be seen that the mean distances of all video operations are smaller than 30 and most maximum distances are also smaller than 60. This means that 60 can be selected as a candidate threshold. In this case, the proposed hashing scheme can correctly identify 99.27% similar videos.
Means of L2 norms under different operations. (a) Brightness adjustment. (b) Contrast adjustment. (c) 3×3 Gaussian low-pass filtering. (d) Salt and pepper noise. (e) AWGN. (f) MPEG-2 compression. (g) MPEG-4 compression. (h) Random frame dropping. (i) Random frame insertion. (j) Frame scaling. (k) Frame rotation.
Table 3. Statistical results of hash distances.

Operation                          Max     Min    Mean    Standard deviation
Brightness adjustment              29.39   0.00   7.01    0.58
Contrast adjustment                26.57   0.00   8.08    0.21
3 × 3 Gaussian low-pass filtering  21.73   0.00   6.60    1.41
Salt and pepper noise              55.75   0.00   7.28    0.85
AWGN                               80.42   4.00   18.46   0.75
MPEG-2 compression                 57.34   2.00   10.89   1.61
MPEG-4 compression                 87.90   1.41   10.73   3.68
Random frame dropping              73.93   4.69   28.45   3.42
Random frame insertion             74.66   4.00   27.07   3.22
Frame scaling                      20.00   0.00   6.99    0.25
Frame rotation                     23.15   2.45   9.21    0.39
3.2. Discriminative Performance
The dataset with 8300 videos mentioned in Section 3.1 is exploited to analyze discriminative performance of the proposed hashing scheme. Specifically, for every original video, the distances between its video hash and the hashes of other 99 × 82 = 8118 attacked videos are calculated. Note that every original video is only compared with the attacked videos of other 99 different videos. In other words, every original video is not compared with its attacked videos in the experiment. Thus, there are 100 × 8118 = 811800 L2 norms in total. Figure 4 shows the L2 norm distribution of different video hashes, where the x-axis is the L2 norm and the y-axis is the frequency of the corresponding L2 norm. It can be observed that the mean distance is 68.55, and most L2 norms are bigger than 40. If the threshold is selected as 40, 0.69% different videos are falsely detected as similar videos. If the threshold decreases to 30, the false detection rate of different videos is 0.07%. Note that a low threshold helps to improve discriminative performance, but it will also decrease robustness performance. Table 4 presents the detailed detection rates under different thresholds. In this table, the robustness performance is measured by the correct detection rate of similar videos and the discriminative performance is indicated by the false detection rate of different videos. In practice, a suitable threshold can be chosen from Table 4 according to the performance requirement of the specific application.
L2 norm distribution of different video hashes.
Table 4. Detection rates under different thresholds.

Threshold   Correct detection rate of similar videos (%)   False detection rate of different videos (%)
80          99.98                                          86.14
70          99.82                                          52.70
60          99.27                                          20.79
50          98.01                                          4.93
40          95.96                                          0.69
30          92.79                                          0.07
3.3. Dimension Discussion
The selected dimension p is the only parameter of MDS. To examine the effect of the selected dimension on the performances of robustness and discrimination, the receiver operating characteristic (ROC) graph [47] is utilized to analyze the experimental results. In the graph, the false positive rate (FPR) and the true positive rate (TPR) are both calculated under the control of a given threshold. Therefore, points with coordinates (FPR, TPR) can be generated by using a set of thresholds, and the ROC curve is obtained by orderly connecting these points. The FPR and TPR are defined as

\[ \text{FPR} = \frac{\text{number of different videos detected as similar ones}}{\text{number of different videos}}, \tag{25} \]

\[ \text{TPR} = \frac{\text{number of similar videos detected as similar ones}}{\text{number of similar videos}}. \tag{26} \]
In practice, the area under the ROC curve (AUC) is calculated to make quantitative comparison. The minimum AUC is 0 and the maximum AUC is 1. A curve with large AUC outperforms the curve with small AUC.
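Assuming pairs are judged by thresholding the hash distance, the ROC points and the AUC can be sketched as follows (a hypothetical implementation; the AUC is approximated by trapezoidal integration over the threshold-swept (FPR, TPR) points):

```python
import numpy as np

def roc_curve(similar_dists, different_dists, thresholds):
    """TPR and FPR of equations (25)-(26) for each threshold, plus the
    AUC obtained by trapezoidal integration of the ROC points."""
    sim = np.asarray(similar_dists, dtype=np.float64)
    dif = np.asarray(different_dists, dtype=np.float64)
    tpr = np.array([(sim < t).mean() for t in thresholds])
    fpr = np.array([(dif < t).mean() for t in thresholds])
    order = np.argsort(fpr, kind="stable")    # integrate along increasing FPR
    fs, ts = fpr[order], tpr[order]
    auc = np.sum(np.diff(fs) * (ts[1:] + ts[:-1]) / 2.0)
    return fpr, tpr, auc
```

For a perfectly separable distance distribution (every similar pair closer than every different pair), the computed AUC is 1.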
In this experiment, the tested parameters are p = 10, 20, 30, 40, and 50. The curves of different p values are shown in Figure 5, and the corresponding AUCs are calculated. The AUC of p = 10 is 0.99371, that of p = 20 is 0.99490, and those of p = 30, p = 40, and p = 50 are all 0.99508. The AUC thus saturates at p = 30: larger p values bring no further improvement while increasing the feature dimension. Therefore, p = 30 is selected as the dimension offering the best trade-off between discrimination and robustness.
Curves of different dimensions.
3.4. Group Number Selection
The effect of the group number on the performances of robustness and discrimination is also discussed. The tested group numbers are m = 8, 16, 32, and 64. Similarly, the ROC curves of different group numbers are calculated for visual comparison, and the results are shown in Figure 6. The AUC of m = 8 is 0.96944, that of m = 16 is 0.99468, that of m = 32 is 0.99508, and that of m = 64 is 0.99419. The proposed hashing scheme reaches the biggest AUC when m = 32, so m = 32 offers the best balance between discrimination and robustness.
Curves of different group numbers.
4. Performance Comparisons
To illustrate the superiority of the proposed hashing scheme, it is compared with some state-of-the-art hashing schemes: the EOH-DCT hashing scheme [20], the ST-NMF hashing scheme [28], the TWT-ARM hashing scheme [30], and the LRSD-DWT hashing scheme [32]. All these selected schemes are designed for videos and were recently reported in the literature. In the comparisons, the videos used in Section 3.1 and Section 3.2 are exploited to test robustness and discrimination, respectively, and all videos are converted to a standard size 256 × 256 × 256 before hash calculation. The distance metrics of the compared schemes for measuring hash similarity are the same as in the original papers: the EOH-DCT and TWT-ARM hashing schemes use the normalized Hamming distance, the ST-NMF hashing scheme uses the Euclidean distance, and the LRSD-DWT hashing scheme uses the Hamming distance. For the proposed hashing scheme, the experimental results under the parameters p = 30 and m = 32 are taken for comparison.
As different hashing schemes use different distance metrics for measuring hash similarity, it is impossible to directly present their similarity results of robustness/discrimination in the same figure with a single distance metric. From the calculation of the ROC curve described in Section 3.3, it can be seen that the ROC curve is a statistical result determined by a set of thresholds and is independent of the selected distance metric. Based on this consideration, the ROC graph is also used to compare the robustness and discrimination performances of different schemes. Figure 7 demonstrates the curves of different hashing schemes, and their AUCs are calculated for quantitative comparison. The AUC of the EOH-DCT hashing scheme is 0.97706, that of the ST-NMF hashing scheme is 0.94710, that of the TWT-ARM hashing scheme is 0.99425, that of the LRSD-DWT hashing scheme is 0.99076, and that of the proposed hashing scheme is 0.99508. Clearly, the AUC of the proposed hashing scheme is bigger than those of all compared hashing schemes, illustrating that the proposed scheme outperforms the compared schemes in balancing the performances of robustness and discrimination. This advantage can be understood as follows. The proposed hashing scheme constructs the gradient features based high-dimensional matrix, which guarantees good robustness against digital operations and efficiently distinguishes videos with different contents. It exploits MDS to learn low-dimensional features from the high-dimensional matrix, which contributes to discrimination and compactness. In addition, the use of ordinal measures enhances the robustness and discrimination of the proposed hashing scheme.
Curves of different schemes.
The time of hash generation is also examined. To this end, the total time of generating the hashes of the 100 original videos is measured to determine the average time per video hash. The coding language is MATLAB, and the used machine is a computer workstation with a 2.1 GHz CPU and 64.0 GB RAM. The average time is 7.24 seconds for the EOH-DCT hashing scheme, 18.45 seconds for the ST-NMF hashing scheme, 6.72 seconds for the TWT-ARM hashing scheme, 37.88 seconds for the LRSD-DWT hashing scheme, and 7.03 seconds for the proposed hashing scheme. Clearly, the proposed hashing scheme is faster than the EOH-DCT, ST-NMF, and LRSD-DWT hashing schemes, but runs slower than the TWT-ARM hashing scheme.

Hash lengths of different schemes are also compared. The length is 60 bits for the EOH-DCT hashing scheme, 2048 bits for the ST-NMF hashing scheme, 128 bits for the TWT-ARM hashing scheme, 256 bits for the LRSD-DWT hashing scheme, and 160 bits for the proposed hashing scheme. As to hash length, the proposed hashing scheme is better than the ST-NMF and LRSD-DWT hashing schemes, but worse than the EOH-DCT and TWT-ARM hashing schemes. The performances of these schemes are summarized in Table 5.
Performance summary.
Scheme              AUC        Hash length (bit)    Time (s)
EOH-DCT hashing     0.97706    60                   7.24
ST-NMF hashing      0.94710    2048                 18.45
TWT-ARM hashing     0.99425    128                  6.72
LRSD-DWT hashing    0.99076    256                  37.88
Proposed hashing    0.99508    160                  7.03
5. Conclusions
A novel video hashing scheme based on MDS and ordinal measures (OM) has been proposed in this paper. In the proposed scheme, a high-dimensional matrix is constructed from gradient features in the DWT domain and then mapped to low-dimensional features via MDS. Since MDS preserves the original relationships of the high-dimensional data in the low-dimensional space, the learned low-dimensional features are discriminative and compact. In addition, the OM codes of the learned low-dimensional features are exploited to generate the video hash. As the OM codes are robust and discriminative features, the use of OM contributes to a short, robust, and discriminative video hash. Extensive experiments on 8300 videos have been performed to test the proposed scheme. The results have revealed that the proposed scheme achieves good robustness and desirable discrimination, and comparisons have demonstrated that it is better than several state-of-the-art schemes in balancing the performances of robustness and discrimination. In the future, we will investigate video hashing schemes based on other useful techniques, such as deep learning and sparse representation.
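The MDS-plus-ordinal-measures stage of this pipeline can be illustrated with a small numpy-only sketch. This is a hedged toy example, not the authors' implementation: the gradient-feature extraction in the DWT domain is omitted, `feature_matrix` is a hypothetical stand-in for the high-dimensional matrix, and classical (Torgerson) MDS is used as one concrete MDS variant:

```python
import numpy as np

def classical_mds(X, n_dims):
    """Classical (Torgerson) MDS: embed the rows of X into n_dims dimensions
    while preserving their pairwise Euclidean distances as well as possible."""
    sq = np.sum(X ** 2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared distance matrix
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n              # centering matrix
    B = -0.5 * J @ D2 @ J                            # double-centered Gram matrix
    vals, vecs = np.linalg.eigh(B)
    idx = np.argsort(vals)[::-1][:n_dims]            # keep the top eigenpairs
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

def mds_om_hash(feature_matrix, n_dims=4):
    """Toy hash pipeline: MDS dimensionality reduction followed by ordinal
    measures (per-row ranks), concatenated into one hash vector."""
    low = classical_mds(feature_matrix, n_dims)
    # Double argsort converts each row of values into its rank order;
    # ranks are invariant to monotonic changes in feature magnitude.
    ranks = np.argsort(np.argsort(low, axis=1), axis=1)
    return ranks.flatten()
```

The rank encoding is what makes the final hash robust: mild content-preserving processing perturbs feature magnitudes but tends to leave their relative ordering, and hence the OM codes, unchanged.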
Data Availability
The dataset used to support the findings of this work can be downloaded from the public website whose hyperlink is provided in this paper.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was partially supported by the National Natural Science Foundation of China (Grant nos. 61962008, 62062013, and 61762017), Guangxi “Bagui Scholar” Team for Innovation and Research, the Guangxi Talent Highland Project of Big Data Intelligence and Application, Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and the Innovation Project of Guangxi Graduate Education (Grant no. XYCSZ2021009).