Multimedia hashing is a useful technology for multimedia management, e.g., multimedia search and multimedia security. This paper proposes a robust hashing scheme for videos. The proposed video hashing constructs a high-dimensional matrix from gradient features in the discrete wavelet transform (DWT) domain of the preprocessed video, learns low-dimensional features from this matrix via multidimensional scaling (MDS), and calculates the video hash from the ordinal measures of the learned low-dimensional features. Extensive experiments on 8300 videos are performed to evaluate the proposed video hashing. Performance comparisons reveal that the proposed scheme outperforms several state-of-the-art schemes in balancing robustness and discrimination.

In the digital era, multimedia (e.g., images and videos) is easily captured with smart devices, such as smartphones and tablets. Many people record their lives as multimedia and share it with friends on the Internet. Consequently, a massive amount of multimedia data is stored on cloud servers. Efficient technologies of multimedia management, e.g., multimedia search and multimedia security [

Generally, multimedia hashing for videos should identify visually similar videos, i.e., videos generated by manipulating an original video with normal digital operations such as compression and filtering. This property of video hashing is called robustness. As there are many different videos in practical applications, video hashing should also meet a property called discrimination, which ensures that different videos can be efficiently distinguished within massive video collections. Discrimination and robustness are the two basic properties of video hashing. For specific applications, video hashing should satisfy additional properties; for example, it should be key-dependent for video authentication and forensics.

Many scholars have designed diverse video hashing schemes in the past years. As the discrete cosine transform (DCT) has been widely used in compression techniques, such as JPEG and MPEG-2 compression, it has been extensively investigated in video hashing design. A well-known robust video hashing was introduced by De Roover et al. [

Besides DCT, other useful techniques are also used in video hashing. For example, Mucedero et al. [

Recently, motivated by the ring partition reported in [

From the above survey, it can be found that many reported video hashing schemes achieve good robustness against some digital operations, but they do not yet reach a desirable balance between robustness and discrimination. To address this issue, we jointly exploit the DWT, gradient information, MDS, and ordinal measures to develop a novel video hashing scheme that strikes a good balance between the two performances. Compared with existing video hashing schemes, the main contributions of the proposed scheme are as follows.

A high-dimensional matrix is constructed from gradient features in the DWT domain of the preprocessed video. Since gradient information measures structural image features, which remain almost unchanged after digital operations, gradient features can effectively capture the visual content of video frames. Therefore, the gradient-feature-based high-dimensional matrix guarantees good robustness against digital operations while still distinguishing videos with different contents.

Low-dimensional features are learned from the high-dimensional matrix via MDS. MDS is an efficient technique for data dimensionality reduction: it can learn discriminative low-dimensional features from high-dimensional data while preserving the original relationships of the data. Consequently, the learned low-dimensional features yield a discriminative and compact video hash.

The video hash is generated from the ordinal measures of the low-dimensional features. Ordinal measures are robust and discriminative features whose elements are all integers. Therefore, the use of ordinal measures contributes to a robust and discriminative video hash of short length.

Extensive experiments are performed to test the proposed video hashing. The results reveal that it achieves good robustness and desirable discrimination. Comparisons with several state-of-the-art hashing schemes demonstrate that the proposed video hashing outperforms the compared schemes in balancing robustness and discrimination. The rest of the paper is structured as follows. The proposed video hashing is explained in Section

The proposed video hashing can be decomposed into four components. Figure

Components of the proposed scheme.

Temporal-spatial resampling is first applied to the input video. Specifically, temporal resampling is used to map different videos to the same frame number. To do so, the pixels at the same position across all frames are orderly collected to form a pixel tube. Every pixel tube is then mapped to a fixed length
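The temporal resampling step described above can be sketched as follows. This is a minimal illustration; the interpolation method and the fixed frame number `target_len` are assumptions, as the paper does not fix them here.

```python
import numpy as np

def temporal_resample(video, target_len):
    """Map a video with an arbitrary number of frames to a fixed
    frame number by linearly interpolating each pixel tube (the
    sequence of values at one spatial position across all frames).
    `video` has shape (num_frames, height, width)."""
    num_frames, h, w = video.shape
    src = np.linspace(0.0, 1.0, num_frames)   # original time axis
    dst = np.linspace(0.0, 1.0, target_len)   # resampled time axis
    out = np.empty((target_len, h, w), dtype=np.float64)
    for i in range(h):
        for j in range(w):
            out[:, i, j] = np.interp(dst, src, video[:, i, j])
    return out

clip = np.random.rand(37, 8, 8)       # 37 frames of 8x8 pixels
fixed = temporal_resample(clip, 60)   # mapped to 60 frames
```

Because the first and last sample positions coincide, the first and last frames of the resampled video equal those of the input.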

If the input video is an RGB color video, the resized video is converted to the well-known HSI color space, and the intensity color component “
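Assuming the standard HSI definition in which the intensity is the mean of the three color channels, extracting the intensity plane might look like this (the function name is hypothetical):

```python
import numpy as np

def intensity_component(rgb_frame):
    """Intensity component of the HSI color space: the mean of the
    R, G, and B channels (pixel values assumed in [0, 255])."""
    return rgb_frame.astype(np.float64).mean(axis=2)

frame = np.zeros((4, 4, 3), dtype=np.uint8)
frame[..., 0] = 90     # R
frame[..., 1] = 120    # G
frame[..., 2] = 150    # B
I = intensity_component(frame)   # (90 + 120 + 150) / 3 = 120 everywhere
```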

Structural features are important image features, which can effectively describe the visual content of a video frame and remain almost unchanged after digital operations. Since the image gradient [ can capture such structural features, it is exploited here. The secondary frame of each frame group can be determined by

After these operations, there are

Next, a one-level 2D DWT is applied to each secondary frame. Four sub-bands are obtained after decomposition: the LL, LH, HL, and HH sub-bands. As the DWT coefficients in the LL sub-band contain the approximation information of the secondary frame, the LL sub-band is used to construct the high-dimensional matrix. Moreover, the DWT coefficients in the LL sub-band are only slightly influenced by compression and noise. Consequently, features extracted from the LL sub-band yield a robust high-dimensional matrix. Suppose that the size of the LL sub-band of the one-level 2D DWT is
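A minimal sketch of extracting the LL sub-band with a one-level 2D Haar DWT follows; the Haar basis is an assumption, as the paper does not specify the wavelet here.

```python
import numpy as np

def haar_ll(frame):
    """LL (approximation) sub-band of a one-level 2D Haar DWT,
    computed directly: each orthonormal Haar approximation
    coefficient is the sum of a 2x2 block divided by 2."""
    h, w = frame.shape
    f = frame[:h - h % 2, :w - w % 2]          # crop to even size
    return (f[0::2, 0::2] + f[0::2, 1::2] +
            f[1::2, 0::2] + f[1::2, 1::2]) / 2.0

f = np.ones((4, 4))
ll = haar_ll(f)      # 2x2 sub-band; each coefficient is 4/2 = 2
```

Note the halved spatial resolution: a 64x64 secondary frame yields a 32x32 LL sub-band.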

Let

As the orientation of image structure changes after rotation, the gradient magnitude is selected as the feature instead of the orientation.
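A small sketch of the gradient magnitude computation; central differences via `np.gradient` are an assumption here, and any discrete derivative operator (e.g., Sobel) could be substituted.

```python
import numpy as np

def gradient_magnitude(img):
    """Per-pixel gradient magnitude sqrt(gx^2 + gy^2) computed from
    central-difference derivatives along the two image axes."""
    gy, gx = np.gradient(img.astype(np.float64))
    return np.hypot(gx, gy)

# a horizontal ramp has unit gradient magnitude everywhere
ramp = np.tile(np.arange(8.0), (8, 1))
mag = gradient_magnitude(ramp)
```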

After the calculation of the gradient magnitude, a gradient feature matrix can be generated for each secondary frame by arranging these means as follows:

Finally, a high-dimensional matrix can be constructed by stacking these

Note that the resulting feature matrix is of high dimension.

In order to find low-dimensional data from the high-dimensional feature matrix

(1)

(2)

(3)

The size of the low-dimensional matrix
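As a hedged illustration of the dimensionality reduction step, the following sketch implements classical (Torgerson) MDS, which embeds data points in a lower dimension while preserving their pairwise Euclidean distances; the exact MDS variant and distance measure used in the paper are not fixed by the text.

```python
import numpy as np

def classical_mds(X, k):
    """Classical (Torgerson) MDS: embed the rows of X into k
    dimensions while preserving pairwise Euclidean distances."""
    # squared pairwise distances between the rows of X
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    n = sq.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ sq @ J                     # double centering
    w, V = np.linalg.eigh(B)                  # ascending eigenvalues
    idx = np.argsort(w)[::-1][:k]             # keep the top-k
    L = np.sqrt(np.clip(w[idx], 0.0, None))
    return V[:, idx] * L                      # n x k embedding

# three 3D points forming a 3-4-5 right triangle embed exactly in 2D
X = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
Y = classical_mds(X, 2)
```

For data that truly lies in a k-dimensional subspace, as above, the embedding reproduces all pairwise distances exactly.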

To generate a short and discriminative video feature sequence, the variance of each row of the low-dimensional matrix

After the variance calculation, a video feature sequence is obtained as follows:

Note that all elements of the feature sequence are floating-point numbers.
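The per-row variance step can be sketched in a couple of lines; the matrix values here are purely illustrative.

```python
import numpy as np

# hypothetical low-dimensional matrix (rows are learned features);
# the feature sequence is the variance of each row
Y = np.arange(12.0).reshape(3, 4)
v = Y.var(axis=1)    # one floating-point variance per row
```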

In practice, storing a floating-point number requires many bits in a computer system; for example, 32 bits are needed for a single-precision number according to the IEEE standard [. The ordinal measure (OM) technique is therefore used to convert the feature sequence to integers. In the example below, the 1st element of the original data sequence is 20, which is ranked at the 6th position in the sorted sequence; therefore, its OM feature code is 6. The OM codes of the other elements can be determined by a similar calculation.

An example of the OM technique.

Position | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8
---|---|---|---|---|---|---|---|---
Original sequence | 20 | 18 | 22 | 10 | 7 | 15 | 25 | 12
Sorted sequence | 7 | 10 | 12 | 15 | 18 | 20 | 22 | 25
Ordinal measures | 6 | 5 | 7 | 2 | 1 | 4 | 8 | 3

Here, the OM codes of the elements of the feature sequence
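The OM coding illustrated in the table can be sketched as follows; running it on the table's original sequence reproduces the listed ordinal measures.

```python
import numpy as np

def ordinal_measures(seq):
    """OM code of each element: its 1-based rank position in the
    ascending-sorted sequence."""
    order = np.argsort(seq, kind='stable')
    ranks = np.empty(len(seq), dtype=int)
    ranks[order] = np.arange(1, len(seq) + 1)
    return ranks

seq = np.array([20, 18, 22, 10, 7, 15, 25, 12])  # original sequence
om = ordinal_measures(seq)                       # [6, 5, 7, 2, 1, 4, 8, 3]
```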

To measure hash similarity, the L2 norm is adopted as the distance metric in the experiments. Let
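A minimal sketch of the L2-norm hash distance; the example vectors are hypothetical.

```python
import numpy as np

def hash_distance(h1, h2):
    """L2 norm (Euclidean distance) between two hash vectors;
    a smaller distance indicates more similar videos."""
    diff = np.asarray(h1, float) - np.asarray(h2, float)
    return float(np.linalg.norm(diff))

d = hash_distance([6, 5, 7, 2], [1, 4, 8, 3])   # sqrt(25 + 1 + 1 + 1)
```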

To examine robustness of the proposed hashing scheme, 100 different videos are selected from an open video database [

Some sample videos.

Settings of eleven operations.

Operation | Parameter | Value | Number
---|---|---|---
Brightness adjustment | Photoshop’s scale | −20, −15, −10, −5, 5, 10, 15, 20 | 8
Contrast adjustment | Photoshop’s scale | −20, −10, 10, 20 | 4
3 × 3 Gaussian low-pass filtering | Standard deviation | 0.1, 0.2, …, 1 | 10
Salt and pepper noise | Density | 0.001, 0.002, …, 0.01 | 10
AWGN | Signal-to-noise ratio | 1, 2, 3, 4, 5, 6 | 6
MPEG-2 compression | Kilobits per second | 100, 200, …, 1000 | 10
MPEG-4 compression | Compression quality | 10, 20, …, 100 | 10
Random frame dropping | Frame number | 1, 2, 5, 10, 15, 20 | 6
Random frame insertion | Frame number | 1, 2, 5, 10, 15, 20 | 6
Frame scaling | Ratio | 0.8, 0.85, 0.9, …, 1.2 | 8
Frame rotation | Angle (degree) | −1, −0.5, 0.5, 1 | 4
Total | | | 82

Hash distances under different kinds of operations are calculated. Figure

Means of L2 norms under different operations. (a) Brightness adjustment. (b) Contrast adjustment. (c) 3×3 Gaussian low-pass filtering. (d) Salt and pepper noise. (e) AWGN. (f) MPEG-2 compression. (g) MPEG-4 compression. (h) Random frame dropping. (i) Random frame insertion. (j) Frame scaling. (k) Frame rotation.

Statistical results of hash distances.

Operation | Max | Min | Mean | Standard deviation
---|---|---|---|---
Brightness adjustment | 29.39 | 0.00 | 7.01 | 0.58
Contrast adjustment | 26.57 | 0.00 | 8.08 | 0.21
3 × 3 Gaussian low-pass filtering | 21.73 | 0.00 | 6.60 | 1.41
Salt and pepper noise | 55.75 | 0.00 | 7.28 | 0.85
AWGN | 80.42 | 4.00 | 18.46 | 0.75
MPEG-2 compression | 57.34 | 2.00 | 10.89 | 1.61
MPEG-4 compression | 87.90 | 1.41 | 10.73 | 3.68
Random frame dropping | 73.93 | 4.69 | 28.45 | 3.42
Random frame insertion | 74.66 | 4.00 | 27.07 | 3.22
Frame scaling | 20.00 | 0.00 | 6.99 | 0.25
Frame rotation | 23.15 | 2.45 | 9.21 | 0.39

The dataset with 8300 videos mentioned in Section

L2 norm distribution of different video hashes.

Detection rates under different thresholds.

Threshold | Correct detection rate of similar videos (%) | False detection rate of different videos (%)
---|---|---
80 | 99.98 | 86.14
70 | 99.82 | 52.70
60 | 99.27 | 20.79
50 | 98.01 | 4.93
40 | 95.96 | 0.69
30 | 92.79 | 0.07

The selected dimension

In practice, the area under the ROC curve (AUC) is calculated for quantitative comparison. The minimum AUC is 0 and the maximum is 1; a curve with a larger AUC indicates better performance than one with a smaller AUC.
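As a rough illustration of the AUC computation, the trapezoidal rule can be applied to the six (false detection rate, correct detection rate) operating points from the detection-rate table in the discrimination experiment, appending the trivial ROC endpoints (0, 0) and (1, 1) as an assumption; the paper's actual ROC curves are computed over many more thresholds.

```python
import numpy as np

# (false positive rate, true positive rate) pairs as fractions,
# sorted by false positive rate, with trivial endpoints appended
fpr = np.array([0.0, 0.0007, 0.0069, 0.0493, 0.2079, 0.5270, 0.8614, 1.0])
tpr = np.array([0.0, 0.9279, 0.9596, 0.9801, 0.9927, 0.9982, 0.9998, 1.0])

# trapezoidal rule: area under the piecewise-linear ROC curve
auc = float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2))
```

Even this coarse six-point approximation lands close to 1, consistent with the strong separation reported in the experiments.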

In this experiment, the used parameters are

Curves of different dimensions.

The effect of the group number on the performances of robustness and discrimination is also discussed. The selected group numbers are

Curves of different group numbers.

To illustrate the superiority of the proposed hashing scheme, it is compared with several state-of-the-art hashing schemes. The selected schemes are the EOH-DCT hashing scheme [

As different hashing schemes use different distance metrics to measure hash similarity, it is impossible to directly present their robustness/discrimination similarity results in the same figure using a single distance metric. From the calculation of the ROC curve described in Section

Curves of different schemes.

The time of hash generation is also examined. To this end, the total time of generating the hashes of the 100 original videos is measured to determine the average time per video hash. The coding language is MATLAB, and the machine is a workstation with a 2.1 GHz CPU and 64.0 GB RAM. The average times are 7.24 seconds for the EOH-DCT hashing scheme, 18.45 seconds for the ST-NMF scheme, 6.72 seconds for the TWT-ARM scheme, 37.88 seconds for the LRSD-DWT scheme, and 7.03 seconds for the proposed scheme. Clearly, the proposed scheme is faster than the EOH-DCT, ST-NMF, and LRSD-DWT schemes, but slower than the TWT-ARM scheme. Hash lengths are also compared: 60 bits for the EOH-DCT scheme, 2048 bits for the ST-NMF scheme, 128 bits for the TWT-ARM scheme, 256 bits for the LRSD-DWT scheme, and 160 bits for the proposed scheme. In terms of hash length, the proposed scheme is better than the ST-NMF and LRSD-DWT schemes, but worse than the EOH-DCT and TWT-ARM schemes. The performances of these schemes are summarized in Table

Performance summary.

Scheme | AUC | Hash length (bit) | Time (s)
---|---|---|---
EOH-DCT hashing | 0.97706 | 60 | 7.24
ST-NMF hashing | 0.94710 | 2048 | 18.45
TWT-ARM hashing | 0.99425 | 128 | 6.72
LRSD-DWT hashing | 0.99076 | 256 | 37.88
Proposed hashing | 0.99508 | 160 | 7.03

A novel video hashing scheme based on MDS and OM has been proposed in this paper. In the proposed scheme, a high-dimensional matrix is constructed from gradient features in the DWT domain and then mapped to low-dimensional features via MDS. Since MDS preserves the original relationships of the high-dimensional data in the low-dimensional data, the learned low-dimensional features are discriminative and compact. In addition, the OM codes of the learned low-dimensional features are exploited to generate the video hash. As the OM codes are robust and discriminative features, the use of OM contributes to a short, robust, and discriminative video hash. Extensive experiments on 8300 videos have been performed to test the proposed scheme. The results reveal that the proposed scheme achieves good robustness and desirable discrimination, and comparisons demonstrate that it outperforms several state-of-the-art schemes in balancing robustness and discrimination. In the future, we will investigate video hashing schemes based on other useful techniques, such as deep learning and sparse representation.

The dataset used to support the findings of this work can be downloaded from the public website whose hyperlink is provided in this paper.

The authors declare that there are no conflicts of interest regarding the publication of this paper.

This work was partially supported by the National Natural Science Foundation of China (Grant nos. 61962008, 62062013, and 61762017), Guangxi “Bagui Scholar” Team for Innovation and Research, the Guangxi Talent Highland Project of Big Data Intelligence and Application, Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing, and the Innovation Project of Guangxi Graduate Education (Grant no. XYCSZ2021009).