Discriminative Codebook Hashing for Supervised Video Retrieval

In recent years, hash learning has received increasing attention in supervised video retrieval. However, most existing supervised video hashing approaches design hash functions based on pairwise similarity or triplet relationships and focus on local information, which results in low retrieval accuracy. In this work, we propose a novel supervised framework called discriminative codebook hashing (DCH) for large-scale video retrieval. The proposed DCH encourages samples within the same category to converge to the same code word and maximizes the mutual distances among different categories. Specifically, we first construct the discriminative codebook using a predefined distance among code words and Bernoulli distributions that govern each hash bit. Then, we use the composite Kullback–Leibler (KL) divergence to align the neighborhood structures between the high-dimensional space and the Hamming space. The proposed DCH is optimized via the gradient descent algorithm. Experimental results on three widely used video datasets verify that our proposed DCH performs better than several state-of-the-art methods.


Introduction
With the growing popularity of smartphones, the amount of video data has grown explosively [1][2][3]. For example, TikTok has over 400 million daily active users who upload approximately 2,000 videos every minute, and YouTube receives a total of 100 hours of video per minute [4][5][6]. Owing to the low storage cost and high efficiency of binary codes, hash-based methods have been widely applied to visual retrieval tasks [7][8][9][10][11][12][13].
Previous hash-related work [14] mainly focused on image hashing and can be divided into data-independent and data-dependent methods. Data-independent approaches learn binary codes through random space projection, without using information from the data. The most representative algorithm is locality-sensitive hashing (LSH) [15], which generates a large amount of redundant information through random mapping and obtains satisfactory performance only with long hash codes. Data-dependent hash methods [16][17][18], which can be further divided into unsupervised hashing and supervised hashing, are proposed to generate more efficient hash codes by maintaining the neighborhood structure between data. For example, Gong et al. [19] proposed iterative quantization hashing (ITQ), which minimizes the quantization error by rotating data projected with principal component analysis (PCA). Spectral hashing (SH) [20] assumes that the data obey a uniform distribution and divides the data according to the main direction of the data stream. Density sensitive hashing (DSH) [21] extends LSH by studying structural information. Zhang et al. [22] developed a convergence-preserving parametric learning algorithm, called latent factor hashing (LFH), to learn similarity-preserving binary codes based on latent factor models. Liu et al. [23] proposed kernel supervised hashing (KSH), which applies kernel-based formulas to accommodate linearly inseparable data, and designed a greedy algorithm to solve the hash function optimization problem.
In recent years, hashing methods proposed for video retrieval have also received extensive attention [24][25][26][27][28][29][30][31]. They fall into two categories: machine learning methods and deep hashing. Machine learning methods, resembling image hashing approaches, learn binary codes of video keyframes based on low-level handcrafted features and then calculate video hash codes via averaging. Wu et al. [4] performed video hashing by using color histograms to obtain global features. This is the first application of hash learning in the video field. Multiple-feature hashing (MFH) [32] adopts a weight-based method to combine different features. Ye et al. [33] used video structural information in the supervised learning paradigm to obtain the optimal binary codes. Stochastic multiview hashing (SMVH) [34] separately calculates the probability similarity matrices of video frames in the feature space and the Hamming space and then minimizes the difference between the two probability matrices using the KL divergence. Nie et al. [35] defined joint multiview hashing (JMVH), which maximizes the interclass distance and minimizes the intraclass distance to preserve the global structure and local structure with multiple features. Boosting temporal video hashing (BTVH) [36] studies the multitable learning problem to boost the performance and captures the inherent similarity of videos from both visual and temporal perspectives. In addition, some researchers in recent years have used deep networks to obtain the temporal and spatial information between keyframes. For instance, central similarity quantization (CSQ) [37] learns the temporal information by using 3D convolutional neural networks and proposes a concept called the hash center to enhance the central similarity.
However, most existing video hashing approaches suffer from the following problems. (1) Low discriminability among different categories: functions based on pairwise similarity or triplet relationships only consider local information, so they preserve the information of similar samples well but distinguish samples from different categories poorly. (2) Poor performance in real-world scenarios: in real application scenarios, similar data often account for only a small proportion and most samples are not similar, which leads to low efficiency when the data are imbalanced [37]. (3) Greater time costs of deep learning: deep learning frameworks are time-consuming to train, and the spatiotemporal information extracted by the network does not yield a significant performance gain. Hence, these video hashing functions cannot learn discriminative hash codes to enhance the performance.
To solve the above problems, we propose a novel framework for supervised video retrieval, called discriminative codebook hashing (DCH), which considers the global structure when constructing the hash function. DCH encourages samples within the same category to converge to the same code word and maximizes the mutual distances between different categories. Specifically, the discriminative codebook is first generated based on two properties: a predefined distance between code words and Bernoulli distributions that ensure each hash bit stores more information. Then, the composite KL divergence is used to preserve the similarity matrix between the feature space and the Hamming space. Finally, the gradient descent algorithm is used for optimization. In this way, we obtain discriminative binary codes for video retrieval. Figure 1 shows the framework of DCH. The proposed method has the following innovations:
(i) We propose the discriminative codebook based on the predefined distance between code words and Bernoulli distributions, which ensure that each hash bit stores more information.
(ii) We propose the DCH method, which maximizes the distance between the code words generated by the predefined codebook to learn discriminative binary codes for supervised video retrieval.
(iii) We verify the proposed method by experimenting on three widely used datasets, which shows that DCH achieves a significant improvement over several state-of-the-art methods.
The other sections are organized as follows. Section 2 introduces some preliminary work. Section 3 introduces the proposed discriminative codebook hashing in detail. The experimental work is presented in Section 4, and the conclusion is given in Section 5.

Preliminary Work
In this section, we briefly introduce the preliminary work, namely, stochastic multiview hashing [34]. It is a supervised video retrieval method that aims to preserve the similarity structure from the original space to the Hamming space.
Let $X = \{x_i\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{1 \times d}$, $n$ is the number of keyframes, and $d$ is the dimension of each keyframe. $Z = \{z_i\}_{i=1}^{n}$ represents the corresponding binary codes of the keyframes, where $z_i \in \mathbb{R}^{1 \times l}$. The conversion between these variables involves the following quantities: $Z \in \mathbb{R}^{n \times l}$ is the intermediate result of the linear projection, $b \in \mathbb{R}^{l}$ is a bias parameter, $W \in \mathbb{R}^{d \times l}$ is the projection matrix, $\mathrm{Ind}_i$ is the set of frames, and $|\mathrm{Ind}_i|$ is the number of samples in the set. The high-dimensional keyframe feature matrix $X$ is first projected into the lower-dimensional matrix $Z$. Then, the sigmoid function is used to map the variable to a value between 0 and 1. Finally, a thresholding function is used to change the data into a binary code, with $T(y) = 0$ if $y < 0.5$ and $T(y) = 1$ otherwise.

SMVH preserves the similarity matrix between the feature space and the Hamming space using a composite KL divergence measure. In particular, it separately calculates the similarity probability matrix $P$ in the original space and the pairwise similarity matrix $Q$ among samples in the Hamming space. Then, the KL divergence is used to examine how well the two probability matrices $P$ and $Q$ match. Therefore, the objective function of SMVH combines the composite KL divergence $S_{KL}(W, b)$ with a regularization term, where $\mu > 0$ controls the weight of the regularization term to prevent overfitting. In $S_{KL}(W, b)$, $0 \le \lambda \le 1$ controls the influence of the composite KL divergence, $P = \{p_i\}_{i=1}^{n} \in \mathbb{R}^{n \times n}$ is the similarity structure based on $X$, and $Q = \{q_i\}_{i=1}^{n} \in \mathbb{R}^{n \times n}$ is another probability matrix preserving the similarity information of $Z$ in the Hamming space. In addition, the KL divergence between $p_i$ and $q_i$ takes the standard form $\mathrm{KL}(p_i \,\|\, q_i) = \sum_{j} p_{j|i} \log (p_{j|i} / q_{j|i})$, where $p_{j|i}$ is a conditional probability that reflects the similarity between $x_i$ and $x_j$, and the conditional probability $q_{j|i}$ represents the probability of returning $z_j$ given the query $z_i$.
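To make the encoding steps described above concrete, the following is a minimal Python sketch, not the reference SMVH implementation. It assumes that each keyframe is projected as xW + b, passed through the sigmoid, averaged over the frames of one video, and then thresholded at 0.5. The order of the sigmoid and the averaging step, as well as the helper names `project` and `video_hash`, are assumptions made for illustration.

```python
import math


def sigmoid(v):
    """Logistic function mapping a real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-v))


def project(x, W, b):
    """Linear projection z = xW + b for one keyframe feature x of length d.

    W is a d x l nested list and b is a list of length l.
    """
    return [sum(x[k] * W[k][j] for k in range(len(x))) + b[j]
            for j in range(len(b))]


def video_hash(frames, W, b):
    """Hash one video from the features of its keyframes (hypothetical helper)."""
    # Sigmoid of each projected keyframe, so every entry lies in (0, 1).
    activations = [[sigmoid(z) for z in project(x, W, b)] for x in frames]
    # Assumption: frame-level outputs are averaged over the frame set of the video.
    y = [sum(a[j] for a in activations) / len(activations)
         for j in range(len(b))]
    # Thresholding T(y): 0 if y < 0.5 and 1 otherwise.
    return [0 if v < 0.5 else 1 for v in y]


# Toy example: two keyframes with d = 2 features, code length l = 2.
frames = [[1.0, 0.0], [0.5, 0.5]]
W = [[2.0, -2.0], [1.0, -1.0]]
b = [0.0, 0.0]
code = video_hash(frames, W, b)
```

In this toy example the first bit is driven above 0.5 and the second below it, so the resulting code has one bit set and one bit cleared.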

Discriminative Codebook Hashing
In this section, we present the proposed DCH in detail through four parts, including the proposed discriminative codebook, the objective function, algorithmic optimization, and complexity analysis.

Discriminative Codebook.
Motivated by CSQ [37], we propose a novel and discriminative codebook $C = \{c_i\}_{i=1}^{m}$ for supervised video retrieval, where $c_i \in \{0, 1\}^{1 \times l}$ is the code word of the $i$th category. The proposed codebook is defined according to two properties. The first is that the value in the same bit of different code words obeys a Bernoulli distribution. Specifically, the proportions of 0 and 1 in the same bit across different categories are both 50%, that is, $c_{\cdot i}$ has a 50% probability of being 0 or 1, which maximizes the entropy and stores more information in each bit. The second is a constraint on the mutual distances among code words, given in equation (7), where $D_H$ is the Hamming distance between code words $c_i$ and $c_j$, $l$ is the length of the binary codes, and $f$ represents the fault tolerance. Under the constraint of equation (7), the mutual distance between code words is the largest.
Overall, the proposed codebook encourages samples within the same category to converge to the same code word and maximizes the mutual distance between different categories. Therefore, the proposed codebook can preserve global structures and help generate discriminative binary codes for video retrieval. The scheme of the proposed discriminative codebook is presented in Algorithm 1.
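The codebook construction described above can be sketched as a simple rejection-sampling loop. This is an illustrative sketch under stated assumptions rather than Algorithm 1 itself: each bit is drawn from a Bernoulli(0.5) distribution, and a candidate codebook is accepted once every pair of code words is at least `min_dist` apart in Hamming distance. The parameter `min_dist` is a hypothetical stand-in for the bound in equation (7), which depends on the code length and the fault tolerance, and `generate_codebook` is a name chosen here.

```python
import random


def hamming(a, b):
    """Hamming distance between two equal-length bit lists."""
    return sum(x != y for x, y in zip(a, b))


def generate_codebook(m, l, min_dist, max_iters, seed=0):
    """Sample m code words of length l with pairwise distance >= min_dist.

    min_dist is an assumed placeholder for the bound in equation (7).
    Returns None if no valid codebook is found within max_iters attempts.
    """
    rng = random.Random(seed)
    for _ in range(max_iters):
        # Each bit is 0 or 1 with probability 0.5 (Bernoulli distribution).
        C = [[rng.randint(0, 1) for _ in range(l)] for _ in range(m)]
        # Accept the candidate only if all pairs of rows are far enough apart.
        if all(hamming(C[i], C[j]) >= min_dist
               for i in range(m) for j in range(i + 1, m)):
            return C
    return None


codebook = generate_codebook(m=4, l=32, min_dist=8, max_iters=1000)
```

Independent sampling only gives the 50% proportion of zeros and ones per bit in expectation; an exact balance would need an additional check, which is omitted here.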

Objective Function.
According to the proposed discriminative codebook $C$, we expand each row of the codebook matrix $C$ into $R = \{r_i\}_{i=1}^{n}$ according to the number of samples, where $r_i \in \mathbb{R}^{1 \times l}$. The detailed generation process of $R$ is shown in Algorithm 2. We then minimize the error between the binary codes and the predefined codebook.

Figure 1: The framework of DCH. We divide the entire experiment into two steps, namely, offline learning and online retrieval. In the offline phase, we combine keyframe features and the predefined codebook to learn hash functions. In the online phase, we map the query video into a set of binary codes through the hash functions. Next, we use the exclusive or (XOR) operation to obtain the Hamming distance between the query video and the samples in the database. Finally, we take the videos with the shortest Hamming distance as the video retrieval results.
Specifically, for each $z_i \in Z$, we take $r_i$ as the code word of $z_i$, so that samples in the same category share the same code word and samples in different categories have discriminative binary codes.
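As a small illustration of this expansion, the sketch below repeats each code word once per sample of its category and measures how far a set of relaxed codes is from the targets. It assumes that the samples are ordered by category and that the error is a sum of squared differences; both are assumptions, and the function names are chosen here for illustration.

```python
def expand_codebook(C, counts):
    """Repeat row i of the codebook counts[i] times to build R.

    Assumes the samples are sorted by category, so the first counts[0]
    rows of R belong to category 0, the next counts[1] to category 1, etc.
    """
    R = []
    for code_word, n_i in zip(C, counts):
        R.extend(list(code_word) for _ in range(n_i))
    return R


def codebook_error(Y, R):
    """Sum of squared differences between codes Y and targets R (assumed form)."""
    return sum((y - r) ** 2
               for y_row, r_row in zip(Y, R)
               for y, r in zip(y_row, r_row))


C = [[0, 1], [1, 0]]          # two categories, code length 2
R = expand_codebook(C, [2, 1])  # two samples in category 0, one in category 1
Y = [[0.0, 1.0], [0.5, 1.0], [1.0, 0.0]]
err = codebook_error(Y, R)
```

Here only the second sample deviates from its code word, so the error comes entirely from its first bit.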
To preserve the similarity matrix between the feature space and the Hamming space, we combine the composite KL divergence with our proposed codebook to construct the overall objective function of DCH, equation (9). In this objective, $c$ controls the weight of the error loss between the codebook and the learned hash codes, and the second term of equation (9) aligns the binary codes with their corresponding code words.
In this way, our proposed DCH can solve the problem that other algorithms only consider the pairwise relationships and ensure that samples in the same category share the same code word. Furthermore, DCH maximizes the mutual distances between different categories and then obtains discriminative binary codes.
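The structure of such an objective can be sketched in Python as follows. This is a hedged illustration and not the exact form of equation (9): it assumes that the composite KL divergence is the λ-weighted mix of KL(P‖Q) and KL(Q‖P) summed over rows, that the codebook term is a squared error weighted by `c_weight`, and that the regularizer is the squared Frobenius norm of W weighted by `mu`. All function and parameter names are chosen here for illustration.

```python
import math


def kl(p, q, eps=1e-12):
    """KL divergence between two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def composite_kl(P, Q, lam):
    """Assumed composite form: lam * KL(p||q) + (1 - lam) * KL(q||p) per row."""
    return sum(lam * kl(p, q) + (1.0 - lam) * kl(q, p)
               for p, q in zip(P, Q))


def dch_objective(P, Q, Y, R, W, lam, mu, c_weight):
    """Sum of the three assumed parts of a DCH-style objective."""
    kl_term = composite_kl(P, Q, lam)
    # Codebook term: squared error between relaxed codes Y and targets R.
    code_term = c_weight * sum((y - r) ** 2
                               for y_row, r_row in zip(Y, R)
                               for y, r in zip(y_row, r_row))
    # Regularization term on the projection matrix.
    reg_term = mu * sum(w * w for row in W for w in row)
    return kl_term + code_term + reg_term


P = [[0.7, 0.3], [0.4, 0.6]]
value = dch_objective(P, P, [[1.0, 0.0]], [[1, 0]], [[0.0]],
                      lam=0.9, mu=0.01, c_weight=1.0)
```

With identical similarity matrices, codes that match their code words, and a zero projection matrix, every term vanishes, which is a quick sanity check of the sketch.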

Algorithmic Optimization.
The optimization problem has two main variables: $W$ and $b$. We use the gradient descent algorithm to find good solutions. To simplify the notation, we split the objective function in equation (9) into three parts, $\Phi_1(W, b)$, $\Phi_2(W, b)$, and $\Phi_3(W)$. The detailed optimization procedure is presented as follows.

ALGORITHM 1: Discriminative codebook.
Input: the number of categories $m$; the number of samples per category $n_i$; code length $l$; maximum number of iterations $T_c$; fault tolerance rate $f$.
...
(6) if any two rows of $C$ satisfy equation (7)
(7) break
(8) end
(9) end

ALGORITHM 2
Input: training data $X \in \mathbb{R}^{n \times d}$; codebook $C \in \mathbb{R}^{m \times l}$; maximum number of iterations $T$; code length $l$; parameters $\lambda$, $\mu$, $c$; learning rate $\alpha$.
Output: hash codes $H \in \{0, 1\}^{n_v \times l}$.
(1) Initialization: initialize the projection matrix $W$ and the bias $b$ as a random matrix and a random vector.
(2) Generating $R$ according to the number of samples:
...

W-Step: the corresponding problem is to minimize the loss function with respect to $W$. To compute the optimal $W$, the derivative of each of the three parts is required. The derivative of $\Phi_1(W, b)$ w.r.t. $W$ is computed from $\partial \Phi_1(W, b)/\partial z_{ik}$ and $\partial z_{ik}/\partial w_{kj}$. Following the norm derivation rule, $\partial \Phi_2(W, b)/\partial W$ is obtained using the operator $\odot$, which indicates that the elements in the same position of two matrices are multiplied. The derivative $\partial \Phi_3(W)/\partial W$ is computed directly.
b-Step: the subproblem of $b$ is to minimize the loss function with respect to $b$. The derivative w.r.t. $b$, equation (18), combines the derivative $\partial \Phi_1(W, b)/\partial b$ with a second term. Algorithm 2 describes the overall optimization process of the proposed DCH.
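To illustrate how the elementwise product arises in the derivative of the codebook term, the following Python sketch computes the gradients of one assumed form of that term and applies a single gradient descent update. It assumes that the term is `c_weight` times the squared error between Y = sigmoid(XW + b) and the expanded codebook R, without frame averaging, so it is a simplified stand-in for the paper's derivation rather than a reproduction of it. The learning rate `alpha` and all function names are chosen here for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(X, W, b):
    """Y = sigmoid(XW + b), computed row by row."""
    return [[sigmoid(sum(x[k] * W[k][j] for k in range(len(x))) + b[j])
             for j in range(len(b))] for x in X]


def codebook_loss(X, W, b, R, c_weight):
    """Assumed codebook term: c_weight * sum of squared (Y - R)."""
    Y = forward(X, W, b)
    return c_weight * sum((y - r) ** 2
                          for y_row, r_row in zip(Y, R)
                          for y, r in zip(y_row, r_row))


def codebook_grads(X, W, b, R, c_weight):
    """Gradients of the assumed codebook term w.r.t. W and b."""
    Y = forward(X, W, b)
    # G = 2c (Y - R) * Y * (1 - Y), all products taken elementwise;
    # the factor Y * (1 - Y) is the derivative of the sigmoid.
    G = [[2.0 * c_weight * (y - r) * y * (1.0 - y)
          for y, r in zip(y_row, r_row)] for y_row, r_row in zip(Y, R)]
    n, d, l = len(X), len(W), len(b)
    grad_W = [[sum(X[i][k] * G[i][j] for i in range(n)) for j in range(l)]
              for k in range(d)]  # equals X^T G
    grad_b = [sum(G[i][j] for i in range(n)) for j in range(l)]
    return grad_W, grad_b


def gradient_step(W, b, grad_W, grad_b, alpha):
    """One plain gradient descent update with learning rate alpha."""
    W_new = [[w - alpha * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, grad_W)]
    b_new = [v - alpha * g for v, g in zip(b, grad_b)]
    return W_new, b_new
```

A finite-difference comparison on a few entries of W and b is a convenient way to check such a gradient before plugging it into the full training loop.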

Complexity Analysis.
The time complexity of the entire training process of SMVH [34] is approximately $O(Tn^3 + n^2)$, and the proposed DCH algorithm adds two time-consuming parts on top of this. The first is the learning process of $C$, with a time complexity of $O(T_c l)$. The second is the optimization of equations (15) and (21), which together cost $O(dnl)$ in each iteration. Therefore, the overall time complexity of DCH is $O(n^2 + T_c l + T(n^3 + dnl))$. In this work, $O(T_c l)$ and $O(dnl)$ can be ignored because $T_c, l, d \ll n$, so the complexity is approximately $O(Tn^3 + n^2)$. Additionally, the calculation of the hash codes is a linear projection with a time complexity of approximately $O(1)$, and the online search can be performed by XOR operations. Although the proposed algorithm adds a constraint to SMVH, the time complexity is directly affected by the maximum number of iterations $T$, and the subsequent experiments show that DCH converges in fewer iterations. Thus, the time complexity of DCH is in a reasonable range.
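The XOR-based online search mentioned above can be sketched in a few lines of Python. The sketch packs each binary code into an integer, computes the Hamming distance as the number of set bits in the XOR of two codes, and ranks the database by that distance. The function names and the toy database are illustrative and do not come from the paper.

```python
def pack_bits(code):
    """Pack a list of 0/1 bits into a single integer."""
    value = 0
    for bit in code:
        value = (value << 1) | bit
    return value


def hamming_xor(a, b):
    """Hamming distance between two packed codes via XOR and bit counting."""
    return bin(a ^ b).count("1")


def retrieve(query, database, top_k):
    """Return the indices of the top_k database codes closest to the query."""
    q = pack_bits(query)
    packed = [pack_bits(code) for code in database]
    ranked = sorted(range(len(packed)), key=lambda i: hamming_xor(q, packed[i]))
    return ranked[:top_k]


database = [[0, 1, 1, 0], [1, 1, 1, 1], [0, 1, 0, 0]]
nearest = retrieve([0, 1, 1, 0], database, top_k=2)
```

In practice the database codes would be packed once offline, so that each query only costs one XOR and one bit count per stored video.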

Experiments
In this section, we first introduce the datasets used in this paper and then the baselines and some experimental details. Finally, we present the experimental results.

Datasets.
CC_WEB_VIDEO [4] is the most widely used dataset in near-duplicate video retrieval (NDVR) research and contains data from YouTube, Google, and Yahoo. There are 12,877 videos divided into 24 sets, and keyframes are extracted by a uniform sampling method to represent each video. Since some videos do not have label information, we take the 3,482 videos with labels as the experimental dataset. In each category, we select 70% of the video data as the training set and the remainder as the testing set. We uniformly extract 10 keyframes for each video and use the pretrained VGG-19 network to extract 4096-dimensional features to represent the keyframes.
HMDB51 [38] contains 6,766 human action videos selected from movies and other public sources such as YouTube. The dataset is divided into 51 categories, each of which includes approximately 100 clips. In each category, we randomly select 45 video samples. Of these, 25 videos are added to the training set and the rest are assigned to the testing set. We uniformly extract 10 keyframes for each video, and the pretrained VGG-19 network is used to extract the 4096-dimensional deep features.
UCF101 [39] contains 13,320 videos divided into 101 human behavior categories, such as sports, instruments, and character interactions, and is used for action recognition. We randomly select 70 videos in each category for the training set and 30 videos for the testing set. For each video, 10 keyframes are uniformly selected to represent the video. We use VGG-19 to extract the 4096-dimensional features for each keyframe.

Baselines.
Several state-of-the-art hashing methods, including ITQ [19], SH [20], DSH [21], LFH [22], KSH [23], JMVH [35], and SMVH [34], are used for comparison. Among these methods, ITQ, SH, and DSH are unsupervised hashing methods, while LFH, KSH, JMVH, and SMVH are supervised hashing methods. For the comparative test, we use the published source code to conduct the experiments. JMVH and SMVH can also be used for multiview video retrieval, but in this paper, we only test them as single-view methods. All the experimental results are obtained in MATLAB R2016a on the same computer with an Intel Core i7-6700 CPU @ 3.40 GHz, 72 GB RAM, and the 64-bit Windows 10 operating system.

Evaluation Metrics.
We use four popular evaluation metrics to comprehensively evaluate the experimental results. The mean average precision (mAP) is widely used in the retrieval field. The higher the mAP score is, the better the retrieval performance of the method is. The precision@K curve represents the precision versus the first K retrieved samples, where precision is the proportion of correctly retrieved videos among all retrieved videos. The recall@K curve represents the average recall rate versus the first K retrieved samples, where recall is the proportion of correctly retrieved videos among all near-duplicate video samples. The precision-recall (PR) curve is an index used to evaluate reliability and is widely used in the fields of medicine and machine learning.
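For reference, the metrics above can be computed from a ranked list of relevance flags, as in the following Python sketch. It assumes that each query yields a list of 0/1 flags ordered by increasing Hamming distance and that average precision is normalized by the number of relevant items appearing in that list; the paper's evaluation scripts may differ in such details, and the function names are chosen here.

```python
def precision_at_k(relevant, k):
    """Fraction of the first k retrieved items that are relevant."""
    return sum(relevant[:k]) / k


def recall_at_k(relevant, k, total_relevant):
    """Fraction of all relevant items that appear in the first k results."""
    return sum(relevant[:k]) / total_relevant


def average_precision(relevant):
    """Mean of precision values taken at the rank of each relevant item."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


ranking = [1, 0, 1, 0]  # relevance flags of the results for one query
p_at_2 = precision_at_k(ranking, 2)
r_at_2 = recall_at_k(ranking, 2, total_relevant=2)
ap = average_precision(ranking)
```

Sweeping K over the ranked list and recording precision against recall at each cutoff gives the points of the PR curve.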

Parameter Selection.
We have three model parameters, $\lambda$, $\mu$, and $c$, as well as the number of iterations $T$. Following SMVH [34], we set $\lambda = 0.9$ and $\mu = 0.01$. As shown in Figure 2(a), when $c$ is in the range of 0.05 to 1, the results are stable across the three datasets. Therefore, we empirically choose $c = 1$ in our proposed model. The maximum number of iterations $T$ determines both the training time cost and the performance, so it is worth discussing. Figure 2(b) shows the effect of the maximum number of iterations $T$, in the range of 100 to 1400, on the mAP performance. For HMDB51, the best mAP is reached at $T = 800$, after which it decreases. However, on the other two datasets, $T = 800$ does not give the optimal result. Therefore, after comprehensive consideration, $T = 1000$ is used as the final parameter setting. Table 1 shows the mAP results for different lengths of hash codes on the three datasets, and the results of the other evaluation metrics are shown in Figures 3-5. We give a detailed analysis of the results on the three datasets in the following parts.

Results and Discussion.
According to Table 1, for the CC_WEB_VIDEO dataset, the mAPs are very high because the dataset consists of movie clips, and videos of the same category are near-duplicate videos. As shown in Table 1, the performance of the proposed DCH is at least 1.85% better than that of the other methods from 32 to 64 bits. When the code length is 96 bits, the mAP of DCH is slightly lower than that of LFH. As shown in Figure 3, the results of our method in precision@K and recall@K are equal to or slightly higher than those of most other methods. Besides, as the code length increases, the performance of our proposed DCH gradually surpasses that of the other methods. Figures 3(i)-3(l) show that the area enclosed by the DCH curve gradually increases. Table 1 shows that our proposed DCH performs better than the other hash methods in most cases on the HMDB51 dataset. Although the mAP of JMVH exceeds that of DCH by 2.39% at 32 bits, the mAPs of our proposed DCH are better than those of the other comparison methods at the longer code lengths. Figure 4 shows that when the length of the hash codes is larger than 32 bits, DCH performs excellently compared with the other methods on the precision@K curve, the recall@K curve, and the PR curve.
For the UCF101 dataset, DCH obtains the best experimental results at 32, 48, and 64 bits. It is worth noting that the UCF101 dataset is relatively large, and SMVH cannot obtain discriminative video hash codes when the hash code length is very small. Therefore, SMVH has no experimental results available for $l = 32$ and $l = 48$. As shown in Figure 5, the performance of DCH is much higher than that of the other methods, except JMVH. We can see that the recall rate of DCH for positive samples is slightly lower than that of JMVH.

Conclusion
In this paper, we propose a novel supervised video hashing framework, termed discriminative codebook hashing, which can generate discriminative binary codes for video retrieval. The proposed DCH encourages samples within the same category to converge to the same code word and maximizes the mutual distances between different categories. Specifically, we generate a discriminative codebook to distinguish between samples of different categories more accurately. Extensive experimental results show that the performance of DCH is significantly improved compared to several state-of-the-art methods. In future work, we will use a smaller matrix to store the similarity information between samples to avoid consuming considerable training time and space when the amount of data is large. This will improve the performance of the model while reducing the time complexity.

Conflicts of Interest
The authors declare that there are no conflicts of interest in the publication of this paper.