A Cross-Modal Image and Text Retrieval Method Based on Efficient Feature Extraction and Interactive Learning CAE

In view of the complexity of the multimodal environment and the inability of existing shallow network structures to achieve high-precision image and text retrieval, a cross-modal image and text retrieval method combining efficient feature extraction and an interactive learning convolutional autoencoder (CAE) is proposed. First, the residual network convolution kernel is improved by incorporating two-dimensional principal component analysis (2DPCA) to extract image features, and text features are extracted through long short-term memory (LSTM) and word vectors, so that image and text features are obtained efficiently. Then, based on the interactive learning CAE, cross-modal retrieval of images and text is realized: the image and text features are input to the two input terminals of the dual-modal CAE, and the image-text relationship model is obtained through the interactive learning of the middle layer. Finally, the proposed method is experimentally demonstrated on the Flickr30K, MSCOCO, and Pascal VOC 2007 datasets. The results show that the proposed method can complete accurate image retrieval and text retrieval. Moreover, the mean average precision (MAP) exceeds 0.3, and the area under the precision-recall (PR) curves is larger than that of the other comparison methods, indicating that the method is applicable.


Introduction
With the advancement of digitalization, more and more people use the Internet to obtain the information they need. How to let users search for that information accurately and quickly has become a hot issue [1]. In the era of mobile Internet, each of us receives massive amounts of information from the Internet while at the same time generating massive amounts of multimedia information, that is, multimodal data [2]. The original form of cross-modal retrieval is similar to that of single-mode retrieval. With the growth of multimodal data, it is more difficult for users to retrieve the information they are interested in efficiently and accurately [3].
There are many retrieval methods so far, most of which are based on a single modality, such as searching for articles by text or searching for pictures by pictures, or are multimodal only on the surface. In fact, such methods use search keywords to query and request the best-matching content among the many resources on the Internet.
In order to meet people's actual needs and provide better retrieval services, scholars are committed to research on methods and practice in the field of cross-modal retrieval. Therefore, the cross-modal retrieval method has a wide range of application scenarios and research significance. How to mine the effective information in these multimodal data is an important problem in the research field of cross-modal retrieval.
Researchers found a semantic gap between the low-level features of data and high-level semantics, and the data of different modalities are heterogeneous [4,5]. It can be seen that the core of cross-modal retrieval research is to mine the associated information between different modal data. How to mine this associated information has become the key to the research of cross-modal retrieval technology.
In recent years, with the rapid development of deep learning technology, people have become more capable of solving complex machine learning problems and have made great progress in analyzing and processing multimodal data [6]. Multimodal content analysis has broad application prospects in fields such as smart cities, smart homes, and smart transportation. Building on the breakthrough progress of deep learning in the monomodal field, deep learning has been applied to the theoretical research of cross-modal retrieval tasks and to technical practice at the same time [7]. Current cross-modal retrieval system modelling mainly solves two problems: the first is how to complete the unified mapping of different modal information features, and the second is how to maintain retrieval accuracy while improving the retrieval rate of the model [8]. These two problems are interdependent. Due to the diversity and heterogeneity of different modal information, the feature extraction method and the unified representation form of each modality become the key to solving the problem [9,10]. In addition, corpora with three or more modalities are less researched, and corpora with two modalities are more common, in particular corpora with aligned images and text.

Related Research
Because there is a huge heterogeneous gap in different modal data, how to effectively measure the content similarity of different modal data has become a major challenge [11]. Nowadays, many cross-modal retrieval methods have been proposed [12].

Real-Valued Cross-Modal Retrieval Method.
Cross-modal retrieval methods based on real-valued representation can generally be divided into two categories: canonical correlation analysis (CCA) and deep learning [13]. CCA uses different modal data to form sample pairs, learns a projection matrix, projects the different modal data to a common latent subspace, and then measures the similarity between modal data in that subspace [14]. Reference [15] proposed a multilabel kernel canonical correlation analysis (ml-KCCA) method for cross-modal retrieval, which uses the high-level semantic information reflected in multilabel annotations to enhance kernel CCA. Reference [16] proposed cross-media correlation learning with deep canonical correlation analysis (CMC-DCCA), which can better mine the complex correlation between cross-media data and achieve better cross-media retrieval performance. However, the performance of its feature extraction algorithm depends heavily on the size of the sample set, and it is difficult to obtain training samples for noncooperative targets in actual situations. How to set the parameter range efficiently still needs further exploration. The cross-modal retrieval method based on deep learning makes full use of the powerful feature extraction capabilities of deep learning models, learns the feature representation of different modal data, and then establishes semantic associations between modalities at a high level [17]. Reference [18] proposed a two-stage deep learning method for supervised cross-modal retrieval, extending traditional canonical correlation analysis from 2 views to 3 views and conducting supervised learning in two stages. The evaluation results on two publicly available datasets show that the proposed method has better performance. However, there is still room to improve the detection accuracy in complex retrieval environments. At present, the dimensionality of the features automatically extracted by representation learning models is relatively high. Particularly for cross-modal retrieval models based on deep learning, the sample feature dimension obtained in the representation stage is usually not less than 4096, and the final feature dimension is still too high [19]. Reference [20] proposed an image retrieval method combining a deep Boltzmann machine (DBM) and a CNN to extract high-order semantic features of the image.

Cross-Modal Retrieval Method Based on Hash Transformation.
The cross-modal retrieval method based on real-valued representation suffers from time-consuming calculation and large storage requirements when facing large-scale data. Therefore, information retrieval methods based on hash transformation have appeared. Such a method is based on paired samples of different modal data, learns the corresponding hash transformation, maps the corresponding modal data features to the Hamming binary space, and then realizes faster cross-modal retrieval in this space [21]. The premise of hash transformation is that the hash codes of similar samples are also similar. Reference [22] proposed a method called DNDCMH. This algorithm uses binary vectors specifying the existence of specific facial attributes as input queries to retrieve relevant facial images from the database. In addition, dimension reduction methods such as principal component analysis (PCA) can reduce the feature dimension to a certain extent, but under the premise of maintaining the necessary retrieval accuracy, the achievable reduction is quite limited, and an efficient and reasonable retrieval mechanism that can adapt to large-scale image sets is lacking [23]. Reference [24] proposed a self-supervised deep multimodal hashing (SSDMH) method. However, cross-modal retrieval still only realizes the matching of image content and subject words, ignoring a large amount of content-based, subtle, and important image information [25]. Reference [26] proposed a deep hashing method that combines stacked convolutional autoencoders with hash learning and hierarchically maps the input image to a low-dimensional space. Some additional relaxation constraints are added to the objective function to optimize the hash algorithm. Experimental results on ultra-high-dimensional image datasets show that the method has good stability in cross-modal retrieval, but its detection timeliness needs to be optimized.
However, various models have their specific adaptation targets, advantages, and limitations. How to combine the advantages of models and various algorithms in practical applications to construct a universal cross-modal retrieval model is one of the urgent problems to be solved in the current cross-modal retrieval research.

Other Cross-Modal Retrieval Methods.
In addition to the above classical methods, there are some other methods. For example, Feng et al. [27] proposed a correspondence autoencoder (Corr-AE) model, which uses two autoencoder networks to encode image vectors and text vectors with each other to obtain two correlation loss terms for model training. Reference [28] proposed a retrieval method based on a multimodal semantic autoencoder. This method uses an encoder-decoder to learn projections and preserve feature and semantic information while ensuring the embedding. The 2-way net model proposed in [29] also applies the idea of the autoencoder and is optimized in more detail than Corr-AE. Reference [30] proposed an image-text matching method based on semantic concepts and order (SCO), which introduces a multilabel classification mechanism when retrieving images. Specifically, SCO performs a multilabel classification operation for each candidate image extracted by the target detection network so that each candidate image carries not only entity category information but also some attribute labels.
According to the above analysis, the existing methods have the following limitations. (1) In the CCA method, the single-mode feature representation of different data is extracted first, and then associated learning is carried out. This two-stage method cannot ensure that the extracted single-mode feature is the effective representation required by associated learning. (2) In the deep learning method, most networks use shallow networks to model the association learning part, ignoring the high-level semantic association between modes. (3) In the deep hashing method, some information is lost when the modal representation is converted to hash codes. Therefore, effective feature extraction and feature association learning are key to improving the accuracy of cross-modal retrieval. In order to achieve better association learning between different modal data, a cross-modal image and text retrieval method combining efficient feature extraction and an interactive learning convolutional autoencoder (CAE) is proposed in this paper. The innovations of the proposed method are as follows:
(1) Image feature extraction: The new convolution kernel constructed by 2DPCA is integrated into the image feature extraction based on the residual network, which avoids the complex operation of traditional PCA and reduces the dimension of image spatial features.
(2) Cross-modal CAE architecture: Based on the traditional multimodal CAE architecture, a feature association module (i.e., joint public representation) is integrated to associate the representations of each mode. This realizes interactive learning, makes the learned intermediate representation of each mode contain the association relationship between modes, and improves the accuracy of cross-modal retrieval.

Overall Framework.
In order to make full use of the complementary information of multimodal data, in the training stage the proposed method takes image data and text data as the input of the network at the same time, carries out interactive learning of image and text features through the multimodal CAE model, and generates the classification model of the retrieval system. In the test stage, the image or text features are input into the classification model for discrimination, and the retrieval results are obtained. The overall architecture of the proposed method is shown in Figure 1. The image branch uses the residual network as the image feature extractor and introduces two-dimensional principal component analysis (2DPCA) to construct a new convolution kernel. The text branch uses word2vec and a long short-term memory (LSTM) network as the text feature extractor. The network fusion layer is designed using the cross-modal CAE based on interactive learning, and the two modal data features are fused and sent to the next fully connected layer. In order to learn the nonlinear mapping from the image-text feature space to the semantic label space and to prevent overfitting, a Batch Norm layer and a ReLU layer are added to the fully connected layer. The output dimension of the final fully connected layer is consistent with the dimension of the real label. In this way, the proposed method takes full advantage of the complementary information of the two modalities, image data and text data.

Convolution Neural Network Is Used to Extract Image Features.
For the extraction of image features, a widely used residual network, which is well suited to image features, is selected. The network has five convolution stages, each of which has a corresponding pooling operation. An input image is processed through the successive convolution layers, and the size of the output image feature map is 7 × 7 × 2048, which can be processed according to the needs of subsequent machine learning tasks.
Image modal data have high dimensionality and rich content information. A deep convolutional neural network is therefore selected to extract effective visual monomodal representation features. Using $W_x$ to denote the model parameters of the entire embedding subnetwork, the input image modal data $X$ is mapped by this network to the feature output $h_x$.

Constructing a New Convolution Kernel.
The difference image $Z_i$ between each sample and the average image is computed first, and the required covariance matrix is formed from these difference images. The optimal projection subspace $U = [\eta_1, \eta_2, \ldots, \eta_d]$ can be constructed using the orthogonal eigenvectors corresponding to the first $d$ (largest) eigenvalues of the covariance matrix. Mapping the original image to the projection space yields the feature image $T_i = Z_i U$ after dimensionality reduction. The flow of the 2DPCA algorithm is shown in Figure 2.
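The 2DPCA steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the paper's implementation: it assumes the standard 2DPCA image covariance, $(1/N)\sum_i Z_i^{T} Z_i$, which is consistent with the projection $T_i = Z_i U$; the function name `two_d_pca` and the random input data are hypothetical.

```python
import numpy as np

def two_d_pca(images, d):
    """Sketch of 2DPCA. images: array of shape (N, m, n).

    Returns U with shape (n, d) and the projected features T with shape (N, m, d).
    """
    images = np.asarray(images, dtype=np.float64)
    mean_image = images.mean(axis=0)          # average image
    Z = images - mean_image                   # difference images Z_i
    # Assumed standard 2DPCA covariance: (1/N) * sum_i Z_i^T Z_i, shape (n, n)
    cov = np.einsum("imj,imk->jk", Z, Z) / images.shape[0]
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:d]     # indices of the d largest eigenvalues
    U = eigvecs[:, order]                     # orthonormal projection axes
    T = Z @ U                                 # feature images T_i = Z_i U
    return U, T

# Hypothetical data: 20 random "images" of size 8 x 6, reduced to 3 columns
rng = np.random.default_rng(0)
imgs = rng.normal(size=(20, 8, 6))
U, T = two_d_pca(imgs, d=3)
```

Because the covariance is only n × n (the image width), the eigendecomposition stays small compared with vectorizing each image as in classical PCA, which is the efficiency argument made in the text.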

Text Feature Extraction.
In the multimodal dataset used, the text modal data are mainly in the form of long text, so a reasonable representation that matches its characteristics is used for text feature extraction.
Short sentences: the text representation of short sentences is simpler than long sentences. It is represented by word vector (word2vec); that is, words are converted into vectors that can be accepted by machine learning tasks.
Long sentences: The representation of long sentences is more complicated because the words of a sentence are related to each other. The first word or first few words affect the understanding of the rest of the sentence, so the meaning of the sentence should be grasped as a whole. In order to retain the previous information in the text, the LSTM network is used. Each word in the sentence is first represented by a word vector, giving $Y = \{y_1, y_2, \ldots, y_c\}$, where $c$ is the number of words in the sentence, so each sentence is represented as a sequence of 300-dimensional word vectors.

Classical Convolutional Autoencoder (CAE).
An autoencoder (AE) is an unsupervised learning algorithm that makes the output close to the input by learning a data representation. An AE extracts data features through an encoder and then decodes the acquired features through a decoder to reconstruct the input data. CAE is based on the unsupervised AE and combines the convolution and pooling operations of CNN, making the encoder and decoder convolutional to achieve better feature extraction [31]. The single-layer CAE network model is shown in Figure 3. The coding part is composed of a convolutional layer and a maximum pooling layer.
Given $M_{C1}$ feature maps $I = \{I_1, I_2, \ldots, I_{M_{C1}}\}$, a set of $M_{C2}$ feature maps is obtained after the convolution operation, where $g_n(i, j)$ is the activation value at pixel $(i, j)$ in the activation map of the $n$-th channel and $a(\cdot)$ is a nonlinear activation function. The size of the filter is $F_{C2} = 2k + 1$. $F^{(1)}_n$ is the weight of the convolution filter in the encoding process, and the number of channels of each filter is the same as that of the input sample. $b^{(1)}_n$ is the offset of the encoder convolutional layer for the activation map of the $n$-th channel.
The convolutional layer of the encoding part outputs a feature map of size $((O_{C1} - F_{C2})/S_{C2} + 1)^2 \times M_{C2}$. After the maximum pooling operation, the final output of the encoding part is obtained. Here, $O_{C1} = (w - F_{C1} + S_{C1})/(S_{C1} F_{P1})$ is the output feature map size of the convolution module C1. The decoding process reconstructs the original image from the feature activation map. CAE is a fully convolutional network, so the decoding process is mainly realized through the deconvolution operation. Because the feature activation map obtained after encoding is smaller than the original image, the size of the original image cannot be recovered through the transposed convolution of the decoding process alone. Therefore, it is necessary to zero-pad the input feature map before decoding so that a reconstructed image with the same size as the original image can be obtained. The convolution output of the encoding part is used as the input of the decoder and is then convolved with the convolution filter $F^{(2)}$ to obtain the reconstructed image, where $G$ is the set of feature maps obtained by encoding and $b^{(2)}_n$ is the offset of the activation map of the $n$-th channel corresponding to the decoder deconvolution layer.
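The size bookkeeping in this passage can be checked with a short Python sketch. It assumes square inputs, no padding in the encoder, non-overlapping pooling (pooling stride equal to the pooling size $F_{P1}$), and integer division when sizes do not divide evenly; the function names and the example sizes (28, 5, 3) are hypothetical and not taken from the paper.

```python
def conv_output_size(w, f, s):
    """Side length after a convolution with filter size f and stride s (no padding)."""
    return (w - f) // s + 1

def conv_pool_output_size(w, f, s, p):
    """Side length after convolution followed by p x p non-overlapping pooling.

    Equivalent to (w - f + s) / (s * p), the form used for O_C1 in the text.
    """
    return conv_output_size(w, f, s) // p

# Hypothetical example: 28 x 28 input, 5 x 5 filter, stride 1, 2 x 2 pooling
o_c1 = conv_pool_output_size(w=28, f=5, s=1, p=2)
# Encoder convolution with a 3 x 3 filter (k = 1, so F_C2 = 2k + 1 = 3), stride 1
encoder_side = conv_output_size(o_c1, f=3, s=1)
print(o_c1, encoder_side)
```

Each valid convolution shrinks the map, which is why the text notes that the decoder input must be zero-padded before the transposed convolution can return an image of the original size.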

Cross-Modal CAE Based on Interactive Learning.
Different from the existing multimodal CAE models [32,33], while learning the representations of the different modes separately, this method generates an association between the representations of each mode through a feature association module (i.e., joint public representation) after the hidden layer, to realize interactive learning. Therefore, the intermediate representation of each mode contains the correlation between modes, which helps to improve the accuracy of cross-modal retrieval. The proposed dual-mode interactive learning CAE architecture is shown in Figure 4.
The input text and image data are passed through the convolution layer and the pooling layer, respectively, to obtain their data representations. Then, through an intermediate interaction layer, the feature representations of the text and image data are interactively learned to obtain a new joint public representation. The original input can be recovered by deconvolution of this feature data [34-36].
In order to train the dual-mode interactive learning CAE, it is necessary to construct the objective function in the training stage. In classical CAE training, the objective function is usually to minimize the reconstruction error. However, in the dual-mode interactive learning CAE model, the interactive learning between multimodal features is integrated to improve the accuracy of model retrieval. Therefore, the objective function needs to include the goal of maximizing the correlation between the two modal features in the hidden layer. The given input is $z_i = (x_i; y_i)$, where $z_i$ is the associated representation of the input views $x_i$ and $y_i$. Self-reconstruction loss and cross-reconstruction loss are defined on these inputs, where $g$ and $h$ are nonlinear functions, generally taken to be ReLU, $g(h(x^k_i))$ and $g(h(y^k_i))$ are the representations of the $k$-th intermediate hidden layer ($K = 2$), and $L$ is the error function. In the losses $L_2$ and $L_3$ (for cross reconstruction), the zero vector is used in place of the other view when computing from $x_i$ and $y_i$. Finally, in order to enhance the interaction between the two modal features, a correlation loss is added to the objective, where $h(X)$ and $h(Y)$ are the projections of the combined model (the projections of the joint public representation in Figure 4), $X$ and $Y$ are the representations of the two modal features, and $\lambda_k$ is the relative regularization hyperparameter used for each $k$-th intermediate encoding step (with $\lambda$ used similarly in the decoding stage). In the encoding process, a convolution layer and two intermediate layers ($K = 2$) are used. The values $\lambda_1 = 0.004$ and $\lambda_2 = 0.05$ in item $L_7$ and $\lambda = 0.02$ in item $L_6$ are set uniformly here. The correlation between the two views $h(X)$ and $h(Y)$ is computed from the centered hidden representations, where $\bar{h}(X)$ and $\bar{h}(Y)$ are the mean vectors of the hidden representations of the two views and $h(x_i)$ and $h(y_i)$ are the hidden layer representations of a single modal view.
All objective functions are integrated into a total objective function over the model parameters $\theta$. Minimizing the total objective minimizes the self-reconstruction and cross-reconstruction errors and maximizes the association between views.
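As a rough illustration of the correlation term, the sketch below computes a correlation between two batches of hidden representations from their centered values. It assumes the common form used in correlational autoencoders, in which the per-dimension Pearson correlation between $h(X)$ and $h(Y)$ is summed over hidden dimensions; the paper's exact expression is not reproduced here, and the function name `view_correlation`, the `eps` stabilizer, and the sample values are hypothetical.

```python
import numpy as np

def view_correlation(hx, hy, eps=1e-8):
    """Sum over hidden dimensions of the correlation between two views.

    hx, hy: arrays of shape (num_samples, hidden_dim) holding h(x_i) and h(y_i).
    """
    hx = np.asarray(hx, dtype=np.float64)
    hy = np.asarray(hy, dtype=np.float64)
    hx_c = hx - hx.mean(axis=0)               # subtract the mean vector of h(X)
    hy_c = hy - hy.mean(axis=0)               # subtract the mean vector of h(Y)
    num = (hx_c * hy_c).sum(axis=0)
    den = np.sqrt((hx_c ** 2).sum(axis=0) * (hy_c ** 2).sum(axis=0)) + eps
    return float((num / den).sum())

# Hypothetical hidden representations: hy is an affine function of hx,
# so each of the two dimensions is perfectly correlated.
hx = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]])
hy = 2.0 * hx + 1.0
corr = view_correlation(hx, hy)
```

In a training loop, a term of this kind would be subtracted from the reconstruction losses (scaled by a regularization weight) so that minimizing the total objective pushes the correlation up, matching the description of the total objective above.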

Experimental Dataset.
In order to verify the performance of the proposed method, the effectiveness of the method is verified on three commonly used real cross-modal graphic retrieval datasets: Flickr30K dataset, MSCOCO dataset, and Pascal VOC 2007 dataset.
(1) Flickr30K: The Flickr30K dataset contains 31,783 images with 158,915 English description sentences. That is, each image corresponds to 5 different description sentences. The sentence descriptions of these images were obtained through manual annotation. The Flickr30K dataset is divided into three parts: 1000 images and corresponding descriptions as the verification dataset, 1000 images and corresponding descriptions as the test dataset, and the remaining part as the training dataset.
(2) MSCOCO: The MSCOCO dataset contains 123,287 images, and each image also corresponds to 5 different description sentences. This dataset is divided into four parts: 82,783 images as the training dataset, 5000 images as the verification dataset, 5000 images as the test dataset, and 30,504 images as the reserved dataset.
(3) Pascal VOC 2007: The Pascal VOC 2007 dataset contains 5011 image-annotation pairs for training and 4952 image-annotation pairs for testing, all from the Flickr website. Each sample pair is labeled as one of 20 semantic categories. This dataset is randomly divided into three subsets: training set, test set, and validation set, which contain 800, 100, and 100 samples, respectively.
The experimental running environment is a PC configured with an Intel Core i7-7700 CPU and an Nvidia GTX1070Ti GPU with 8 GB of video memory. The deep learning framework used is PyTorch, and the development language is Python.

Performance Index and Comparison Method.
The evaluation indexes commonly used in the cross-modal retrieval field are selected to compare and analyse the proposed method: the mean average precision (MAP) and the precision-recall (PR) curve. MAP evaluates the experimental results through the positions of positive and negative samples in the search results. AP represents the average precision of each specific search, calculated as $AP = \frac{1}{N} \sum_{k=1}^{n} P(k)\,\varphi(k)$, where $N$ is the total number of search results that belong to the same semantic category as the query, $n$ is the number of all results returned by the search, $k$ is the position index in the search result sequence, $P(k)$ is the precision of the top $k$ results, and $\varphi(k)$ indicates whether the $k$-th search result and the query have the same semantic category (1 if they are the same and 0 if they are different). The value of MAP is the average of the AP values over multiple searches, $MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP(q)$, where $Q$ is the total number of searches. MAP@R denotes that, for a given query, the results are sorted by similarity and only the top $R$ results with the highest similarity are kept, and the precision of these $R$ results is averaged. The PR curve shows how the precision changes with the recall and is used as a performance evaluation index in cross-modal retrieval.
In the experiment, the three selected datasets have two modes: image and text. The proposed model is compared with the reference models on two retrieval tasks, namely, retrieving text with images and retrieving images with text. For example, when retrieving images based on text, the proposed method uses each text in the test set to query all images in the test set and obtains the retrieval result.
In order to verify the effectiveness of the proposed method, it is compared with two classical approaches: CCA and deep hashing. The corresponding methods are the multilabel kernel canonical correlation analysis (ml-KCCA) method proposed in [15] and the cross-modal hashing retrieval method (DNDCMH) proposed in [22]. In addition, in order to highlight the effectiveness of the interactive learning CAE model proposed in this paper, it is compared with another method based on the CAE model, namely the retrieval method based on the multimodal semantic autoencoder (SCAE) proposed in [28].
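The AP, MAP, and MAP@R definitions can be made concrete with a small pure-Python sketch. It takes, for each query, the ranked list of indicators $\varphi(k)$ (1 when the $k$-th result shares the query's semantic category) and treats $N$ as the number of relevant items among the returned results; the function names and the two example rankings are hypothetical.

```python
def average_precision(relevance):
    """AP for one query; relevance is the ranked list of 0/1 flags phi(k)."""
    n_relevant = sum(relevance)               # N: relevant results returned
    if n_relevant == 0:
        return 0.0
    hits = 0
    total = 0.0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k                 # P(k) * phi(k)
    return total / n_relevant

def mean_average_precision(rankings, r=None):
    """MAP over queries; if r is given, only the top r results are used (MAP@R)."""
    aps = [average_precision(rel[:r] if r else rel) for rel in rankings]
    return sum(aps) / len(aps)

# Two hypothetical queries with their ranked relevance flags
rankings = [[1, 0, 1, 0], [0, 1]]
ap_first = average_precision(rankings[0])     # (1/1 + 2/3) / 2
map_all = mean_average_precision(rankings)
map_at_2 = mean_average_precision(rankings, r=2)
```

Truncating each ranking to the top R results before averaging corresponds to the MAP@R setting used for the reported results (R = 50 in the performance comparison).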

Image-Text Retrieval Analysis.
The image-text retrieval results obtained by the proposed method and the retrieval method of [22] are shown in Table 1, which lists the texts retrieved for an image on the Flickr30K test set. The text in bold is correctly recalled text, and the text without bold is wrongly recalled text.
It can be seen from Table 1 that the proposed method has better retrieval results in terms of recall. Specifically, in the text retrieval task, the proposed method ranks the correct texts higher when searching by image. These visual examples illustrate the effectiveness of the proposed method more intuitively. In [22], DNDCMH is used to achieve text retrieval. Because its image feature extraction is less effective, fewer correct texts are retrieved.

Text-Image Retrieval Analysis.
In order to compare the performance of the proposed method and the comparison methods [15,22,28] in text-image retrieval, the word "car" is used as the query text to retrieve images on the Pascal VOC 2007 dataset. The top 5 images retrieved by the various methods are shown in Figure 5.
It can be seen from Figure 5 that, compared with the other methods, the retrieval results of the proposed method are more reasonable. Since the proposed method uses word2vec and an LSTM network for text feature extraction, the extraction effect is better. Therefore, the images retrieved through the interactive learning CAE network are more accurate.

Performance Comparison.
In order to demonstrate the retrieval performance of the proposed method in the three datasets, it is compared with the methods in [15,28] and [22].
The MAP values of the first 50 results of the four methods are shown in Table 2.
It can be seen from Table 2 that, in the two retrieval tasks of retrieving images by text and retrieving text by images, the proposed method significantly improves MAP on all three datasets compared with the other methods. Since the Pascal VOC 2007 dataset has the largest magnitude, the proposed method shows the most significant improvement on Pascal VOC 2007. On the three cross-modal image-text retrieval datasets Flickr30K, MSCOCO, and Pascal VOC 2007, the average MAP of the proposed method over the two retrieval tasks is 0.359, 0.334, and 0.309, respectively.
In addition, the PR curves of the different methods on the Flickr30K dataset for the two retrieval tasks of image retrieval and text retrieval are shown in Figure 6. The ordinate represents the precision, and the abscissa represents the recall. Similarly, the PR curves of the different methods for the two retrieval tasks on the MSCOCO and Pascal VOC 2007 datasets are shown in Figures 7 and 8, respectively.
It can be seen from Figure 6 that, whether retrieving text by image or retrieving images by text, the area under the PR curve of the proposed method is larger than that of the other comparison methods. Because it adopts the cross-modal retrieval method based on the image and text interactive CAE and incorporates 2DPCA into the feature extraction, the accuracy of retrieval is improved. Reference [15] proposed an ml-KCCA method to achieve cross-modal retrieval, but its retrieval performance is low due to poor feature extraction. Reference [28] combined low-level features and high-level semantic information to learn feature representation. Although it solves the problem of feature representation, due to the lack of feature interaction, its retrieval accuracy in complex environments still needs to be improved. Reference [22] used the DNDCMH method to complete cross-modal retrieval. However, this method has poor universality, so its retrieval performance is inferior to that of the proposed method.
It can be seen from Figure 7 that the retrieval performance of the proposed method is better than that of the other comparison methods in both retrieval tasks. When the recall is 0.2, the precision of each method reaches its maximum, and it then decreases continuously as the recall increases. Since the MSCOCO dataset has relatively few samples, the area under the PR curves of the different methods is larger than on the Flickr30K dataset.
It can be seen from Figure 8 that, like the first two datasets, the retrieval performance of the proposed method on the Pascal VOC 2007 dataset is better than that of the other comparison methods. The proposed method uses the residual network to extract image features and introduces 2DPCA to construct a new convolution kernel. At the same time, using word2vec and an LSTM network for text feature extraction makes feature extraction more efficient. It is better than [15], which uses existing label information, and [22], which uses specific images. In addition, [28] used the semantic CAE method to learn multimodal mapping and projected multimodal data into a low-dimensional space to retain feature and semantic information and improve retrieval accuracy. However, the proposed method uses the CAE model with interactive learning, and the fusion of image and text feature learning is better, so the retrieval performance is more ideal.

Figure 5: An example of image retrieval with text "car."
In summary, it can be seen from the PR curves on the different datasets that the proposed method shows the best results at different recall levels. This indicates that the deep interactive learning method it constructs is effective.

Conclusion
Cross-modal retrieval technology meets people's more diverse retrieval needs and addresses the problems of the heterogeneous gap and the semantic gap between different modal data. However, the retrieval accuracy still needs to be improved. For this reason, a cross-modal image and text retrieval method combining efficient feature extraction and interactive learning CAE is proposed. The residual network convolution kernel is improved by incorporating 2DPCA to extract image features, and text features are extracted through LSTM and word vectors. After that, the two kinds of features are input into the cross-modal CAE of interactive learning, and image-text retrieval is realized through the interactive learning of the middle layer. In addition, the proposed method is experimentally demonstrated on the Flickr30K, MSCOCO, and Pascal VOC 2007 datasets.
The results show that the proposed method can complete accurate image retrieval and text retrieval. Moreover, the average MAP on the two retrieval tasks is 0.359, 0.334, and 0.309 on the three datasets, respectively, which is higher than that of the other comparison methods. The same is true for the area under the PR curve.
At present, the method proposed in this paper is only suitable for cross-modal retrieval between text and image, but there are many types of multimodal data on the network. Next, more data of different media types such as audio and video will be expanded to meet people's broader retrieval needs.
Data Availability
The data included in this paper are available without any restriction.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.