CCAH: A CLIP-Based Cycle Alignment Hashing Method for Unsupervised Vision-Text Retrieval

Owing to their low storage cost and fast retrieval efficiency, deep hashing methods are widely used in cross-modal retrieval. Images are usually accompanied by corresponding text descriptions rather than labels, so unsupervised methods have attracted wide attention. However, because of the modality gap and semantic differences, existing unsupervised methods cannot adequately bridge the differences between modalities, leading to suboptimal retrieval results. In this paper, we propose CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH), which aims to exploit the semantic link between the original features of the modalities and the reconstructed features. Firstly, we design a modal cyclic interaction method that aligns semantics within each modality while the features of one modality reconstruct those of the other, thus taking full account of both intramodal and intermodal semantic similarity. Secondly, we introduce GAT into the cross-modal retrieval task: we consider the influence of neighbouring text nodes and add attention mechanisms to capture the global features of the text modality. Thirdly, we extract fine-grained image features using the CLIP visual encoder. Finally, hash codes are learned through hash functions. Experiments on three widely used datasets demonstrate that the proposed CCAH achieves satisfactory results in total retrieval accuracy. Our code can be found at: https://github.com/CQYIO/CCAH.git.


Introduction
As the internet and social networking grow rapidly, multimedia data such as images and texts are increasing dramatically, and retrieving these data efficiently is a great challenge. Cross-modal retrieval aims to use a query in one modality to search for semantically similar data in another, heterogeneous modality. Hashing methods [1–8] are widely used in retrieval tasks to improve storage and computational efficiency. Cross-modal hashing methods attempt to represent heterogeneous modal data as compact binary codes while maintaining semantic similarity between different modal data in a common hidden space.
Cross-modal hashing methods fall into two broad categories: supervised and unsupervised. Commonly available supervised hashing methods [2, 7, 9–13] have demonstrated significant performance. Their principle is to use hand-annotated label information or precomputed similarity matrices to guide model training and the learning of binary codes. Unfortunately, in real-world and more challenging scenarios, images are often accompanied by textual descriptions, while their labels, categories, or tags are difficult to obtain.
Recently, unsupervised cross-modal hashing has become an increasingly active research topic. Unsupervised hashing methods [1, 14–18] attempt to remove the model's reliance on manually annotated data during training, relying solely on the features of the data itself, and demonstrate strong performance. However, a common drawback of these unsupervised approaches is that, without the guidance of label information, the co-occurrence information inherent in vision-text pairs is easily overlooked during high-level semantic feature extraction (Figure 1). As a result, unsupervised models cannot accurately capture the semantic connections between different modal data, and retrieval accuracy is suboptimal. In view of this, we point out that the hash codes of images and text that appear in pairs should have either a minimum Hamming distance or a maximum degree of semantic similarity.
In addition, most existing cross-modal methods focus on the alignment of semantic features between cross-modal data (e.g., GAN [19]). They simplify the semantic association between the reconstructed features and the original features within a modality, so the generated hash codes are not fully suited to cross-modal retrieval tasks. Inevitably, there is an inherent modality gap in high-level semantic interactions: such methods cannot attend to both the intramodal and intermodal semantic information of a modality, nor can they bridge the alignment between modal features and hash codes, so the retrieval results are not optimal.
To solve the above problems, in this paper we propose a novel deep unsupervised cyclic semantic alignment cross-modal hashing method termed CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH). CCAH is an end-to-end learning framework that simultaneously attends to intramodal and intermodal semantic features and to hash code consistency. Our CCAH network model consists of three components: deep feature extraction, cycle alignment, and hash encoding learning. Previous unsupervised network models have suffered from low accuracy when retrieving images with text queries. It is well known that, in image-text pairs, images contain richer semantic information and allow higher-level semantic representations to be extracted at a finer level. The corresponding text description (e.g., a BOW vector) carries relatively little semantic information, and often only a few keywords can be matched to the described image regions (attention points). Moreover, text has contextual relationships, and the same word may represent different semantic information, so text-to-image retrieval is often less accurate than image-to-text retrieval. We propose to treat the text as graph-structured data, transforming text features into node information in the graph. We further fuse the sparse text features using a GAT network, which fuses the information of related neighbouring nodes with the original nodes through an attention scoring mechanism. The attention score indicates the closeness of the connection between different nodes, with higher scores meaning a closer relation. An auto-encoder is used to encode and decode the extracted modal features. Our contributions in this work are as follows:
(i) We propose a new deep hash network model called CCAH. CLIP is used as a visual encoder to extract fine-grained features, and the GAT network is used for feature extraction of the text modality.
(ii) We propose a cycle alignment method that aligns image features with the features extracted by the auto-encoder, and then aligns the features after mapping them to the text modality space, and vice versa, to ensure semantic links between modalities.
(iii) Experiments demonstrate that our model achieves satisfactory results in terms of total retrieval accuracy on three commonly used multimodal datasets.

Related Work
Currently, cross-modal hash retrieval is broadly divided into supervised and unsupervised hashing. Supervised hashing methods perform better than unsupervised methods because labels or similarity matrices help them avoid interference from redundant information.
2.1. Supervised Hashing Methods. Supervised hashing methods use manually annotated label information or predefined similarity matrices to guide the training of binary codes across different modalities and have shown excellent performance in multimodal data retrieval. Recently, many supervised hashing methods have continuously improved the retrieval accuracy benchmarks. TDH [20] uses triplets to flexibly capture a variety of higher-level similarities, rather than the simple similarity or dissimilarity of pairs, and uses ranking to optimize intraclass and interclass variation; SCM [13] learns the hash function bit by bit using supervised information with linear time complexity; DOH [21] learns ordinal representations to generate ranking-based hash codes by leveraging the ranking structure of the feature space from both local and global views; Seph [3] uses a probability distribution, which is approximated by minimizing the Kullback-Leibler divergence, to learn hash codes in Hamming space; QCH [9] proposes to simplify the optimization process by transforming the multimodal objective function into a unimodal formalism; MCSCH [12] proposes a multiscale association mining strategy, giving a multiscale feature-guided sequential hashing method; DLFH [11] introduces a discrete learning algorithm that learns binary hash codes directly, without the need for continuous relaxation. However, the above methods require substantial manual and financial effort to label the dataset for hash function learning, which is often unrealistic in real-life scenarios, and without label information their retrieval accuracy inevitably degrades.

Unsupervised Hashing Methods
To reduce the need for manual annotation during model training, unsupervised cross-modal hashing methods have been proposed. CVH [1] learns binary codes by minimizing the similarity-weighted Hamming distance; IMH [6] builds two intramodal similarity matrices based on neighbour relations; CMFH [16] uses matrix factorization to address the semantic relevance of different modalities and maps heterogeneous modal data into a hidden state space; UDCMH [17] learns features and hash codes under Laplacian and discrete constraints; DJSRH [14] fuses semantic information into the affinity matrix to calculate potential correlations between modalities; DSAH [22] aligns intramodal and intermodal data by combining semantic similarity alignment with heterogeneous modal data reconstruction; JIMFH [23] combines intramodal and intermodal hash codes to obtain the final hash code; DBRC [24] proposes a framework with adaptive binary reconstruction that allows discrete hash codes to be learned directly; HNH [25] weights the original similarities using Hadamard products and creates a joint similarity matrix using linear combinations. Although these unsupervised cross-modal hashing models have achieved better results by exploiting the co-occurrence information of image-text pairs, they still ignore part of the image information, resulting in poor accuracy when retrieving images with text.

Problem Formulation
We define $I_i$ to represent the $i$-th image and $T_j$ to represent the $j$-th text. Each image-text pair instance can be represented as $o_k = (I_k, T_k)$. We denote feature representations by $F$.
The semantic features extracted by the visual feature encoder are denoted $F_I \in \mathbb{R}^{m \times D_I}$, where $D_I$ is the dimension of the high-level feature representation obtained by passing the original image through the image encoder. Likewise, the feature representation of the text after the text encoder is $F_T \in \mathbb{R}^{m \times D_T}$, where $D_T$ is the dimension of the high-level text feature representation and $m$ is the number of sample instances. In addition, we define the hash code representation as $B_* \in \{-1, +1\}^{m \times c}$ with $* \in \{I, T\}$. Here $c$ denotes the length of the hash code, and the hash code of the $i$-th original datum in $B_*$ is denoted $b_{*,i}$. We also denote the cosine similarity function for paired image-text data by $\cos(\cdot)$ and use $\mathrm{sign}(\cdot)$ for the element-wise sign function. The definitions are as follows:

Here, $\|\cdot\|$ denotes the $\ell_2$ norm for vectors and the Frobenius norm for matrices.
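The notation above can be illustrated with a minimal NumPy sketch of the pairwise cosine similarity and the element-wise sign function. The function names, the convention that zero maps to +1, and the small `eps` guard are assumptions made for this illustration and are not taken from the paper's released code.

```python
import numpy as np

def cosine_similarity_matrix(x, y, eps=1e-12):
    """Pairwise cosine similarity between the rows of x (m, d) and y (n, d)."""
    x_unit = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y_unit = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return x_unit @ y_unit.T

def sign_codes(h):
    """Element-wise sign mapped to {-1, +1}; zeros go to +1 (assumed convention)."""
    return np.where(h >= 0, 1, -1).astype(np.int8)

def hamming_distance(b1, b2):
    """Hamming distance between rows of {-1, +1} codes: (c - <b1, b2>) / 2."""
    c = b1.shape[1]
    inner = b1.astype(np.int32) @ b2.astype(np.int32).T
    return (c - inner) // 2

# Toy real-valued features for two instances with code length c = 3.
features = np.array([[0.3, -0.2, 0.0],
                     [-1.0, -0.5, 2.0]])
codes = sign_codes(features)
distances = hamming_distance(codes, codes)
```

For codes in {-1, +1}, a larger inner product means a smaller Hamming distance, which is why cosine similarity on hash codes can stand in for Hamming ranking during training.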

Model.
Figure 2 shows all the components of our model. The CLIP-based cycle alignment hashing for unsupervised vision-text retrieval (CCAH) consists of three parts: the feature extraction part, the cycle semantic alignment part, and the hash coding learning part.
Graph networks [26] represent node information as a graph, transforming the graph topology into an adjacency matrix by aggregating node-to-node associations and fusing the information of each node and its neighbours into a new node. With attention [27] showing advanced performance in NLP and CV, the attention mechanism has been introduced into graph networks: instead of performing a simple fusion, the attention algorithm gives each node an attention score and then fuses the information of the different nodes accordingly. Less relevant feature words receive a lower score, and more relevant feature words receive a higher attention score. In fusing this information, the influence of different feature words on the nodes is reinforced and better semantic information can be extracted.
Since our text is represented as a 1386-dimensional feature vector, we treat these features as node data, and each text can be represented as $f_i \in \mathbb{R}^{1386}$. To obtain sufficient expressive power to transform the input features into higher-level features, a learnable weight matrix $W \in \mathbb{R}^{f \times f}$ is first applied, and self-attention is then performed on the nodes.

where $a$ is the attention function and $e_{ij}$ denotes the importance of node $j$ to node $i$. We compute $e_{ij}$ for each neighbouring node $j$ of node $i$. To make the coefficients easily comparable across different nodes, we use the softmax function to normalize over all neighbouring nodes.

$T$ represents the transpose of a vector. By doing this for all nodes, the node information of the adjacency matrix is transformed into new node vectors containing the attentional features of each neighbouring node. This weighted fusion supplies the semantic information that the text modality lacks, leading to a more powerful representation of the text modality. The graph attention network thus fuses each feature word with its associated feature words using attention, and the attention-weighted fusion yields a new semantic feature representation that contains the information of neighbouring nodes.
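The attention scoring and softmax normalization described above follow the general pattern of a single-head graph attention layer, which can be sketched as below. This is an illustrative NumPy version of the standard GAT formulation, not the authors' implementation: the LeakyReLU slope of 0.2, the way the graph is supplied (a caller-provided adjacency matrix with self-loops), and all names and shapes are assumptions.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def gat_layer(features, adjacency, weight, attn_vec):
    """Single-head graph attention.

    features:  (n, f) node features
    adjacency: (n, n) nonzero where node j is a neighbour of node i;
               every row needs at least one nonzero entry (e.g. a self-loop)
    weight:    (f, f_out) learnable linear map W
    attn_vec:  (2 * f_out,) attention vector a
    """
    h = features @ weight                          # W f_i for every node
    f_out = h.shape[1]
    # a^T [W f_i || W f_j] splits into a source term and a target term.
    src = h @ attn_vec[:f_out]
    dst = h @ attn_vec[f_out:]
    e = leaky_relu(src[:, None] + dst[None, :])    # e_ij for all pairs
    e = np.where(adjacency > 0, e, -np.inf)        # keep neighbours only
    e = e - e.max(axis=1, keepdims=True)           # numerically stable softmax
    alpha = np.exp(e)
    alpha = alpha / alpha.sum(axis=1, keepdims=True)
    return alpha @ h, alpha                        # fused nodes, attention scores

rng = np.random.default_rng(0)
n, f, f_out = 4, 6, 3
features = rng.normal(size=(n, f))
# Chain graph with self-loops (hypothetical topology for the demo).
adjacency = np.eye(n) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)
weight = rng.normal(size=(f, f_out))
attn_vec = rng.normal(size=2 * f_out)
fused, alpha = gat_layer(features, adjacency, weight, attn_vec)
```

Each row of `alpha` sums to one over the node's neighbours, so a higher score directly means a larger share of that neighbour's features in the fused node.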

Deep Feature Extraction.
To extract richer high-level semantic representations of the modalities, we design different encoders for different modal data. Image modalities contain richer semantic information than text modalities, and a single-stream model (e.g., ViLT [28]) cannot bridge the inherent gap across modalities, cannot perform optimal feature extraction for each modality, and has limited ability to mine semantic consistency information from heterogeneous data. We therefore adopt a dual-stream model to extract semantic features from the different modal data, which showed excellent results throughout the training phase.
(1) Image Feature Extraction. CLIP [29] is trained by contrastive learning on a very large dataset and, compared with ViT [30], yields good-quality results on several datasets. We use the CLIP pretrained model as the feature extractor for the image modality in our model. We feed the original image into the CLIP image encoder (encode_image) (Figure 3), and after extraction we obtain a 1024-dimensional high-level semantic vector, which we define as $F_I \in \mathbb{R}^{m \times 1024}$.
(2) Text Feature Extraction. Text data do not contain as much high-level semantic information as image data, but text semantics are contextually relevant. We treat the features of text as nodes of a graph and use graph attention networks (GAT [31]) to extract aggregated semantic information from text. GAT treats text features as nodes, converts input features into higher-level features to obtain more expressive power, introduces an attention mechanism, performs self-attention on the nodes, and finds the attention weight coefficients between nodes. By weighted summation over the surrounding neighbouring nodes, it obtains information that aggregates all surrounding nodes, making the connections within the text information more realistic (Figure 4). The text features are constructed as adjacency matrices, whose entries represent the linkage within the text modality, and the semantic representation of text can be better processed by weighting the features. The original text features are denoted $F_T \in \mathbb{R}^{m \times 1386}$.
For simplicity, we define the feature extractor as $F$. The mathematical notation of each modal feature extractor is defined as follows:

where $I$ and $T$ are the original image and text, and $\Theta_I$ and $\Theta_T$ are the parameters of the feature extractors. In this way, we can extract semantically rich high-level representation features for each modality, which can be used to fully explore the semantic relationships between the data and further guide modal alignment and hash code learning. We construct similarity matrices within and across modalities, and the generated hash matrices are also aligned between modalities, which guarantees semantic alignment within modalities, alignment between hash codes and features across modalities, and alignment between hash matrices.

Cycle Alignment.
To facilitate intramodal semantic feature alignment and to maintain semantic interaction between cross-modal data, we propose a cycle semantic alignment method. Semantically similar vision-text pairs are pushed close together in the common representation space, and dissimilar pairs are pushed farther apart. To further align text and images, we use intramodal and intermodal loss measures. We use an auto-encoder to compress the high-level semantic features into low-dimensional semantic representations and to reconstruct these underlying semantic features back into features of the heterogeneous data. We define the function that compresses the high-level semantic representation as follows:

where $F_*$ denotes the original features of the image and text, $\delta_*$ denotes the parameters of the encoder $\mathrm{Enc}(\cdot)$ for each modality, and $* \in \{I, T\}$. The high-level semantic features extracted by the feature extractor are encoded and compressed by the encoder to obtain a real-valued representation with strong representational power and rich semantics, which we then reconstruct back into a representation of the heterogeneous data by means of a decoder, which we define as follows:

We input the features of the image (text) into the encoder, and the semantic information obtained is then mapped to the feature space of the text (image) by the decoder to achieve semantic alignment between modalities. After obtaining the reconstructed features of the heterogeneous data, to facilitate cross-modal information interaction, we semantically align the original image features with the features reconstructed from the text by the decoder. To ensure that the resulting compressed feature vector represents the original high-dimensional feature representation, we also align the high-dimensional features with the encoded features, achieving intramodal semantic alignment.
(1) Intermodal. To facilitate information interaction between different data and achieve cross-modal semantic interaction, the semantic features obtained by the feature extractor of one modality are encoded and then decoded by the auto-encoder into the corresponding semantic space of the other modality. $F_{T,I}$ represents the vector representation after mapping the text features to the image feature space, and $F_{I,T}$ represents the vector representation obtained by mapping the image features to the text feature space. We construct the cross-modal semantic feature matrices $S_{F_{T,I}}$ and $S_{F_{I,T}}$. Alignment of the different modalities is achieved by minimizing cross-modal semantic losses, with the following loss function:

The total intermodal loss is expressed as follows:

We can leverage the high-level semantic feature representations of the two modalities for cross-modal alignment, and we achieve alignment of cross-modal heterogeneous data by minimizing $L_{C\text{-inter}}$.
(2) Intramodal. To ensure the representativeness of the semantic information within each modality and to reduce semantic feature loss, we also impose intramodal constraints within the same modality: we align the features extracted from the original image with the higher-level semantic representations encoded by the auto-encoder. Representability and completeness of the high-level semantic information within a modality are ensured by minimizing $L_{C\text{-intra}}$. We construct the image feature matrix $S_{F_{I,I}}$ from the original extracted features, and the matrix of the hidden-state features obtained from the auto-encoder is denoted $S_{\mathrm{Enc-}I}$. Likewise, the text features are represented by $S_{F_{T,T}}$ for the original extracted features and $S_{\mathrm{Enc-}T}$ for the features encoded by the auto-encoder. We define the intramodal losses as follows:

We thus construct a semantic alignment method with both intramodal and intermodal alignment. Intramodal semantic alignment is achieved by aligning the high-level semantic representations extracted by the visual encoder and the text encoder with the compressed semantic features produced by the auto-encoder, ensuring that the high-dimensional modal data can be restored from a small number of high-level features. Intermodal alignment is achieved by aligning the heterogeneous data with the original modal features through the mapping of the decoder, enabling information interaction across the modal data. We define the cycle alignment loss as follows:
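The interplay of the intramodal and intermodal terms can be made concrete with a small NumPy sketch. It is only one plausible reading of the description above: the single-layer tanh encoders, linear cross-modal decoders, the mean-squared difference between cosine similarity matrices, the unweighted sum of the two terms, and every parameter name are assumptions, since the exact loss formulas are not reproduced in this section.

```python
import numpy as np

def cos_sim(x, y, eps=1e-12):
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return x @ y.T

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cycle_alignment_loss(f_img, f_txt, params):
    # Encode each modality into a shared-size hidden state (assumed tanh layer).
    h_img = np.tanh(f_img @ params["enc_img"])
    h_txt = np.tanh(f_txt @ params["enc_txt"])
    # Decode each hidden state into the feature space of the OTHER modality.
    f_img_to_txt = h_img @ params["dec_img_to_txt"]   # plays the role of F_{I,T}
    f_txt_to_img = h_txt @ params["dec_txt_to_img"]   # plays the role of F_{T,I}
    s_img = cos_sim(f_img, f_img)
    s_txt = cos_sim(f_txt, f_txt)
    # Intramodal: original features vs. their own encoded hidden states.
    intra = mse(s_img, cos_sim(h_img, h_img)) + mse(s_txt, cos_sim(h_txt, h_txt))
    # Intermodal: original features vs. features reconstructed from the other modality.
    inter = (mse(s_img, cos_sim(f_txt_to_img, f_txt_to_img))
             + mse(s_txt, cos_sim(f_img_to_txt, f_img_to_txt)))
    return {"intra": intra, "inter": inter, "cycle": intra + inter}

rng = np.random.default_rng(0)
m, d_img, d_txt, d_hid = 8, 1024, 1386, 64   # feature sizes from the text; d_hid is hypothetical
params = {
    "enc_img": rng.normal(scale=0.01, size=(d_img, d_hid)),
    "enc_txt": rng.normal(scale=0.01, size=(d_txt, d_hid)),
    "dec_img_to_txt": rng.normal(scale=0.01, size=(d_hid, d_txt)),
    "dec_txt_to_img": rng.normal(scale=0.01, size=(d_hid, d_img)),
}
f_img = rng.normal(size=(m, d_img))
f_txt = rng.normal(size=(m, d_txt))
losses = cycle_alignment_loss(f_img, f_txt, params)
```

Comparing similarity matrices rather than raw feature vectors keeps the loss well defined even though the hidden states and the original features have different dimensions.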

Hash Encoding Learning.
After feature extraction and cycle semantic alignment, the semantic information of the text and visual data has been extracted and interlinked with high quality. In cross-modal retrieval, we aim to bring semantically similar heterogeneous data closer together: given a query point in one modality, semantically related data samples in another modality are found according to a defined similarity metric. By converting the query points into hash codes, the corresponding modal information can be retrieved more quickly. With the auto-encoder (AE) mapping, we can fully extract the high-dimensional feature encoding corresponding to each modality during the training phase. We map the AE-generated feature vectors to hash codes; because of the feature extraction and reconstruction operations, we use the real-valued features to construct the hash encoding and generate hash codes via the $\tanh(\cdot)$ function.
We compute the pairwise cosine similarity matrices of the generated hash codes and denote them $S_{B_{xy}}$. The visualization of hash code generation from the features is shown in Figure 5. The hash matrix of the text modality is denoted $S_{B_{TT}}$ and the hash matrix of the image modality is denoted $S_{B_{II}}$. The matrix elements are calculated using the following cosine similarity:

In addition, inspired by [22], to make fuller use of the semantic information described jointly by image-text pairs, we construct cross-modal hash code similarity matrices. Co-occurring image-text pairs have the most similar labels or categories compared with other data, so the elements on the diagonal, computed from the hash codes of paired images and texts, should be close to 1. We minimize the loss of co-occurring instances as follows:

Regarding the other elements, we use a diagonal similarity loss to bridge the connection between different modalities: the similarity of the same image-text pair should be independent of location information and depend only on feature information. We bridge the semantics of the image-text pairs by minimizing the diagonal loss, which we define as follows:

The total loss on $S_{B_{I,T}}$ is as follows:

After auto-encoder encoding, we map the obtained features to hash codes through hash functions, and we use these hash codes to construct the similarity matrices $S_{B_{I,I}}$ and $S_{B_{T,T}}$. In addition, we introduce a new similarity matrix $S_{B_{I,T}}$, obtained from the hash function mapping, which is constructed from image-text labels. We do not use labels for bootstrapping in the training phase; the label information is introduced mainly to calculate the hash loss.
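The two constraints on the cross-modal hash similarity matrix can be sketched as follows. This is a hedged illustration only: the squared-error form of the paired (diagonal) term, the reading of the second term as a symmetry constraint on off-diagonal entries, and the function and variable names are assumptions, because the corresponding equations are not reproduced in this section.

```python
import numpy as np

def cos_sim(x, y, eps=1e-12):
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return x @ y.T

def cross_modal_hash_losses(b_img, b_txt):
    """b_img, b_txt: (m, c) relaxed hash codes of paired images and texts."""
    s_it = cos_sim(b_img, b_txt)                    # plays the role of S_B^{I,T}
    # Paired instances sit on the diagonal and should have similarity close to 1.
    paired = float(np.sum((1.0 - np.diag(s_it)) ** 2))
    # Assumed reading: sim(I_i, T_j) should match sim(I_j, T_i), i.e. the
    # similarity should not depend on which position holds which modality.
    symmetric = float(np.sum((s_it - s_it.T) ** 2))
    return paired, symmetric

rng = np.random.default_rng(0)
b_img = np.tanh(rng.normal(size=(5, 16)))
b_txt = np.tanh(rng.normal(size=(5, 16)))
paired_loss, symmetric_loss = cross_modal_hash_losses(b_img, b_txt)
# Identical codes for both modalities drive both terms to (numerically) zero.
same_paired, same_symmetric = cross_modal_hash_losses(b_img, b_img)
```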
Although hashing methods speed up the retrieval process, mapping real-valued features to hash codes still loses some information, leading to suboptimal retrieval. In hash encoding learning, we also need to pay attention to the semantic relationships between data from different modalities, since preserving similarity information across modalities is central to cross-modal retrieval. Based on this, we align the features within each modality with the generated hash codes to ensure that the hash codes represent the original data more faithfully. Here $k$ is a modal adjustment parameter that gives more flexibility in preserving semantic similarity.
We construct a joint feature matrix that integrates the text feature matrix and the image feature matrix in a weighted way, so that both are represented by a single common matrix $S_{F_{I,T}}$. Here $\alpha$ is a hyperparameter that weights the feature matrices of images and text.
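A weighted joint matrix of this kind, and its alignment with the hash-code similarity matrix, might look like the sketch below. The convex combination with weight $\alpha$ on the image term, the use of $k$ as a scale factor on the joint matrix, and the mean-squared alignment loss are assumptions for illustration; only the hyperparameter names and their reported values ($\alpha = 0.8$, $k = 1.5$) come from the paper.

```python
import numpy as np

def cos_sim(x, y, eps=1e-12):
    x = x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)
    y = y / np.maximum(np.linalg.norm(y, axis=1, keepdims=True), eps)
    return x @ y.T

def joint_similarity(f_img, f_txt, alpha=0.8):
    """Weighted fusion of the intramodal similarity matrices (assumed convex form)."""
    return alpha * cos_sim(f_img, f_img) + (1.0 - alpha) * cos_sim(f_txt, f_txt)

def hash_alignment_loss(s_joint, codes, k=1.5):
    """Mean squared gap between the scaled joint matrix and the hash similarity."""
    s_hash = cos_sim(codes, codes)
    return float(np.mean((k * s_joint - s_hash) ** 2))

rng = np.random.default_rng(0)
m = 6
f_img = rng.normal(size=(m, 1024))
f_txt = rng.normal(size=(m, 1386))
codes = np.tanh(rng.normal(size=(m, 32)))      # relaxed hash codes (hypothetical)
s_joint = joint_similarity(f_img, f_txt, alpha=0.8)
loss = hash_alignment_loss(s_joint, codes, k=1.5)
```

With $\alpha = 0.8$ the image similarities dominate the joint matrix, which matches the paper's observation that images carry richer semantics than the bag-of-words text.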
We optimise our hash encoding based on matrix alignment.
The total loss between modalities is as follows:

3.3. Optimisation. We combine these losses to construct our total loss function as follows:

During training, the cyclic semantic interaction module uses real-valued codes. If the real-valued codes were converted into hash codes during training, some information would be lost; the real-valued features are more conducive to training the model, and the real-valued codes generated after multiple modal interactions are closer to the hash codes. However, binary hash codes are discrete values, so gradients cannot be propagated through them. To solve this problem, inspired by $\lim_{\eta \to \infty} \tanh(\eta x) = \mathrm{sign}(x)$, we transform the real-valued codes into binary hash codes via $\tanh(\cdot)$ with the following function definition:

The proposed CCAH algorithm is shown in Algorithm 1.
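The limit $\lim_{\eta \to \infty} \tanh(\eta x) = \mathrm{sign}(x)$ that motivates the relaxation can be checked numerically. The sketch below is illustrative only; the sample values and the schedule of $\eta$ are arbitrary choices, not settings from the paper.

```python
import numpy as np

def relaxed_codes(h, eta):
    """Differentiable surrogate for sign(h)."""
    return np.tanh(eta * h)

def binary_codes(h):
    """Hard codes in {-1, +1}; zero maps to +1 (assumed convention)."""
    return np.where(h >= 0, 1.0, -1.0)

h = np.array([-0.8, -0.05, 0.02, 0.6])   # arbitrary real-valued hidden activations
etas = (1, 10, 100, 1000)
# Largest element-wise gap between the relaxed and the hard codes for each eta.
gaps = [float(np.max(np.abs(relaxed_codes(h, eta) - binary_codes(h)))) for eta in etas]

for eta, gap in zip(etas, gaps):
    print(f"eta={eta:5d}  max |tanh(eta*h) - sign(h)| = {gap:.3e}")
```

The gap shrinks as $\eta$ grows, while $\tanh$ keeps a usable gradient for moderate $\eta$; activations close to zero are the slowest to saturate.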

Experiment
Datasets. Our experiments were conducted on three cross-modal retrieval datasets, MIRFlickr-25K [32], NUS-WIDE [33], and MS COCO [34], to validate the effectiveness of our proposed model. The datasets are described as follows.
MIRFlickr-25K: MIRFlickr contains 25,000 image-text pairs collected from the Flickr website. Each image-text pair is saved as an instance. For the text modality, following DJSRH [14], the tags of each text are sorted by occurrence and transformed into a bag-of-words (BOW) vector.
NUS-WIDE: NUS-WIDE consists of 269,648 pairs of multimodal data in 81 categories, with each multimodal instance containing an image and its corresponding tags. For simplicity, we selected the 10 most frequent of the original 81 categories, giving 186,577 tagged instances. The text of each instance was represented as a 500-dimensional bag-of-words (BOW) vector. We collated the index vector of the 1,000 most frequent text labels.
MS COCO: MS COCO was originally collected for image understanding tasks and contains 123,287 images. For each image, a text description and a 91-dimensional semantic label are given. Our experiment uses 87,081 images with category information and represents the textual information with a 2,000-dimensional bag-of-words vector. Of these, 5,000 image-text pairs were randomly selected as the query set and the remaining image-text pairs were used as the retrieval set. For the training set, 10,000 pairs were randomly sampled from the retrieval set.

Implementation Details.
We used CLIP as the feature extractor for the image modality and GAT as the feature extractor for the text modality. We used cyclic modal interaction to achieve semantic alignment within and between modalities (intramodal and intermodal). We use the hidden features of one modality to reconstruct the features of the other modality and carefully set the hyperparameters $\alpha$, $\beta$, and $k$ to assist learning. We analyze the sensitivity of these parameters experimentally. Finally, we selected $\alpha = 0.8$, $\beta = 0.2$, and $k = 1.5$. The batch size is 16, the learning rate is 0.005 for both the image and text modalities, SGD is used as the optimizer, and the weight decay is set to $5 \times 10^{-4}$.

Baseline and Validation.
Evaluation criteria. We use three common cross-modal datasets, MIRFlickr-25K, NUS-WIDE, and MS COCO, to validate our model. For MIRFlickr-25K and NUS-WIDE, we follow [14, 16, 17] and sample 2,000 instances as query points, with the remainder forming the retrieval database. Because of the large amount of data in MIRFlickr-25K and NUS-WIDE, we randomly sampled a subset of the database set for training. For fairness in training, we took some instances from each class in the first round of training and sampled randomly in the remaining stages. In the MS COCO dataset, we take 10,000 instances as the retrieval set and the remainder makes up the database set. In our experiments, we use MAP and precision@top-K curves as the evaluation criteria.
We compare with previous work on the MIRFlickr-25K and NUS-WIDE datasets, using MAP@50 as the benchmark. As shown in Table 1, the total retrieval accuracy of our CCAH model is better than that of previous work at different code lengths.
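For reference, MAP@K over Hamming ranking can be computed along the following lines. This is a generic sketch rather than the evaluation script used in the paper: treating two instances as relevant when they share at least one label, normalizing average precision by the number of relevant items within the top K, and scoring queries without any relevant hit as 0 are assumed conventions that vary between codebases.

```python
import numpy as np

def map_at_k(query_codes, db_codes, query_labels, db_labels, k=50):
    """Mean average precision at top-k for {-1, +1} hash codes.

    query_codes: (q, c), db_codes: (n, c)
    query_labels: (q, l), db_labels: (n, l) multi-hot label matrices
    """
    c = query_codes.shape[1]
    hamming = (c - query_codes @ db_codes.T) / 2       # (q, n) Hamming distances
    relevant = (query_labels @ db_labels.T) > 0        # share at least one label
    average_precisions = []
    for dist, rel in zip(hamming, relevant):
        order = np.argsort(dist, kind="stable")[:k]    # nearest codes first
        hits = rel[order].astype(float)
        if hits.sum() == 0:
            average_precisions.append(0.0)
            continue
        precision_at_rank = np.cumsum(hits) / np.arange(1, len(hits) + 1)
        average_precisions.append(float((precision_at_rank * hits).sum() / hits.sum()))
    return float(np.mean(average_precisions))

# Tiny hand-checkable example: one query, three database items.
query_codes = np.array([[1, 1]])
db_codes = np.array([[1, 1], [1, -1], [-1, -1]])
query_labels = np.array([[1, 0]])
db_labels = np.array([[1, 0], [0, 1], [1, 0]])
score_top3 = map_at_k(query_codes, db_codes, query_labels, db_labels, k=3)
score_top1 = map_at_k(query_codes, db_codes, query_labels, db_labels, k=1)
```

In the toy example the relevant items are ranked first and third, so the top-3 average precision is (1 + 2/3) / 2.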
As can be seen, our model achieves excellent results on two widely used datasets. On MIRFlickr-25K there are significant gains in both image-to-text and text-to-image retrieval. On NUS-WIDE, image-to-text retrieval is slightly worse, but text-to-image accuracy improves significantly and overall retrieval accuracy also improves. We used the NUS-WIDE (tc-10) dataset, which is composed of the 10 most common classes. Because the NUS-WIDE dataset is relatively large, it is not possible to ensure that the classes are equally represented when sampling, and the data are sparser when constructing the adjacency matrix, which reduces the accuracy of image-to-text retrieval. To validate this explanation, following DAEH [42], we tested again on the MS COCO dataset, which uses 81 classes. We used MAP@5000 to evaluate our model, and the results are shown in Table 2.

Ablation Experiment.
We experimentally validate the effect of the different modules on accuracy, using the MIRFlickr dataset with 128-bit codes. We have also made other attempts. In the encoding and compression phase, we tried a two-way model in which the compressed vector reconstructs both its own original features and the original features of the heterogeneous data, rather than only the features of the heterogeneous data. We validated this on the MIRFlickr and NUS-WIDE datasets. The results show that adding homogeneous feature reconstruction gives a relative 1% improvement in image-to-text retrieval, but the accuracy of text-to-image retrieval decreases (Table 3).
In Table 4, we perform ablation experiments on the different modules to demonstrate the effectiveness of our proposed method.

Visualization of the Learned Representation.
To visualize the effectiveness of the proposed CCAH, we use t-SNE to visualize the learned representations of images and text on the Flickr-25K dataset (Figure 5). The original feature representations of the images and text are shown in Figures 5(a) and 5(c), respectively. It can be seen that the distributions of these modalities differ greatly and that it is difficult to distinguish the samples by their original representations. Figures 5(b) and 5(d) give the distributions of the learned representations of the images and text, respectively. The figures show that the proposed CCAH method helps to distinguish samples of different semantic classes, and some clusters show distinct margins.

Hyperparameter Sensitivity.
We further validated our parameters $k$, $\alpha$, and $\beta$ on three datasets using 128-bit code lengths. $k$ is the influence factor used when aligning the similarity matrix of the feature values with that of the hash codes, and we find that the best results are obtained when $k = 1.5$. $\alpha$ is the parameter for aligning images and text across modalities. Image modalities are known to contain richer semantic features than text modalities (Figure 1), so when weighting images and text, the image component is weighted more than the text, and our model achieves the best results when $\alpha = 0.8$. $\beta$ is the parameter that balances the hash encoding with the original features and also boosts the intramodal and intermodal coefficients. The visualization of hyperparameter sensitivity is shown in Figure 7.

Algorithm 1: The proposed CCAH algorithm.
(5) Encode the features to get the hidden states by Enc(∗);
(6) Use the hidden states to generate the truth matrix and hash codes;
(7) Decode the hidden states to generate heterogeneous features $F_I'$ and $F_T'$;
(8) Calculate the objective function;
(9) Back-propagate the gradient with the chain rule;
(10) Update all parameters.
4.6. Comparison with Other Models.
On the three common cross-modal datasets mentioned above, our results improve significantly over other models, and our total retrieval accuracy in top-k exceeds that of previous methods in all cases. We added a GAT network, which constructs adjacency matrices from graph neighbors and uses attention to enrich the text modality, whose semantic features are comparatively poor; this yields higher accuracy than traditional bag-of-words features. For images, the large-scale pretrained CLIP model extracts features at a finer level. We build a cyclic semantic alignment module that constructs the semantic features of the heterogeneous modalities from the hidden state vector that the autoencoder produces for each modality. Compared with constructing the features from binary codes, the real-valued information is more representative of the modality features, whereas much useful information is lost when binary codes are used.
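As a reference for how top-k retrieval accuracy is typically scored for hash codes, the following is a minimal sketch of MAP@k with Hamming ranking. It assumes codes in {−1, +1}, treats a database item as relevant when it shares at least one label with the query, and normalizes each average precision by the number of relevant items found in the top k. Other normalizations exist, and the evaluation script used for CCAH may differ. All names and the toy data are hypothetical.

```python
def hamming_distance(code_a, code_b):
    """Number of positions where two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))

def average_precision_at_k(query_code, query_labels, db_codes, db_labels, k):
    """AP@k for one query: rank the database by Hamming distance, then
    average the precision at every rank that holds a relevant item."""
    order = sorted(
        range(len(db_codes)),
        key=lambda i: hamming_distance(query_code, db_codes[i]),
    )
    hits, precision_sum = 0, 0.0
    for rank, idx in enumerate(order[:k], start=1):
        if query_labels & db_labels[idx]:  # relevant if any label is shared
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0

def map_at_k(query_codes, query_labels, db_codes, db_labels, k):
    """Mean of AP@k over all queries."""
    aps = [
        average_precision_at_k(code, labels, db_codes, db_labels, k)
        for code, labels in zip(query_codes, query_labels)
    ]
    return sum(aps) / len(aps)

# Toy cross-modal setup: text queries against an image database (4-bit codes).
db_codes = [(1, 1, 1, 1), (1, 1, 1, -1), (-1, -1, 1, 1), (-1, -1, -1, -1)]
db_labels = [{"sky"}, {"dog"}, {"sky", "sea"}, {"dog"}]
query_codes = [(1, 1, 1, 1), (-1, -1, -1, -1)]
query_labels = [{"sky"}, {"dog"}]

score = map_at_k(query_codes, query_labels, db_codes, db_labels, k=3)  # 5/6 here
```

For each toy query, the relevant items sit at ranks 1 and 3, so each average precision is (1 + 2/3) / 2 = 5/6. The same routine covers both retrieval directions by swapping which modality supplies the queries and which supplies the database.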

International Journal of Intelligent Systems
We validate our model on the MS COCO dataset and manually mark the relevant regions of the retrieved images (Figure 8). In text-to-image retrieval, the text marked in red is the feature word of the query (corresponding to the marked image region); in image-to-text retrieval, text marked in red indicates a retrieval result that does not quite match the content of the image.

Conclusion
In this paper, we propose a novel deep unsupervised cross-modal hashing method, CLIP-based cycle alignment hashing (CCAH), for unsupervised vision-text retrieval. We construct a cycle alignment module that allows for more flexible exploitation of high-level semantic information within and across modalities. To further bridge the gap between the two modalities, we use the hidden state vector of one modality to reconstruct the features of the other modality, enabling cross-modal data to be mutually characterized. Extensive experiments on three benchmark datasets show that CCAH outperforms several state-of-the-art methods in multimodal data retrieval tasks.

[Figure 8 example texts: "A man is riding on the top of an elephant"; "a man on skis stands in the snow"; "A female tennis player is standing on a tennis court."; "A female tennis player is running to catch the ball."; "A female tennis player swings her racket at a tennis ball and a line umpire stands behind her"; "A female tennis player raises her racket to hit the ball on a tennis court."; "A woman is holding a tennis racket as the ball is in front of her."; "There is a man riding a bike carrying a bag"; "A man riding a bike along side of a yellow bus."; "A person riding a bike in front of a bus"; "A guy riding a bike close to a car in the street"; "a person riding a bike on a city street"]

Figure 1: Because images contain richer high-order semantic information than text, text-to-image retrieval usually attends only to the image regions that are consistent with the text representation, which loses visual semantics and reduces the accuracy of text-to-image retrieval.

Figure 2: The overall architecture of our model. The orange region indicates the image modality and the green region indicates the text modality. We construct similarity matrices within and across modalities, and the generated hash matrices are also aligned between modalities. This guarantees semantic alignment within modalities, alignment between hash codes and features across modalities, and alignment between hash matrices.

Figure 3: We use the CLIP image encoder for the images; the left side shows the original image and the right side shows the attention visualization for features at different levels.

Figure 5: t-SNE visualization of the data on the Flickr-25K dataset. (a) Original image features. (b) Image encoded feature distribution. (c) Original text features. (d) Text encoded feature distribution. In the figure, the circle (○) and star (*) denote the representations of text and image samples, respectively, and different colors denote representations of different semantic categories.

Figure 8: Text-to-image and image-to-text retrieval on the MS COCO dataset. The bounding boxes in the images are manually labelled for readability. In image-to-text retrieval, incorrect retrieval results are shown in red.

Table 1: Comparison of mean average precision (MAP@50) for different code lengths on the Flickr-25K and NUS-WIDE datasets.

Table 2: Comparison of mean average precision (MAP@5000) for different code lengths on the Flickr-25K, NUS-WIDE, and MS COCO datasets.