Adaptive Sensitive Information Recognition Based on Multimodal Information Inference in Social Networks



Introduction
Privacy is a significant concern for online social network users because it affects their ability to control who can access their personal information and how that information can be used. Without effective privacy protection, users may be at risk of having their personal information accessed or misused by others, which can lead to various harms such as identity theft, financial fraud, or online harassment. By protecting user privacy, online social networks can help ensure that users feel safe when using the network and contribute to building trust and confidence throughout the network. Protecting user privacy is also important for online social networks themselves. If users feel that their personal information is not adequately protected, they may be less likely to use the network, which could shrink the network's user base and decrease its overall value. By protecting user privacy, online social networks can help maintain and expand their user base, which is critical to their continued success. Although users may be reluctant to share personal data, the inherent linkages between public data and private data often result in serious privacy breaches. The 2021 Data Security Conference once again drew wide attention, and various voices on data security emerged. According to incomplete statistics, since 2015 the number of Internet black-gray industry practitioners has exceeded 400,000. Although public data show that the domestic network security industry was expected to exceed 60 billion yuan in 2019, the black-gray industry had already reached a scale of 100 billion yuan. These observations show that private data are often subject to inference attacks, in which adversaries analyze a user's public data to illegally obtain information about their private data.
However, few social network users are aware of the serious dangers of privacy breaches. In the face of the explosive growth of network data, maintaining a safe environment for the dissemination of information in network communities is therefore a pressing need both domestically and internationally.
With the widespread deployment of heterogeneous networks, a large amount of high-capacity, high-diversity, high-speed, and high-accuracy data has been generated. These multimodal big data contain rich information within and across modalities, which poses a great challenge to traditional sensitive information identification methods. Research on multimodal sensitive information recognition in online social networks is an important field within the automatic recognition of sensitive information in online social networks. It can serve various purposes, such as detecting and removing harmful content, protecting user privacy, and developing more effective auditing tools. By identifying sensitive information such as personal information, potentially offensive or harmful content, or illegal activities, it helps prevent unauthorized parties from accessing users' personal data, such as financial information or login credentials, and using it for malicious purposes such as identity theft or fraud. In addition, these systems can be used to monitor users' activities and any suspicious behavior on the network, such as attempts to access personal information without permission. By detecting and preventing unauthorized access to personal information, these systems can help protect user privacy and guard their data against potential threats.
Effective detection of widespread sensitive content is a critical issue. Multimodal sensitive information recognition systems are a comparatively effective countermeasure: unlike systems that rely on a single mode or method, they can detect a wider range of sensitive information. For example, systems that only use natural language processing to analyze text may miss sensitive information contained in images or videos, or information that is implied rather than explicitly stated in text. By using multiple methods to analyze network content, multimodal systems can provide a more comprehensive view of the sensitive information present on the network and can more effectively detect and remove it. Another advantage of multimodal systems is that they can be more robust against attempts to evade detection. For example, users attempting to share personal information on the network may try to hide it in various ways, such as using abbreviations, initialisms, or slang, or by embedding it into images or other nontextual content. A multimodal system that has been trained to recognize various disguises can more effectively detect and prevent this type of sensitive information, while a system that only uses a single mode or method may be more easily fooled.
In general, multimodal sensitive information recognition systems are considered an important research area because they can help make online social networks safer for all users. By automatically detecting and removing sensitive information, these systems can protect users' privacy, prevent the spread of harmful content, and enable the development of more effective review tools.
In this paper, we attempt to learn the semantics of users' sensitive information in multimodal social network environments. We focus on the application of multimodal data interaction, feature fusion, knowledge perception, and related data mining in the field of social network privacy protection. Considering the characteristics of social networks, such as the diversity of data types posted by users, the accumulation of historical information leading to the leakage of sensitive information, and the differences in the definitions of sensitive information among different users, we propose an improved multimodal data fusion dual-channel multihop reasoning mechanism based on information content, data attributes, and user features, taking into account users' background knowledge and the historical records of data published in social networks. The mechanism realizes the interaction of different modal data, explores the implicit correlations between different modalities, and determines the meaningful sensitive features in historical data. In addition, based on the user-defined sensitive list, we propose an adaptive multimodal spatial attention mechanism to generate an understanding of user-sensitive information, implement the rapid screening of implicit sensitive information, and prevent privacy information leakage caused by data association.
As shown in Figure 1, in multimodal sensitive information recognition, we consider the potential semantic dependencies in visual and textual contexts, attempt to mine the implicit correlations between multimodal data, and enhance the semantic representation of sensitive information through feature fusion and adaptive attention mechanisms. Given a user's social dynamics (including images, image descriptions, text, sensitive lists, and historical privacy settings), we can improve the accuracy of the decoder's response through iterative interaction, knowledge reasoning, and the fusion of visual and textual features. In this way, we can obtain sensitive items that may reveal the user's privacy.
The structure of this article is as follows. In Section 1, we discuss the existing challenges in the field of identifying sensitive information in online social networks and introduce our proposed solution. Section 2 provides an overview of previous research on privacy protection in online social networks, focusing on the problems and challenges that motivated our approach. In Section 3, we describe the user-sensitive data leakage problem that this paper aims to solve. Section 4 details the method we propose for identifying sensitive information, including our approach to feature extraction, the improved multimodal semantic strategy, and the multimodal adaptive spatial attention mechanism. In Section 5, we describe our experimental procedures and analyze the results of our experiments.
The main contributions of this paper are summarized as follows: (1) We propose an improved two-channel multihop reasoning mechanism for interactive reasoning over user image and text data in social networks, mining and exploiting the implicit correlations between multimodal data. It bridges the semantic gap between cross-modal data and enriches the semantic representation of privacy in query text and images.
(2) Users' personal sensitivity preferences are a major difficulty for privacy protection technology. We enhance the representation of sensitive information preferences by adding a user-defined sensitive list and feeding it into the two-channel multihop reasoning mechanism, finally realizing personalized user privacy preferences. (3) We design an improved multimodal spatial attention codec architecture to dynamically select the feature information that requires attention, so as to achieve accurate recognition of sensitive information.

Related Work
As the Internet continues to develop rapidly, a wide variety of social platforms have emerged, and users often provide personal information when using these platforms, including identification numbers, phone numbers, addresses, and health data. However, as technologies such as big data, cloud computing, and deep learning have evolved, network privacy vulnerabilities have become increasingly serious. The security environment of network communities is a major concern for both domestic and foreign users. Maintaining a secure environment for the dissemination of information in network communities has become a major challenge that needs to be addressed.

Research on Privacy Protection.
Existing privacy protection methods for social networks are mostly based on anonymity algorithms or on differential privacy-based social network privacy protection models. The former mainly addresses potential privacy attack problems, ensuring that an attacker cannot identify an individual from a dataset consisting of multiple individual records and corresponding sensitive personal data. To overcome the weaknesses of traditional k-anonymity, Al-Asbahi [1] used an l-diversity method based on clustering techniques to deliver more substantial privacy protection and structural anonymity. To reduce the risk of sensitive information leakage or the loss of a large amount of information, Lian and Chen [2] proposed a personalized (α, p, k) anonymous privacy protection algorithm: according to the sensitivity level of sensitive attributes, different anonymization methods are applied to different levels of sensitive values in equivalence classes to achieve personalized privacy protection of sensitive attributes. In addition, location privacy research plays a positive role in preventing user-side privacy leakage. Unlike traditional methods, Theodorakopoulos et al. [3] proposed dynamic location-histogram privacy, focusing on the frequency with which different locations are accessed. Ruan et al. [4] proposed an efficient location sharing protocol that supports location sharing between friends and strangers while protecting user privacy.
(Figure 1 example input — Text: "My work place is far away from where I live. It takes me an hour to drive every day, but luckily it is very close to the Jingjiang Bookstore, which makes reading convenient." Sensitive list: location, number, health information. Historical: ten social dynamic posts with privacy item annotations.)
Security and Communication Networks

The effectiveness of a deep learning model is proportional to the amount of available data, and large-scale data are indispensable. To enhance the availability of privacy-sensitive data in third-party infrastructure, Xu et al. [5] designed a secure computing protocol for a hybrid functional encryption scheme, training deep neural networks on multiple encrypted datasets collected from multiple sources; while ensuring the accuracy of the model, data confidentiality is improved. Despite this, data owners still worry about privacy leaks when providing sensitive data for model training. To solve this problem, Li et al. [6] proposed noninteractive privacy-preserving multiparty machine learning, providing an effective communication method for data owners. Similarly, Wang et al. [7] proposed that all sensitive data be operated on in ciphertext rather than decrypted during the model training and epidemic risk prediction stages. Lei et al. [8] considered privacy from a more granular perspective, protecting users' facial features based on reversibility and reusability. In addition, data encryption is a common means of privacy protection. Idepefo et al. [9] combined blockchain technology with cryptography, hashing, and consensus mechanisms. Xie et al. [10] proposed a hybrid data method based on homomorphic encryption and AES and constructed a multiclass support vector machine-based privacy-preserving medical data sharing system. However, excessively encrypted data will reduce the accuracy of social user recommendations. Therefore, Chen et al.
[11] built on the additive secret sharing scheme and proposed a secure comparison protocol and a division protocol, which strengthen the privacy protection of recommendation system data. In addition, some scholars are committed to privacy protection research in the Internet of Things [12][13][14][15], based on context-based privacy protection and user-feature-based privacy protection in social networks. Chen et al. [16] proposed a privacy-preserving optimal nearest neighbor query (PP-OCQ) scheme that implements secure optimal nearest neighbor queries in a distributed manner without disclosing sensitive user information. Li and Zeng [17] presented a novel NRL model for generating node embeddings that can handle the data incompleteness resulting from user privacy protection. Additionally, they proposed a structure-attribute enhanced matrix (SAEM) to mitigate data sparsity and developed a community-cluster informed NRL method, c2n2v, to further enhance the quality of embedding learning. Zhang et al. [18] developed a machine learning-based method to detect malicious services and protect user data through direct and indirect trust, effectively controlling or associating leaked datasets in online social networks (OSNs) and establishing a trust evaluation model for OSNs. These privacy protection technologies are steadily maturing, but this research has not considered the multimodality of social network data: the data analyzed are thin, and multimodal data cannot be integrated into the analysis.

Sensitive Information Identifcation.
Effective identification of sensitive information is an effective way to improve privacy protection. Some scholars have studied automatic detection models for sensitive information. Heni and Gargouri [19] presented a method for identifying sensitive information in MongoDB data storage, which uses semantic rules to determine the concepts and language components that must be segmented, retrieves the attributes that semantically correspond to the concepts, and is implemented as an expert system for automatically detecting candidate attribute segments. Ding et al. [20] constructed a corpus to train a detection model, applied the BERT method to the detection problem, and finally obtained a BERT-based automatic detection model for sensitive information. Botti-Cebriá et al. [21] proposed an auxiliary agent that detects sensitive information according to the different categories (i.e., location, personal data, health, personal attacks, emotion, etc.) detected in a message. Liu et al. [22] trained a decision tree on sensitive data, which can classify and identify known data and can mark and encrypt the identified sensitive information to achieve intelligent recognition and protection of sensitive information. Kaul et al. [23] proposed a knowledge- and learning-based adaptive system for sensitive information identification and processing. Gao et al. [24] used image captioning technology to track the spread of image information on the network through text. Wang et al. [25] described the underlying reasoning behavior through Bayesian networks, resisting attackers' inference attacks on sensitive information. Petrolini et al. [26] developed a classifier that can monitor documents containing sensitive data, making it easier to identify and protect sensitive information. Bracamonte et al. [27] studied users' perceptions of sensitive information monitoring tools, analyzing their reactions quantitatively and qualitatively. Wu et al.
[28] proposed a constraint measure to minimize the spread of sensitive information and relied on the bandit framework to adaptively apply the spread constraint measure. Singh et al. [29] used local sampling to generate differentially private sensitive information, producing useful representations while maintaining privacy. Gao et al. [30] proposed a scheme that can audit the integrity of all encrypted cloud files matching keywords of interest to users by providing only encrypted keywords to the TPA, while being unable to infer sensitive information such as which files contain a keyword or how many files contain it.
Neerbek [31] proposed learning, with recursive neural networks, the phrases and structures that distinguish sensitive from nonsensitive documents. With the rapid growth of cloud computing and the remote workforce, organizations must handle a large amount of unstructured data, so automatic detection and recognition of secrets and sensitive information in structured and unstructured data is particularly important. Ahmed et al. [32] showed the benefits of using deep learning to identify context-related sensitive information in unstructured data. Botti-Cebriá et al. [33] proposed a method for automatically monitoring sensitive information in educational social networks. Cai et al. [34] first applied three enhanced NER techniques to Chinese sensitive information recognition based on the study of unstructured data, greatly resolving the uncertainty and ambiguity of Chinese vocabulary and improving the accuracy of sensitive information recognition. However, single-mode data analysis has certain limitations in inferring sensitive information in current social networks.

Multimodal Feature Fusion.
The different modes of data dissemination diversify data modalities, so multimodal fusion is gradually being applied in various research fields. In sentiment analysis, the importance of single-modal data to the emotional result is not constant: as the time dimension extends, the emotional attributes of a specific natural language utterance are affected by the nonnatural language data. Qi et al. [35] fully considered the long-term dependency between modes and the offset effect of nonnatural language data on natural language data, resolving long-term dependencies within modes. Yan et al. [36] adopted a tensor fusion network to model the interaction of multiple modes and achieve emotion prediction from multimodal features. Hu et al. [37] proposed a graph dynamic fusion module to fuse multimodal context features in conversation. Chen et al. [38] proposed a feature fusion method based on K-means clustering and kernel canonical correlation analysis (KCCA), which produces a higher recognition rate than existing methods (such as aware segmentation and tagging methods). Due to the inherent characteristics of each mode, it is difficult for a model to use all modes effectively when fusing modal information. Zou et al. [39] proposed the concept of a main mode and used main-mode transformation to improve the effect of multimodal fusion. Yoon [40] proposed a cross-modal translator that can translate between three modes and can train multimodal models on three modes using different types of heterogeneous datasets. Ghosh et al. [41] developed a multimodal multitask framework that utilizes a novel multimodal feature fusion technique and a contextuality learning module to handle emotional reasoning (ER) and accompanying emotions in conversations. Then, for the rumor detection task, Wu et al.
[42] proposed a new multimodal co-attention network (MCAN), which fuses multimodal (textual and visual) features extracted from the text, spatial domain, and frequency domain as a method for detecting fake news. Experiments show that MCAN is able to learn correlations among multimodal features. Dhawan et al. [43] proposed an end-to-end trainable framework based on graph neural networks (GAME-ON), which allows direct interaction between different modalities, and evaluated the framework on two effectiveness parameters using publicly available datasets. Azri et al. [44] proposed a multimodal fusion framework to evaluate message accuracy in social networks (MONITOR), which adopts supervised machine learning and utilizes all message features (text, social context, and image features) to provide interpretability for decision making. Chen et al. [45] proposed a multimodal fusion network (MFN) to integrate text and image data from social media, which uses a self-attention fusion (SAF) mechanism for feature-level fusion. On the other hand, video captioning, which automatically describes video clips with natural language sentences, is a very challenging computer vision task. Bhooshan et al. [46] proposed a neural architecture based on discrete wavelet convolution and multimodal feature attention to generate video captions. Gao et al. [47] proposed a new paradigm for encrypted cloud data integrity auditing based on sensitive-information privacy keywords. In this scheme, only a trusted third-party auditor (TPA) possessing the encrypted keywords can audit the integrity of all encrypted cloud files containing user-relevant keywords. The scheme utilizes relationship authentication labels (RALs) to determine which files contain the keywords and how many files contain sensitive information related to those keywords.
Experimental results demonstrate that the proposed scheme satisfies correctness, audit soundness, and sensitive information confidentiality. In addition, visual question answering (VQA), which has emerged in recent years, is also a research hotspot in the field of computer vision. How to fuse multimodal features extracted from images and questions is a key issue in VQA. Zhang et al. [48] designed an effective and efficient module to reason about complex relationships between visual objects. They also learned a bilinear attention module to guide attention over visual objects based on the given question. This combination of visual relationships and attention achieves more fine-grained feature fusion. Chen et al. [49] adopted a dual-channel multihop inference mechanism to reason over and fuse image and text features, achieving cross-modal information interaction. Besides, Wang et al. [50] applied multimodal fusion to a similar-user recommendation system and proposed an implicit user preference prediction method with multimodal feature fusion: combining text and image features in user posts, image and text features are extracted using convolutional neural network (CNN) and text CNN models, respectively, and these features are then combined into a representation of user preferences using early and late fusion methods. Finally, a list of users with the most similar preferences is suggested. Ding et al. [51] applied multimodal fusion to sarcasm detection and proposed a multimodal post-fusion sarcasm detection method with a three-level fusion structure and a residual network model, which can better fuse the three modalities into a unified semantic space, thereby improving sarcasm detection. Xiao and Fu [52] combined visual-language fusion and knowledge graph reasoning to further obtain useful information. To effectively detect multimodal sarcastic tweets, Xu et al.
[53] constructed the decomposition and relation network (D&R Net) to model cross-modality contrast in the associated context. In this network, the decomposition network represents the commonalities and differences between images and texts, while the relation network models the semantic associations in the cross-modality context. Sankaran et al. [54] developed a refiner fusion network (ReFNet) that enables fusion modules to combine strong unimodal representations with strong multimodal representations. This approach addresses a large gap in existing multimodal fusion frameworks by ensuring that unimodal and fused representations are strongly encoded in the latent fused space.
Inspired by the field of visual dialogue, the task of identifying sensitive information on social networks is to fully understand the privacy semantics of users, recognizing privacy items not only from textual history but also from visual information. To meet these expectations, the following questions need to be considered. First, to ensure the comprehensiveness of the analysis results, we use multimodal data (images and text) to decompose and integrate the features of different posted data, which is a daunting task. Second, how can our reasoning mechanism resemble the visual dialogue process, constantly adjusting its final conclusion based on newly obtained information? Finally, since user-sensitive information varies from person to person, how can user sensitivity preferences be incorporated into the information reasoning mechanism to enhance sensitive semantic information and achieve personalized protection of user privacy?

Problem Description
The first problem we face is how to obtain a large and diverse set of sensitive data. This is a common challenge in the field of sensitive information recognition, as there are no publicly available and widely accepted sensitive information datasets. In fact, all existing sensitive information recognition schemes rely on private datasets that cannot be accessed for free, which is understandable because publishing sensitive information may be illegal. To overcome this limitation, we decided to manually collect and annotate real data for the sensitive information recognition task. This enabled us to create a dataset that can be used for training and testing our model without violating any laws or regulations. Below are three examples of social updates from a user in the past month that we believe may reveal privacy and carry potential risks.
As shown in Figure 2, Mike posted a picture of his new car on an online social platform with the caption "Just bought this beautiful car! Can't wait to take her for a spin!" The image clearly shows the car's manufacturer and model, as well as the license plate number. If this information is viewed by the wrong person, it could potentially be used to locate and steal the user's car.
As shown in Figure 2, Mike posted a message saying "I just got a new job at ByteDance! I'll be starting there next week, and I'm really excited." While this message may seem harmless, it could be sensitive if the user has not yet notified his current employer of his resignation plans. The information in this message, if viewed by the wrong person, could be used to harm the user's current employment situation or steal his identity.
As shown in Figure 2, Mike posted a picture of himself and his family with the caption "Having a great time on vacation in Chengdu! I can't wait to explore more of the city tomorrow." The picture shows the user and his family standing in front of a well-known tourist spot in the city, and the caption includes the name of the city they are vacationing in. In this case, the user's decision to post the picture and caption on a social platform is likely to reveal sensitive information, such as his current location and the fact that his home may be unoccupied. If this information is viewed by the wrong person, it could potentially be used to locate the user and break into his home while the family is away on vacation.
These social updates show that it is always important to be cautious about the information shared on social media platforms, as it can potentially be used by others in harmful ways. If we focus only on single-modality data, it is difficult to fully capture the semantic information contained in the data and to infer potential sensitive information. More troubling, across multiple social updates, information related to a natural person can appear in sentences other than those containing user-sensitive information, or even in any sentence. Searching for privacy information in social updates over a broad range therefore introduces the risk of false positives. In our setting, the task goal is to find data that may contain sensitive information in dynamic data within a certain range, which means that we need to reduce the false negative rate as close to zero as possible. It is therefore important to research and design multimodal semantic analysis strategies.
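To make the evaluation target above concrete, the false negative rate can be computed as follows. This is a minimal illustrative sketch with hypothetical labels (1 = sensitive, 0 = nonsensitive); it is not part of the paper's evaluation code.

```python
def false_negative_rate(y_true, y_pred):
    """FNR = FN / (FN + TP): the fraction of truly sensitive items missed."""
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    return fn / (fn + tp) if (fn + tp) else 0.0
```

Driving the FNR toward zero means the recognizer should almost never miss a genuinely sensitive post, even at the cost of some false positives.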
Petrolini et al. [26] introduced the concept of "sensitive topics" in their research on sensitive information, which helps judge whether a sentence is sensitive based on an analysis of its topic. Unfortunately, this unbiased approach ignores users' personalized sensitivity preferences. We add different users' sensitive lists to solve this problem. We collected user data and grouped the sensitive topics according to each user's sensitive list. Table 1 lists the five main sensitive items for 50 users. For these sensitive items, we searched for their hottest posts and related comments to obtain information about elements that are likely to be related to sensitive topics.
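As an illustration of grouping posts against a user-defined sensitive list, the sketch below uses simple keyword matching. The topic names and keyword sets are hypothetical examples (loosely modeled on the sensitive list shown in Figure 1); the paper's actual grouping procedure may differ.

```python
import re

# Hypothetical user-defined sensitive list: topic -> trigger keywords.
SENSITIVE_LIST = {
    "location": ["address", "city", "vacation", "home"],
    "number": ["phone", "license plate"],
    "health information": ["hospital", "diagnosis", "medication"],
}

def tag_sensitive_topics(post, sensitive_list=SENSITIVE_LIST):
    """Return the topics from the user's sensitive list whose keywords
    appear as whole words or phrases in the post text."""
    return [topic for topic, kws in sensitive_list.items()
            if any(re.search(rf"\b{re.escape(kw)}\b", post, re.I) for kw in kws)]
```

For example, `tag_sensitive_topics("Having a great time on vacation in Chengdu!")` tags the post with the "location" topic. Keyword matching like this only catches explicit mentions; the multimodal reasoning mechanism proposed later is what handles implicit leakage.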

Architecture
In the process of protecting privacy data on social networks, understanding user-sensitive information is a critical bottleneck; it typically requires analyzing the user's historical resource data and historical access control settings and continuously adjusting to determine the user's sensitivity preferences. This process requires multiple adjustments and inferences. Using a multimodal dual-channel multihop reasoning mechanism to determine user sensitivity preferences can exploit the rich latent information between multimodal privacy data to generate access control privacy permissions.
As shown in Figure 3, in this study we focus on improving the two-channel multihop inference mechanism proposed by Chen et al. [49] for extracting sensitive information from user-posted resource data on social networks. First, we represent the privacy information of users' historical texts and images as feature representations. All modal privacy feature representations interact iteratively through the two-channel multihop inference mechanism. After multihop interaction, the multimodal features are fused through the attention mechanism and finally input to the decoder to generate an understanding of user-sensitive information.
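The dual-channel multihop idea (each channel repeatedly querying the other modality to refine its summary) can be sketched minimally as below. This is an illustrative NumPy sketch under our own simplifying assumptions (dot-product attention, mean-pooled initial queries, text and image features already projected into a shared dimension); it is not the exact formulation of Chen et al. [49].

```python
import numpy as np

def attend(query, keys):
    """Dot-product attention: weight each key vector by its similarity
    to the query, then return the weighted sum."""
    scores = keys @ query / np.sqrt(query.size)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ keys

def dual_channel_multihop(text_feats, img_feats, hops=2):
    """Alternate hops: the text channel queries image regions, and the
    image channel queries text tokens, iteratively refining both summaries."""
    t = text_feats.mean(axis=0)  # initial text query
    v = img_feats.mean(axis=0)   # initial visual query (unused after hop 1)
    for _ in range(hops):
        v = attend(t, img_feats)   # text-guided visual summary
        t = attend(v, text_feats)  # vision-guided text summary
    return np.concatenate([t, v])  # fused feature passed on to the decoder
```

Each additional hop lets information propagate one more step between modalities, which is what allows implicit cross-modal correlations to surface before the final fusion.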

Feature Representation.
The inputs of this encoder are the social dynamic text, sensitive list, historical privacy settings, image description, and image, and the outputs are learned feature representations for the language and visual modalities. As shown in Figure 2, the text and image are passed through Bi-LSTM [55] and pretrained Faster R-CNN [56], respectively, to obtain the corresponding feature vectors in preparation for multimodal feature interaction reasoning.

Text Feature Representation.
Bi-LSTM has been widely used for contextual text feature extraction; it can process historical and future information in sequence data and capture long-term dependencies, thereby improving the accuracy and efficiency of sequence modeling tasks. The text input of the task includes the dynamic text D_q of user U_1, the picture description P_q, and the sensitive list L. For convenience of processing, we combine the dynamic text and picture description to generate the resource text T_q and use pretrained GloVe to vectorize the input text data.
[Figure: example social posts, e.g., (a) "Just got this beautiful car! I can't wait to take her for a ride!" (Mike); (b) "I just got a new job at ByteDance! I'm excited to start working there next week." (Mike); (c) "What a wonderful holiday in Chengdu! I can't wait to explore more cities tomorrow."]

We combine the dynamic text and picture description to generate the resource text T_q and use pretrained GloVe to vectorize the input text data, yielding the word embeddings of the resource text, T_q = {t_q1, t_q2, t_q3, ..., t_qm}. This allows the text vectors to carry more semantic and grammatical information. We then use Bi-LSTM to generate the hidden sequence b = {b_q1, b_q2, b_q3, ..., b_qm} and take the last hidden state as the resource text feature t_q, as shown in equations (1) and (2):

b_qi = Bi-LSTM(t_qi, b_q(i−1)),    (1)
t_q = b_qm.    (2)

The historical privacy settings L and the dynamic sensitive items S are embedded in the same way and passed through the Bi-LSTM to generate the historical privacy features H_q = {h_q1, h_q2, h_q3, ..., h_qn} and the sensitive item features S_q = {s_q1, s_q2, s_q3, ..., s_qn}.
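As a minimal sketch of this encoding step, the following toy code uses a simplified tanh Bi-RNN as a stand-in for the Bi-LSTM and a random table as a stand-in for pretrained GloVe embeddings; all names, vocabulary entries, and sizes are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary and embedding table standing in for GloVe.
vocab = {"<pad>": 0, "new": 1, "car": 2, "holiday": 3, "chengdu": 4}
d_emb, d_hid = 8, 6
E = rng.normal(size=(len(vocab), d_emb))  # word embedding matrix

def rnn_pass(X, W, U, b):
    """One directional tanh-RNN pass (simplified stand-in for an LSTM)."""
    h = np.zeros(d_hid)
    states = []
    for x in X:
        h = np.tanh(W @ x + U @ h + b)
        states.append(h)
    return states

def bi_rnn(tokens):
    """Encode token ids; concatenate forward and backward states per step,
    mirroring the use of the Bi-LSTM hidden sequence b = {b_q1, ..., b_qm}."""
    X = [E[t] for t in tokens]
    Wf, Uf, bf = rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid)
    Wb, Ub, bb = rng.normal(size=(d_hid, d_emb)), rng.normal(size=(d_hid, d_hid)), np.zeros(d_hid)
    fwd = rnn_pass(X, Wf, Uf, bf)
    bwd = rnn_pass(X[::-1], Wb, Ub, bb)
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd[::-1])]

b_seq = bi_rnn([1, 2, 0])  # hidden sequence over a 3-token resource text
t_q = b_seq[-1]            # resource text feature: last hidden state
print(t_q.shape)           # (12,) -- forward + backward hidden states
```

The historical privacy settings and sensitive items would be encoded by the same routine, each producing its own hidden sequence.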

Image Feature Representation.
Faster R-CNN on ResNet-101, pretrained on the Visual Genome data, implements bottom-up attention to extract visual features of salient regions in input images. Specifically, we employ the Faster R-CNN framework to obtain object detection boxes in the input images. Non-maximum suppression is then performed over the object regions, and the top K (K = 36) detection boxes are selected, each with a feature of size G (G = 2048). For each selected region proposal i, define v_i to be the mean-pooled convolutional feature for that region, so that the final representation of the input image is

V = {v_1, v_2, ..., v_K}.

This approach uses Faster R-CNN as a "hard" attention mechanism, since relatively few image regions are selected from a large number of possible configurations. In addition, we also record the scaled geometric features of the selected image regions, denoted as B = {b_1, b_2, ..., b_K}, where

b_i = (x_i / w, y_i / h, w_i / w, h_i / h),

and x_i, y_i, w_i, and h_i are the coordinates, width, and height of the selected region i, while w and h are the width and height of the input image. These scaled geometric features are input into our multimodal dual-channel information inference module.
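The scaled geometric features can be computed as below; the exact layout of b_i (top-left corner and box size, each normalized by the image width and height) is an assumption, since the original equation is not reproduced here, and the boxes and image size are toy values:

```python
import numpy as np

def scaled_geometry(x, y, wi, hi, w, h):
    """Scaled geometric feature for one detected region: a plausible form of
    b_i with corner and size normalized by image width/height (assumption)."""
    return np.array([x / w, y / h, wi / w, hi / h])

# Toy 640x480 image with two hypothetical detection boxes (x, y, w_i, h_i).
boxes = [(32, 48, 128, 96), (320, 240, 64, 64)]
B = np.stack([scaled_geometry(*b, w=640, h=480) for b in boxes])
print(B.shape)  # (2, 4) -- one 4-d geometric feature per selected region
```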

Multimodal Sensitive Information Reasoning.
Multimodal data contain not only intramodal information but also rich cross-modal information. To learn the rich intramodality and intermodality information in multimodal sensitive data, most existing multimodal deep learning models first use a deep model to capture the private features within each modality and transform the modality-specific original representations into highly abstract representations in a global space. These abstract representations are then concatenated into a single vector that serves as a multimodal global representation. Finally, a deep model is used to model the high-level abstraction of the concatenated vector [57]. However, with this method the representations of the modalities are combined in a linear manner, which cannot adapt to complex relationships across multiple modalities and cannot capture the full semantic knowledge of the multimodal data. It follows that combining deep learning with a semantic fusion strategy is an effective way to address multimodal data fusion.
This section presents a new semantic fusion strategy that feeds multichannel sensitive information derived from user features through a multimodal dual-channel multihop reasoning mechanism. The dual-channel multihop reasoning mines the hidden semantic associations among the modalities and jointly performs in-depth reasoning on sensitive information. This mechanism is mainly used in the field of visual dialogue, where it performs well in question answering.
As shown in Figure 4, the dual-channel sensitive information multihop reasoning mechanism is realized through two modules, namely, the image module and the text module. The image module understands sensitive semantic information through the image features, and the text module understands sensitive semantic information from the historical privacy features. The reasoning path of the image module is I_1 ⟶ H_2 ⟶ I_3 ⟶ ... ⟶ I_n, and the reasoning path of the text module is H_1 ⟶ I_2 ⟶ H_3 ⟶ ... ⟶ H_n. After the two modules are built, their outputs interact over multiple iterations. This interaction and synchronous capture of information not only exploits the hidden associations between text and images but also greatly enriches the understanding of sensitive semantic information.

Image Module Initialization.
The image module is designed to enrich the semantic representation of sensitive information from images. Its input is the query text feature t_picture and the image features v, and its output is a privacy-aware representation of the image features. We first map these feature vectors to d_picture-dimensional vectors and then use the attention mechanism to compute soft attention over all object detections, as shown in the following equation:

z = W_s (f(v) ∘ f(t_picture)),

where f is a two-layer perceptron with ReLU activation that maps the input features to dimension d_picture, W_s is a weight matrix followed by softmax activation, and ∘ is the Hadamard product. The privacy-aware attention weights α are obtained by the following equation:

α = softmax(z).

The privacy-aware attention weights are then applied to the image features v, and the privacy-aware representation of the image is computed by the following equation:

v̂ = Σ_{i=1}^{K} α_i v_i.
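A minimal NumPy sketch of this soft-attention step follows, under the reconstructed form z = W_s (f(v) ∘ f(t_picture)) with α = softmax(z); all weight matrices, feature sizes, and inputs are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
K, G, d = 4, 16, 8  # toy sizes: K regions, G-dim region features, d_picture

def f(x, W1, W2):
    """Two-layer perceptron with ReLU, projecting its input to dimension d."""
    return W2 @ np.maximum(W1 @ x, 0)

v = rng.normal(size=(K, G))     # image region features v_1..v_K
t_pic = rng.normal(size=(12,))  # query text feature t_picture
W1v, W2v = rng.normal(size=(d, G)), rng.normal(size=(d, d))
W1t, W2t = rng.normal(size=(d, 12)), rng.normal(size=(d, d))
W_s = rng.normal(size=(d,))

# z_i = W_s (f(v_i) ∘ f(t_picture)); α = softmax(z)
q = f(t_pic, W1t, W2t)
z = np.array([W_s @ (f(vi, W1v, W2v) * q) for vi in v])
alpha = np.exp(z - z.max())
alpha /= alpha.sum()

v_hat = alpha @ v  # privacy-aware image representation (weighted sum)
print(v_hat.shape)  # (16,)
```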

Text Module Initialization.
The text module is designed to enrich the semantic representation of sensitive information from historical texts. Its input is the query text feature t_text and the historical privacy features h, and its output is a query-aware representation of the text privacy features, as shown in equations (9) and (10):

z = W_z (f(h) ∘ f(t_text)),    (9)
η = softmax(z),    (10)

where f is a two-layer perceptron with ReLU activation that maps the input features to dimension d_text, W_z is a weight matrix followed by softmax activation, and ∘ is the Hadamard product. From equations (9) and (10), we obtain the query-aware attention weights η. Next, these attention weights are applied to the historical privacy features h to compute the query-aware representation of historical privacy, as shown in equation (11):

ĥ = Σ_{j=1}^{n} η_j h_j.    (11)

Next, ĥ is passed through a two-layer perceptron with ReLU activation in the middle, and the sensitive list representation s is added to strengthen the sensitive semantics of the historical text features, finally yielding the perceptual representation of historical privacy, as shown in equations (12) and (13):

g = f(ĥ),    (12)
t_text^out = LayerNorm(g + s).    (13)
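The final enhancement step, a two-layer perceptron over the query-aware representation followed by adding the sensitive-list representation and layer normalization, can be sketched as follows; the sizes and random weights are illustrative assumptions:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no affine params)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

rng = np.random.default_rng(2)
d = 8
h_hat = rng.normal(size=d)  # query-aware historical privacy representation
s = rng.normal(size=d)      # sensitive-list representation
W1, W2 = rng.normal(size=(d, d)), rng.normal(size=(d, d))

g = W2 @ np.maximum(W1 @ h_hat, 0)  # two-layer perceptron with ReLU
t_out_text = layer_norm(g + s)      # LayerNorm(g + s)
print(t_out_text.shape)  # (8,)
```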

Dual-Channel Multihop Reasoning.
After initializing the image module and the text module, the information of the two modules must interact iteratively so as to deeply mine and exploit the implicit relationship between the image and the text. There are two types of multihop inference: one starts from the image and ends with the image, and the other starts and ends with the historical privacy. We implement each inference path through an image module and a text module.

Reasoning path 1 (starting and ending with the image): After initializing the image module with the user's image features v and the query text t_q, t_picture^out1 is obtained from the image module; it is then combined with the historical privacy features h and input into the text module, which computes t_text^out2; this is in turn combined with the image features v and input into the image module to obtain t_picture^out3. This interactive reasoning process then iterates in the same way, and the inference result t_picture^outn of the image module is finally obtained. The specific process is as follows:

Step 1: Picture(t_q, v) ⟶ t_picture^out1
Step 2: Text(t_picture^out1, h) ⟶ t_text^out2
Step 3: Picture(t_text^out2, v) ⟶ t_picture^out3
Repeat steps 1, 2, and 3 iteratively.

Reasoning path 2 (starting and ending with text): After initializing the text module with the user's historical privacy features h, privacy list features s, and query features t_q, t_text^out1 is obtained from the text module. The image features v are then input into the image module to compute t_picture^out2. Next, the historical privacy features h and the privacy list features s are combined and input into the text module to obtain t_text^out3. This interactive reasoning process continues in the same manner, and the inference result t_text^outn of the text module is finally obtained. The specific process is as follows:

Step 1: Text(t_q, h, s) ⟶ t_text^out1
Step 2: Picture(t_text^out1, v) ⟶ t_picture^out2
Step 3: Text(t_picture^out2, h, s) ⟶ t_text^out3
Repeat steps 1, 2, and 3 iteratively.
Through the dual-channel multihop reasoning mechanism, the final result of multimodal feature interactive reasoning is obtained, in preparation for the subsequent feature fusion.
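The alternating hop schedule of the two reasoning paths can be sketched as follows; `picture_module` and `text_module` are hypothetical placeholders that only record the hop order, not the attention computations themselves:

```python
def picture_module(query, v):
    """Hypothetical stand-in for the image-module attention step."""
    return f"pic({query})"

def text_module(query, h, s=None):
    """Hypothetical stand-in for the text-module attention step."""
    return f"txt({query})"

def reasoning_path_1(t_q, v, h, hops=3):
    """Image channel: hops alternate image -> text -> image -> ..."""
    out = picture_module(t_q, v)
    for i in range(1, hops):
        out = text_module(out, h) if i % 2 == 1 else picture_module(out, v)
    return out

def reasoning_path_2(t_q, v, h, s, hops=3):
    """Text channel: hops alternate text -> image -> text -> ..."""
    out = text_module(t_q, h, s)
    for i in range(1, hops):
        out = picture_module(out, v) if i % 2 == 1 else text_module(out, h, s)
    return out

trace_1 = reasoning_path_1("t_q", v=None, h=None)
trace_2 = reasoning_path_2("t_q", v=None, h=None, s=None)
print(trace_1)  # pic(txt(pic(t_q)))
print(trace_2)  # txt(pic(txt(t_q)))
```

The nested traces make the hop order explicit: each channel's output passes through the opposite module on every other hop before returning to its own module.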

Multimodal Fusion.
Before fusing the multimodal representations t_picture^n and t_text^n generated by the image module and the text module, we use the query text feature t to enhance them, as shown in equations (14) and (15):

t̂_picture^n = f([t, t_picture^n]),    (14)
t̂_text^n = f([t, t_text^n]),    (15)

where f is a two-layer perceptron with ReLU activation and [·, ·] denotes concatenation. After obtaining the enhanced representations, the outputs of the two modules are fused, as shown in equations (16) and (17):

e = W_f [t̂_picture^n, t̂_text^n],    (16)
ê = LayerNorm(e),    (17)

where W_f is a learnable parameter matrix and ê is the fused encoder output.

Multimodal Adaptive Spatial Attention Decoder.
A multimodal spatial attention decoder is a neural network architecture that combines information from multiple modalities, such as audio, video, text, and image data, to make predictions or perform other tasks. It uses an attention mechanism to weigh the importance of each modality in a given context and then combines information from all modalities in a way that allows the network to make more accurate predictions.

Attention Mechanism.
The essence of the attention mechanism is to locate the information of interest and suppress useless information. In the multimodal spatial attention decoder, the attention mechanism measures the importance of each modality in a given context. This means that the network can focus more on one modality than another depending on the task: for example, if the network is trying to recognize a spoken word, it may attend more to the audio data than to the visual data. Once the network has weighed the importance of each modality, it combines information from all modalities to make more accurate predictions; this may involve simply concatenating the information from all modalities or more complicated processing. The exact details of how a multimodal spatial attention decoder performs this fusion depend on the specific architecture of the network.

Decoder.
The multimodal decoder employed in this paper is an improvement on the adaptive spatial attention decoder. A recurrent neural network-based approach is adopted that not only focuses on meaningful information but also decides, as needed, whether to rely on visual information or the language model to predict the next word in a sentence. The multimodal recurrent neural network can bridge the probabilistic correlation between images and sentences, which solves the problem in previous work that new sensitive information cannot be generated when the corresponding sensitive information is retrieved from a sentence database based on a learned image-text mapping. Unlike previous work, the recurrent neural model learns a joint distribution over words and images in a semantic space. When multimodal features are present, the temporal dependencies hidden in the multimodal data can be analyzed with the help of explicit state transitions in the hidden-unit computations; the parameters are trained with the backpropagation-through-time algorithm, and sentences are generated word by word from the captured joint distribution. As shown in Figure 5, in the encoder-decoder framework, using the multimodal fused feature representation and the query features from the previous stage, the log-likelihood of the joint probability distribution can be decomposed into ordered conditionals by the chain rule, as shown in the following equation:

log p(y_1, ..., y_T | ê) = Σ_{t=1}^{T} log p(y_t | y_1, ..., y_{t−1}, ê).

Each conditional probability is modeled with a recurrent neural network, as shown in the following equation:

p(y_t | y_1, ..., y_{t−1}, ê) = f(h_t, c_t),

where f is a two-layer perceptron activated by ReLU and h_t is the hidden state of the RNN at time t. In this paper, an LSTM is used to model h_t, as shown in the following equation:

h_t = LSTM(y_{t−1}, h_{t−1}),

where y_{t−1} is the representation of the sensitive information generated at time t − 1.
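The chain-rule decomposition can be illustrated with a toy recurrent decoder that accumulates the per-step log probabilities; a plain tanh recurrence stands in for the LSTM, and all sizes, weights, and token ids are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
V, d = 10, 6  # toy vocabulary size and hidden size

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# A toy recurrent step stands in for the LSTM: h_t = tanh(W x_t + U h_{t-1}).
W, U = rng.normal(size=(d, d)), rng.normal(size=(d, d))
W_out = rng.normal(size=(V, d))
E = rng.normal(size=(V, d))  # word embeddings (maps y_{t-1} to the input x_t)

def sequence_log_likelihood(tokens, h0):
    """log p(Y) = sum_t log p(y_t | y_<t): one softmax term per time step."""
    h, x, ll = h0, np.zeros(d), 0.0
    for y in tokens:
        h = np.tanh(W @ x + U @ h)
        ll += np.log(softmax(W_out @ h)[y])
        x = E[y]  # feed the generated word back in as the next input
    return ll

ll = sequence_log_likelihood([1, 4, 2], h0=np.zeros(d))
print(ll < 0)  # True: a log-likelihood of any proper sequence is negative
```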
Given the query feature t, historical privacy features h, image features v, privacy list s, and hidden state h_t, a single-layer perceptron with a softmax function generates attention distributions over the query features, the sensitive list features, the T rounds of historical privacy, and the K object detection features of each image. The spatial attention model defines the multimodal context vector c_t from the following components. The first is the historical privacy vector m_h, defined as follows:

m_h = softmax(W_q tanh(W_h^g h + (W_h^h h_t) E^T)) h,

where E is a vector with all elements set to 1 and W_q, W_h^g, and W_h^h are learnable parameters. The query vector m_t is obtained analogously:

m_t = softmax(W_q tanh(W_t^g t + (W_t^h h_t) E^T)) t,

as is the sensitive list vector m_s. These three context vectors are then fused to obtain the context vector c_t, as shown in the following equation:

c_t = W_e [m_h, m_t, m_s],

where [·] denotes the combination of the vectors and W_e is a learnable parameter used to compute c_t. The vector c_t is then combined with h_t to predict the next word y_{t+1}. c_t is the multimodal context vector at time t; in the attention-based framework, it depends on both the encoder and the decoder, and at time t, the decoder focuses on specific areas of the text and images according to the hidden state. To improve the adaptive ability, an extended LSTM is used to obtain the visual sentinel s_t, as shown in formulas (26) and (27):

g_t = σ(W_x x_t + W_h h_{t−1}),    (26)
s_t = g_t ∘ tanh(m_t),    (27)

where σ is the sigmoid function, W_x and W_h are learnable parameters, g_t is the gate applied to the memory cell m_t, and x_t is the LSTM input at time t.
Based on the visual sentinel s_t, the multimodal context vector ĉ_t is calculated by an adaptive attention model, as shown in the following equation:

ĉ_t = θ_t s_t + (1 − θ_t) c_t,

where θ_t is the new sentinel gate at time t. When θ_t is 1, only the sentinel signal is used; when θ_t is 0, only the spatial information is used when generating the predicted word. θ_t is calculated from the attention distribution α_t over the spatial features, as shown in equations (29) and (30):

α̂_t = softmax([z_t; w^T tanh(W_s s_t + W_g h_t)]),    (29)
θ_t = α̂_t[K + 1],    (30)

where z_t denotes the spatial attention logits and θ_t is the last element of the extended distribution α̂_t. In addition, we use the encoder output ê as the embedding to initialize the input of the decoder LSTM, as shown in the following equation:

h_0 = f([ê, t_q]),

where t_q is the last state of the query LSTM in the encoder and h_0 is used as the initial state of the decoder LSTM.

As shown in Figure 6, we performed object recognition on two images and marked the objects whose recognition prediction values were greater than 50%. In Faster R-CNN, the feature maps of each layer reflect different levels of image feature information. Generally, the shallow feature maps reflect low-level features of the image, such as edges, corners, and textures, while the deep feature maps reflect high-level semantic information, such as the shape and texture of objects. These feature maps serve as inputs for the subsequent target classification and localization, helping to locate and identify targets. To better understand the information contained in each layer of feature maps, we performed a layer-by-layer feature-map output analysis of the image, as shown in Figure 7.
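The adaptive attention model described earlier in this section interpolates between the spatial context vector and the visual sentinel via the sentinel gate, i.e., a convex combination of the two. A minimal sketch with toy 2-d vectors and illustrative gate values:

```python
import numpy as np

def adaptive_context(c_t, s_t, theta_t):
    """theta_t * s_t + (1 - theta_t) * c_t: the sentinel gate decides how much
    to rely on the language-model signal s_t vs. the spatial context c_t."""
    return theta_t * s_t + (1.0 - theta_t) * c_t

c_t = np.array([1.0, 0.0])  # spatial (visual/text) context vector
s_t = np.array([0.0, 1.0])  # visual sentinel from the extended LSTM

print(adaptive_context(c_t, s_t, 0.0))  # [1. 0.] -> pure spatial context
print(adaptive_context(c_t, s_t, 1.0))  # [0. 1.] -> pure sentinel signal
```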

Text Preprocessing.
We convert all text data to lowercase, set the maximum lengths to 25 tokens for dynamic text, 30 for image descriptions, and 20 for the sensitive list, and then construct an auxiliary token vocabulary. We apply distributed word representations with default parameter settings to the preprocessed text and incorporate a pretrained GloVe model to construct the dataset vocabulary, obtaining word embedding features for each word in the dataset. One reason for choosing word embeddings over one-hot encoding is that with one-hot encoding, when the vocabulary is too large, insufficient text may lead to poor word features.
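These preprocessing steps (lowercasing, per-field truncation, and vocabulary construction with special markers) can be sketched as follows; the field names, special tokens, and example posts are illustrative assumptions, while the length limits 25/30/20 come from the text above:

```python
# Maximum token lengths per field, as stated in the text above.
MAX_LEN = {"dynamic": 25, "description": 30, "sensitive": 20}

def preprocess(text, kind):
    """Lowercase and truncate one field to its maximum token length."""
    return text.lower().split()[: MAX_LEN[kind]]

def build_vocab(token_lists, specials=("<pad>", "<unk>")):
    """Map every token (plus special markers) to an integer id."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

posts = [preprocess("Just got a NEW car", "dynamic"),
         preprocess("A red car parked outside", "description")]
vocab = build_vocab(posts)
print(vocab["car"], len(vocab))  # 6 10
```

In practice, each id would then be looked up in the pretrained GloVe table to obtain its embedding vector.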

Results and Analysis.
Our proposed model architecture consists of multiple modules. In this experiment, we compare our work with unimodal and multimodal models and evaluate the contribution of our reasoning module and multimodal spatial attention mechanism to the final prediction accuracy. We train the following comparison models on our collected real-world data and show the performance of the different comparison schemes in Table 2. The experiments show that single-modal feature analysis has limitations in outputting sensitive information, and that processing multimodal data can enhance the representation of sensitive information semantics in complex social environments and relationships. Our model performs well on online social user-sensitive information inference.

Conclusions
This paper improves the spatial attention decoder by proposing a multimodal adaptive spatial attention decoder. It combines a dual-channel multihop reasoning architecture to perform deep reasoning and prediction on users' historical sensitive data. This mechanism not only enables interaction between images and text but also allows a thorough exploration and utilization of their implicit correlations. When predicting sensitive information, by attending to the contextual information of text and images and adaptively switching attention between visual information and the language model, flexible and accurate identification of sensitive user data is achieved; in our study, good results were achieved on data the authors collected from 50 volunteers. In future work, this approach will be combined with social network access control to eliminate identified privacy items or set corresponding access rights.
In addition to the lack of privacy semantics caused by data diversity, another key challenge in protecting online social network data privacy is the dynamic nature of the data. Because the data are constantly changing, it can be difficult to ensure that privacy is maintained over time. Traditional approaches to learning from dynamic multimodal data, such as training a new model every time the data distribution changes, can be time-consuming and impractical for online applications. Therefore, online learning and incremental learning have emerged as promising real-time learning strategies for multimodal data fusion. These methods allow new knowledge to be learned from new data without losing large amounts of historical knowledge, making them well suited to the dynamics and uncertainty of online social network data. In future work, we will address the privacy protection challenges brought about by the dynamic changes of multimodal data by designing online and incremental multimodal deep learning models.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest. The bold font in Table 2 indicates the evaluation results of our model on the same dataset.