An Automatic Privacy-Aware Framework for Text Data in Online Social Network Based on a Multi-Deep Learning Model



Introduction
With the rapid development of Internet technology, a large number of social networking platforms have been established, satisfying the social needs of Internet users. A large number of Internet users have signed up on several social networking platforms through which they can share a multitude of information. In particular, most of these platforms (e.g., Facebook, Twitter, and microblogs) permit users to share their opinions, feelings, snippets of their lives, and political commentaries. At present, social networks play an important role in the daily lives of many people. In this regard, these online networks have changed the way individuals perceive the world, as people now have the convenience of communicating information directly without boundaries [1].
As of May 2022, the total number of online social networking (OSN) users has reached 5.4 billion across more than 300 OSN platforms [2]. For example, Weibo, WeChat, Twitter, Facebook, and other social networking platforms each have more than 1 billion users. Most online social platforms encourage users to express themselves because user-shared content is more attractive to others than professional content, increasing engagement on the platform [3]. Users frequently share information publicly on social networks or public online platforms, and this information commonly contains a large amount of personal information. However, the indiscriminate spread of such content online can endanger private information, consequently exposing users to many risks [4, 5]. For example, sharing travel information may allow burglars to know that you are not at home, giving them the opportunity to break in and steal your belongings. Sharing information about a new house may attract many calls from decoration companies or intermediary telephone promotions. Sharing labor remuneration may invite telephone fraud. The disclosure of such information causes endless security incidents that range from discrimination or cyberbullying to fraud and identity theft, which affect, and even threaten, the lives of Internet users [6]. Therefore, analyzing how users manage their privacy needs on social networks and identifying which information leaks privacy is of great significance in making users more aware of how to prevent privacy issues.
The leakage of privacy information on social networks has triggered considerable research. The direct approach is to protect privacy information in social networks [7]. Researchers have proposed k-anonymity-based privacy data protection technology [8, 9], data perturbation technology [10, 11], cryptography-based privacy protection technology [12], and differential privacy-based privacy data protection methods [13, 14]. The premise of these technologies is to identify privacy data in social networks and then conduct the corresponding privacy processing. The major technique is to use a privacy information scale to identify privacy information. However, this method can only identify specific categories of privacy information and offers no good protection against indirect privacy leakage. Simultaneously, some researchers have started from the dynamic characteristics of social networks to protect privacy [15]. A number of experts have proposed analyzing privacy in dynamic social networks by using privacy propagation and accumulation [16, 17], along with centralized [18] and decentralized [19] technologies for privacy protection. Other researchers have proposed the use of compressed sensing technology to protect the privacy of dynamic social networks [20]. Although these methods can protect privacy, they cannot meet the specific needs of individuals.
Studies have utilized natural language processing (NLP) technology to censor the content published by users automatically. However, such audits primarily focus on the automatic censorship of political tendencies, dirty language and hate speech [21-23], false news detection [24], and spam reviews [25]. These techniques do not involve reviewing personal and sensitive information. Notably, the content shared by network users can be crawled by third-party crawler software and analyzed to obtain the corresponding commercial value. To prevent shared content from being crawled, network users either set permissions so that the information is visible to friends only or make the information visible for only 3 days. Although these strategies limit the spread of personal shared content on the network in time and space, they do not process the sensitive information in the user's text content. Moreover, they can still cause the leakage of the user's sensitive information within a small range. Some Internet users also use automatic sanitization technology to process shared content. Nevertheless, existing automatic sanitization technologies replace sensitive terms in specific areas (e.g., medical records and criminal records) with ordinary personal sensitive terms or delete some sensitive terms. These methods exhibit a high degree of ambiguity, and the deleted information may cause poor readability of the original information [26, 27]. The use of deep learning (DL) and machine learning (ML) techniques for privacy classification or the recognition of privacy entities in the text shared by users in social networks has attracted the attention of researchers, and some models have been proposed. However, given the complexity of social network text, the variation of text length, the existence of nested privacy entities, and other problems, these proposed models cannot fully solve the aforementioned issues.
Researchers have studied self-disclosure in social networks, which is mostly done unconsciously [28, 29]. Existing research on self-disclosure behavior primarily utilizes questionnaire surveys on privacy information in specific fields. The risk of users' self-disclosed information in the corresponding field is then obtained by analyzing the questionnaire survey. This method is laborious, and the results obtained cannot be widely applied. These self-disclosure studies only provide information on whether a privacy leak occurs. Meanwhile, some studies have classified self-disclosed privacy information into several relatively broad categories, utilizing ML algorithms to predict the categories of user privacy self-disclosure [30]. Other researchers label text as sensitive or nonsensitive and then use a DL model for text classification [31]. These methods only provide qualitative knowledge of user self-disclosure but do not point to the specific locations of sensitive self-disclosed information.
Researchers have also used inference attacks to deduce private information in OSNs. A graph neural network with a graph perturbation defense has been used for privacy inference [32]. Bayesian inference and individual privacy difference rules have been adopted to deduce user privacy [33]. Adversarial training techniques, such as overlapping technology, have been applied to deduce and protect the sensitive information of users [34]. These inference techniques can only detect specific types of privacy information and cannot reason about multiple types of privacy leakage.
To address the aforementioned issues, Li et al. proposed a theoretical framework for privacy computing from the perspective of the entire lifetime of privacy information [35]. The work in the current research belongs to the privacy-aware link of privacy computing, which is the primary component of the whole theoretical framework. Our objectives are as follows: to automatically perceive the sensitive text information shared in users' social networks, to accurately locate which part of the text is leaking sensitive information, and to send these privacy data as feedback to users to improve their privacy awareness (PA). We propose a framework for automatically identifying privacy entities in social text, as shown in Figure 1. This framework is composed of two parts.
The first part is the direct privacy module, which primarily uses the named entity recognition (NER) method to extract direct privacy entities. A direct privacy entity refers to privacy information that is directly exposed in the text, including basic personal information (e.g., height, weight, ...) [36]. The second part is the indirect privacy module. In our experiment, some indirect privacy leaks are difficult to uncover, such as "I want to travel with Anlics, and I will not come back next month." With the NER model, this sentence yields no sensitive privacy information. However, it discloses the individual's personal interest in travel. The model designed in this study is primarily used to identify privacy information about interests because most information leakage on interests occurs when users inadvertently share information that can be obtained by attackers. The interest information in this study mostly includes lifestyle, design aesthetics, games, sports, variety shows, film and television, finance and economics, tourism, motherhood, animation, reading, and food. This part combines the roformerBERT model and the UniLM framework to build a user interest privacy inference (IPI) model. The IPI model not only infers which privacy information is leaked in social text but also offers information on which corresponding text causes the indirect privacy leakage.
The contributions of this study are summarized as follows: (1) We propose a PA framework for OSNs that can automatically sense sensitive text information shared by users in social networks and accurately locate which part of the text leaks sensitive information.
(2) We construct two models.

The remainder of this paper is organized as follows. Section 2 introduces the related works. Section 3 presents the PA framework composition and the key components of the model. Section 4 describes the details of the experimental design and discusses the experimental results. Section 5 concludes the study.

Related Work
This section introduces relevant research from two aspects: the traditional study of personal privacy in social networks and the perception of personal privacy based on NLP and DL.

Traditional Research on Personal Privacy in Social Networks. Personal privacy information in social networks has long been widely examined by many scholars. Researchers have proposed a variety of different methods that can be generalized into two major research directions.
The first direction is privacy information measurement in social networks [37, 38], while the second is privacy protection in social networks [39]. In social network privacy information measurement, Buchanan et al. used a questionnaire to compute a privacy scale with multiple dimensions; a reliable and effective social network privacy measurement was eventually obtained by verifying the validity of differential data [40]. Srivastava and Geethakumari surveyed the possible privacy leakage problem in the network world, computed users' privacy coefficients, and proposed an unstructured privacy measurement model to measure the degree of privacy information leakage in the text data published by users [41]. These privacy measures are comparatively simple, biased toward specific research, and difficult to adapt to the current complex network environment. Serfontein et al. utilized a self-organizing map to recognize possible risks in networks [42]. Alsarkal et al. quantified the degree of privacy disclosure that might lead to co-disclosure among friends. By researching the differences between self-disclosure and co-disclosure on various privacy disclosures, users can utilize different protection strategies for various privacy sources [43]. Shi et al. used the static network structure entropy of a complex network structure to measure privacy. Defined as the privacy measurement index (PMI), it measures the privacy protection ability of a graph structure. Finally, they used the PMI to design a graph privacy protection classification scheme [44]. This scheme considers users' friends and privacy leakage measures [8]. Nevertheless, if a user has many friends, then analyzing each friend is inefficient and affects the final privacy measures. At the same time, these measurements simply quantify privacy data in social networks, allowing users to know how much privacy they are leaking. However, users do not know which data have been leaked by their friends, and they cannot take the initiative to protect against privacy leakage.
In the case of research on the protection of social information privacy, researchers have proposed privacy data protection technology based on k-anonymity, data perturbation technology, cryptography-based privacy protection technology, and differential privacy-based data protection methods. Privacy data protection technology based on k-anonymity mostly applies the k-anonymity model to social networks to generalize and hide privacy data [10]. However, this technique does not satisfy the diversity of privacy properties. Data perturbation technology is largely based on the idea of data randomization evolution, which uses data randomization to encrypt sensitive information [12, 45, 46]. Privacy protection technology based on cryptography provides differential social network privacy protection [13, 47, 48] and homomorphic encryption of network data [49]. The premise of these technologies is to recognize the privacy of data in social networks and then conduct analogous privacy processing [50]. Most of these technologies identify privacy information through a privacy information measurement scale. However, this method can only recognize specific categories of privacy information and cannot exert a good protection effect on indirect privacy leakage.

Perception of Personal Privacy Based on NLP and DL.
With the rise of NLP technology and DL, many scholars have utilized these technologies to analyze privacy data, and various models and methods have been proposed.
Vasalou et al. proposed the concept of a privacy dictionary and designed this dictionary by utilizing NLP technology, traditional privacy theory methods, and prototype theory. Their objective was to help researchers with the automated content analysis of texts, which is a valuable addition to the tools available for privacy research [51]. Gill et al. modified the privacy dictionary proposed by Alastair and used corpus linguistics to construct and validate eight dictionary categories from empirical materials within a wide range of privacy-sensitive contexts. The generated dictionary, combined with LIWC software, can quickly recognize privacy information in text. Although this privacy dictionary approach can provide high precision, it has poor recall because it relies only on the count of sensitive words in a document, regardless of the context in which the words are used [52].
Xu et al. constructed a text-sensitive content detection model by utilizing Text-CNN, a convolutional neural network (CNN). Compared with recurrent neural networks, Text-CNN can process multiple filters in parallel while ensuring the same detection effect. In addition, the training time of the model is lower, and detection is faster when using Text-CNN [53]. Mehdy et al. used NLP to process text and obtain text features, such as linguistic labels, syntactic dependencies, entity relations, and other features. Then, a CNN model was trained with the obtained text features. The trained model can recognize whether a text carries a privacy leakage risk. Their proposed method is essentially a binary classification model that recognizes whether text has privacy leakage [54]. These methods use the corresponding technology to obtain text features to train the corresponding prediction model and eventually apply the trained model to the privacy perception of text. Nevertheless, these methods do not adequately consider the context characteristics of text data and exhibit poor interpretability in prediction. Users only know that a privacy leakage exists; they do not know which privacy leakage occurs.
Li et al. employed the NER model (BI_LSTM-CRF) to identify privacy entities on Twitter. They divided privacy into four parts, and F1 finally reached 84% [55]. Wu et al. used DL and ontology models to identify privacy information on Twitter. They also classified four privacy entities and used a privacy ontology model to subdivide the privacy. However, prediction accuracy was not sufficient; the recognition accuracy values of event and trait were only 64% and 76%, respectively [56]. Li et al. utilized a graph convolutional network (GCN) to measure the privacy leakage of microblog users. Their method can effectively extract the privacy measure of users in social networks [57]. These research strategies can recognize privacy data. However, the semantics of social networks is complex, and privacy data must be combined with specific entities before privacy leakage occurs.

Methodology
This section presents the design of the privacy-aware (PA) framework for social networks and the key components of both the DPER and IPI models. Before introducing the models, we summarize the main notations in Table 1 to aid understanding of the following model calculations.

Privacy-Aware (PA) Framework Architecture. To better solve the problem of OSN privacy perception, we designed a new PA framework. This framework consists of the DPER and IPI models. Specifically, it is composed of the GP algorithm, the BI_LSTM model, the roformerBERT model, and the UniLM framework. The overall flowchart is presented in Figure 1. The feature representation of the PA framework is provided in formula (1), where we define X_in and H_PA as the input and output features, respectively, of the PA framework.
where X_in is a feature processed with embedding and [:] represents a concatenation operation. g_RFB represents the roformerBERT model operation, which contains rotational position encoding operations. g_BL represents the BI_LSTM model operation, which is primarily used to extract the sequence feature information of sentences. g_GP represents the GP algorithm operation. g_U represents the actions processed by the UniLM framework. g_E and g_D represent the encoding and decoding operations, respectively. Moreover, the activation function used in each DL model is the ReLU function, and the final output layer of each model is processed using the softmax function.
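The two-branch composition above can be sketched with placeholder functions. This is only a shape-level illustration of how the DPER branch (g_RFB, g_BL, g_GP) and the IPI branch (g_RFB, g_U) could feed the [:] concatenation; all function bodies and dimensions here are stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def g_rfb(x):
    # roformerBERT encoder stub: project token embeddings into a 64-dim context space
    return x @ rng.standard_normal((x.shape[-1], 64))

def g_bl(x):
    # BiLSTM stub: concatenate "forward" and "backward" views of the sequence
    return np.concatenate([x, x[::-1]], axis=-1)

def g_gp(x):
    # GlobalPointer stub: reduce span scores to a fixed-size feature vector
    return x.mean(axis=0)

def g_u(x):
    # UniLM stub: summarize the sequence for generation
    return x.max(axis=0)

x_in = rng.standard_normal((10, 32))       # L = 10 tokens, 32-dim embeddings
h_dper = g_gp(g_bl(g_rfb(x_in)))           # direct-privacy branch
h_ipi = g_u(g_rfb(x_in))                   # indirect-privacy branch
h_pa = np.concatenate([h_dper, h_ipi])     # "[:]" concatenation in formula (1)
```

The point of the sketch is only the data flow: both branches consume the same embedded input X_in, and H_PA joins their outputs.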

DPER Model
In particular, the roformerBERT pretraining model is used to train the model, which can not only learn text features deeply but also better handle the imbalanced distribution of privacy entities. The BI_LSTM model is primarily used to extract the sequence features of sentences. The GP algorithm is used to solve the nested problem of privacy entities.

Example output of the DPER model (the user privacy that social networks may leak): "There are leaks of basic personal information (birthday) and location (the Palace Museum)."

Step 1. The vocabulary file Vocab.txt from the BERT pretraining model, which maps each word to a word number, is used to convert the words in the input text into the corresponding numbers. Then, a tokenization operation is conducted to obtain the position embedding vector and the text embedding vector.
Step 2. The converted text-embedding and position-embedding vectors are fused by the embedding layer to obtain the feature expression of the text data. This feature expression fits the input of the BERT pretrained model.
Step 3. The embedding-processed feature vectors are reencoded using the rotational encoding algorithm. The purpose of reencoding is to change absolute position coding to relative position coding, which increases the input length the model can process.
Step 4. The transformed encoding vectors are processed by the BERT model to obtain the feature expression of the text data. A (1 × L) number vector is learned through the BERT model to obtain a (348 × L) matrix, which better represents the hidden features in the text.
Step 5. The data processed by the BERT model are imported into the BI_LSTM model, which yields data with sequence feature information.
Step 6. The data obtained in the fifth step are passed to the GP layer for calculation. The GP layer divides the input tensor into five matrix outputs. The final privacy entity category of the output is determined by performing the calculation on each matrix.
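Steps 1 and 2 above (vocabulary lookup, then fusing token and position embeddings) can be sketched as follows. The vocabulary entries and dimensions are toy values for illustration only; the real model uses the full Vocab.txt file and the pretrained embedding tables.

```python
import numpy as np

# Toy slice of Vocab.txt: word -> word number (hypothetical ids)
vocab = {"[CLS]": 101, "[SEP]": 102, "我": 2769, "在": 1762, "故": 3125, "宫": 2151}
text = ["我", "在", "故", "宫"]

# Step 1: map characters to ids and build position indices
token_ids = [vocab["[CLS]"]] + [vocab[c] for c in text] + [vocab["[SEP]"]]
positions = list(range(len(token_ids)))

# Step 2: fuse token and position embeddings into the model input (L, d)
d = 16
rng = np.random.default_rng(0)
tok_table = rng.standard_normal((30000, d))   # stand-in for the token embedding table
pos_table = rng.standard_normal((512, d))     # stand-in for the position embedding table
x = tok_table[token_ids] + pos_table[positions]
```

Steps 3-6 then pass `x` through the rotary reencoding, the BERT encoder, the BI_LSTM layer, and the GP layer.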

BERT Pretraining Model.
The BERT model adopts the encoder unit of the multi-layer transformer, enabling the multi-layer encoder to learn general knowledge through pretraining tasks and to transfer the model to complete downstream tasks. The BERT model structure is mainly composed of multiple embedding layers, as shown in Figure 3. The embedding layer of BERT consists of three parts: segment embeddings, position embeddings, and token embeddings. The token embedding layer is a normal embedding layer. The segment embedding layer handles the classification task of input sentence pairs. The position embedding layer encodes the position of words in a sentence. Overall, the BERT model is a combination of multiple embedding layers and attention mechanisms.

The rotary position encoding used in roformerBERT proceeds as follows. (1) Absolute position information is added to q and k through the operation in formula (3), where f is an operation indicating that q and k carry the absolute position information m and n after f is applied; m and n denote absolute positions. (2) By using the inner-product calculation of the attention mechanism and the conjugate calculation of complex numbers, the inner product is transformed into a form that depends only on the relative position m − n [43, 58]. In this manner, absolute and relative positions are skillfully fused together, as shown in formula (4), where Re[·] denotes taking the real part of the result, e^(imθ) and e^(inθ) add imaginary parts to q_m and k_n for calculation, respectively, and * denotes the conjugate.
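The relative-position property described in point (2) can be checked numerically. The sketch below implements the standard RoPE rotation (the per-pair frequencies with base 10000 follow the common RoPE formulation and are an assumption here, not taken from the paper) and verifies that the q·k inner product depends only on m − n.

```python
import numpy as np

def rope(x, pos):
    """Rotate consecutive dimension pairs of x by angles theta_i * pos (RoPE)."""
    d = x.shape[-1]
    theta = 10000.0 ** (-np.arange(0, d, 2) / d)  # one frequency per dimension pair
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(1)
q, k = rng.standard_normal(8), rng.standard_normal(8)

# Same relative distance m - n = 3 at two different absolute positions:
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 10) @ rope(k, 7)
# a and b agree (up to float error), so the attention score is relative-position only
```

Because rotations preserve inner products up to the angle difference, the two scores coincide, which is exactly the fusion of absolute and relative position the text describes.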

GP Algorithm.
The GP algorithm uses global normalization ideas to conduct entity recognition. It can recognize nested and nonnested entities without distinction [59]. The GP algorithm works better than the conditional random field (CRF) in nonnested (flat NER) cases and also yields better results in nested (nested NER) cases. The specific algorithm idea is presented in Algorithm 1. The mathematical expression of the GP algorithm is provided as formula (5). f_QDense represents the fully connected operation for computing the query matrix; formula (6) is its calculation procedure. f_KDense represents the fully connected operation for computing the key matrix; formula (7) is its calculation procedure. f_Rope represents the RoPE rotary encoding operation; formula (8) is its calculation procedure. g_BL represents the BI_LSTM model operation.
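The span-scoring idea behind the GP algorithm can be sketched as follows. For one entity type, every candidate span (i, j) with i ≤ j gets a score q_i · k_j, and spans with positive scores are decoded as entities; this handles nested spans naturally because every (i, j) pair is scored independently. The sketch omits the RoPE step (formula (8)) and multiple entity types for brevity, and the weights are random stand-ins.

```python
import numpy as np

def global_pointer_scores(h, wq, wk):
    """Score every (start, end) span for one entity type: s[i, j] = q_i . k_j."""
    q, k = h @ wq, h @ wk      # f_QDense and f_KDense as plain linear maps
    s = q @ k.T                # (L, L) span-score matrix
    return np.triu(s)          # keep only spans with start <= end

rng = np.random.default_rng(2)
L, d, dh = 6, 16, 8
h = rng.standard_normal((L, d))   # stand-in for the BI_LSTM output g_BL
s = global_pointer_scores(h,
                          rng.standard_normal((d, dh)),
                          rng.standard_normal((d, dh)))

# Decode: any upper-triangular cell above the threshold 0 is an entity span
spans = [(i, j) for i in range(L) for j in range(i, L) if s[i, j] > 0]
```

In the real model there is one such (L, L) matrix per privacy category (the five matrix outputs of Step 6), and the threshold 0 matches the loss function described next.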
3.2.4. DPER Model Loss Function. Because there are too many nonentity data in the privacy entity identification dataset, the long-tail phenomenon exists in the dataset. In this study, formula (9) is used to calculate the loss. This calculation method improves the multi-label cross-entropy loss function such that the scores of the privacy tag classes are higher than those of the nonprivacy tag classes [59, 60]. The inferential details of the formula are described in Appendix A.
where P is the set of privacy classes, N is the set of nonentity classes, and S_i, S_j represent the category scores.

IPI Model. The IPI model adopts the seq2seq mode for privacy inference. g_D and g_E represent the decoding and encoding operations, respectively. g_BS indicates that the data are processed using the beam search algorithm. g_RFB represents the roformerBERT model operation that contains rotational position encoding operations. H_IPI represents the output of the IPI model. Moreover, the model adopts the softmax function for normalization before token generation. The loss function used in this model is still the traditional cross-entropy loss function.
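The improved multi-label cross-entropy loss of the DPER model (formula (9)) follows the pattern log(1 + Σ_{i∈N} e^{S_i}) + log(1 + Σ_{j∈P} e^{−S_j}) from [59, 60], which pushes every privacy-class score above 0 and every nonentity score below 0 without being swamped by the long tail of negatives. A sketch under that reading of the formula:

```python
import numpy as np

def span_loss(scores_pos, scores_neg):
    """Multi-label CE pushing privacy-span scores above 0 and nonentity scores below 0."""
    loss_pos = np.log1p(np.sum(np.exp(-np.asarray(scores_pos))))  # log(1 + sum e^{-S_j}), j in P
    loss_neg = np.log1p(np.sum(np.exp(np.asarray(scores_neg))))   # log(1 + sum e^{S_i}),  i in N
    return loss_pos + loss_neg

# Well-separated scores give a small loss; confused scores give a large one
well_separated = span_loss([6.0, 5.0], [-6.0, -5.0])
confused = span_loss([-1.0, 0.0], [1.0, 0.5])
```

Because only the relative position of scores around 0 matters, the decoding threshold in the GP layer can simply be 0, regardless of how many nonentity spans exist.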
In the IPI model, we employ the roformerBERT model to extract text features. The UniLM framework is utilized to address BERT's inability to generate text, enabling the completion of unidirectional, sequence-to-sequence, and bidirectional prediction tasks while integrating the advantages of autoregressive and autoencoder language models [61].
The specific steps of the IPI model are as follows.
Input: the attention-head number heads, the size of each head head_size, and the input data inputs.

Output: a tensor of shape (inputs.shape[0], heads, inputs.shape[1], inputs.shape[1]).

Step 1. Word segmentation is performed on the input text data. The token dictionary adopts the word vectors from roformerBERT pretraining, refining the characteristics of the input text more precisely. This model uses the Jieba word segmentation technique.
Step 2. The word encoding vector is obtained by converting the segmented text using the dictionary file Vocab.txt, which maps words to word numbers. This encoding conversion generates both position encodings and word encodings.
Step 3. The encoding vector generated in the previous step is input into the roformerBERT + UniLM model for encoding and data generation. One token is output at a time.
Step 4. The beam search algorithm is used for text decoding (as shown in Algorithm 2).
In Step 3 of the loop, the first n maximum-scored tokens of each output are selected. Each selected token is scored together with the previously generated token sequence, and the sequence with the highest final score enters the next cycle.
Step 5. Determine whether the output contains an end-character flag or whether the output string exceeds the predefined maximum length. Once either condition occurs during the loop, the output token sequence is converted into text output.
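Steps 4 and 5 above can be sketched as a generic beam search over a token-scoring function. The toy model and token ids below are illustrative only; the real decoder scores tokens with roformerBERT + UniLM.

```python
import math

def beam_search(step_logprobs, beam=3, max_len=5, eos=0):
    """step_logprobs(prefix) -> {token: logprob}; keep the n best prefixes per step."""
    beams = [([], 0.0)]
    for _ in range(max_len):
        cands = []
        for seq, score in beams:
            if seq and seq[-1] == eos:
                cands.append((seq, score))  # finished sequences carry over unchanged
                continue
            for tok, lp in step_logprobs(seq).items():
                cands.append((seq + [tok], score + lp))
        beams = sorted(cands, key=lambda c: c[1], reverse=True)[:beam]
        if all(s and s[-1] == eos for s, _ in beams):
            break  # Step 5: every beam has emitted the end flag
    return beams[0][0]

# Toy next-token distribution keyed on the last emitted token (0 is the end flag)
def toy(seq):
    table = {0: {1: math.log(0.9), 2: math.log(0.1)},
             1: {2: math.log(0.8), 0: math.log(0.2)},
             2: {0: math.log(0.95), 1: math.log(0.05)}}
    return table[seq[-1] if seq else 0]

out = beam_search(toy, beam=2)
```

The loop mirrors the paper's description: each cycle extends every surviving prefix with its top candidates, re-ranks the combined scores, and stops on the end flag or the length cap.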

UniLM Model.
The UniLM framework is composed of multi-layer transformer networks, where the core is a BERT model. By converting the BERT model, the three tasks of bidirectional LM, left-to-right LM, and seq-to-seq LM can be completed simultaneously. Figure 6 shows the structural diagram of the UniLM framework [61]. In this study, the core network consists of 12 or 6 transformer layers. First, the input vector is obtained through embedding; then, it is sent to the 12-layer or 6-layer transformer network. The coding output of each layer is shown in formula (11), where H_l represents the l-layer output.
Each layer controls the range of attention of each word by the mask matrix M. If an element of M has a value of zero, then attention is allowed; otherwise, attention is prevented, and the corresponding feature is masked. Formula (13) is the calculation method of the mask matrix M. For the l-layer transformer, the output of the self-attention head A_l is calculated as shown in formula (14). Formula (12) is used to calculate the Q, K, and V matrices.
M_{i,j} = 0 (allow to attend) or −∞ (prevent from attending).

In the IPI model, we choose the seq2seq LM mode for the inference of interest privacy and the generation of the corresponding interpretable text. The seq2seq LM mode is a combination of the bidirectional LM and the left-to-right LM. Specifically, we define the input statement as X = (x_1, x_2, ..., x_n) and the output statement as Y = (y_1, y_2, ..., y_n). During model calculation, the bidirectional LM operation is performed on X, while the left-to-right LM operation is performed on Y. The calculation formula for the bidirectional LM operation is presented in (15), where the dimensions of H are the same as those of the input X; g_bert represents a BERT operation, and g_embedding indicates that the X vector (matrix) is embedded after linear changes. The left-to-right LM operation (16) generates Y unidirectionally on the basis of the feature vector H given by the bidirectional LM operation. Formula (16) is mainly intended to calculate P(·); the sequence Y is generated in the case of the maximum value. By using the seq2seq LM mode, we can finally predict the interest attribute of the text and extract which data in the input text support this prediction.

Given that multiple privacy entities may be involved in one piece of data, the sum of the entity numbers of each privacy item in Table 2 is greater than the overall number of dataset entities. Figure 7 shows the specific number of each privacy entity item: 2,128 privacy data items for LOC, 1,502 for BI, 7,036 for JOB, 2,883 for EDU, and 4,002 for COM.
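The seq2seq LM mask of formula (13) can be sketched concretely: source positions (X) attend bidirectionally to all of X, while target positions (Y) attend to X and only to earlier target tokens. The sketch uses a large negative constant in place of −∞, a common implementation convenience and an assumption here.

```python
import numpy as np

NEG_INF = -1e9  # stands in for the -inf of formula (13)

def seq2seq_mask(src_len, tgt_len):
    """M[i, j] = 0 -> position i may attend to j; NEG_INF -> attention is prevented."""
    n = src_len + tgt_len
    m = np.full((n, n), NEG_INF)
    m[:, :src_len] = 0.0                 # every position sees all of X (bidirectional LM)
    for i in range(src_len, n):          # Y positions see themselves only left-to-right
        m[i, src_len:i + 1] = 0.0
    return m

m = seq2seq_mask(3, 2)  # 3 source tokens, 2 target tokens -> a 5 x 5 mask
```

Adding this matrix to the attention logits before the softmax (formula (14)) zeroes out the forbidden attention weights, which is how one BERT backbone serves both LM directions at once.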

Interest Privacy Inference Dataset.
In this work, 4,694 interest data items were retrieved using data crawler technology from the interest region of Sina Weibo. These data cover 12 interest categories, and the distribution of each category is shown in Figure 8. Among them, 460 belong to the lifestyle category, 394 to design aesthetics, 391 to games, 536 to sports, 280 to variety shows, 280 to film and television, 382 to finance, 346 to tourism, 260 to mother-and-child, 409 to animation, 442 to reading, and 514 to food. For each type, we mark its corresponding recognition basis.

Metrics.
The evaluation indexes used in this study are precision (P), calculated using formula (18); recall (R), calculated using formula (19); F1, calculated using formula (20); and accuracy (ACC), calculated using formula (21). In these formulas, the true positives (TP_j) are the positive events that are correctly predicted, the true negatives (TN_j) are the negative events that are correctly predicted, the false positives (FP_j) are the negative events that are incorrectly predicted to be positive, and the false negatives (FN_j) are the positive events that are incorrectly predicted to be negative; j represents the corresponding category. In this study, the generative text algorithm is adopted in the IPI model. Therefore, the popular recall-oriented understudy for gisting evaluation (ROUGE) measure is used to test the effect of the generated text. ROUGE was presented in 2004 by Chin-Yew Lin. It is a set of metrics for evaluating automatic summarization and machine translation tasks [62]. The main ROUGE metrics are rouge-1 (formula (22)), rouge-2 (formula (23)), rouge-L (formula (24)), and main (formula (25)) [63].
The denominator in the rouge-1 and rouge-2 formulas is the number of n-grams in the standard (reference) text, and the numerator is the number of n-grams shared by the model-generated text and the standard text. In the formulas, gram_1 means 1-gram and gram_2 means 2-gram. In the rouge-L formula, LCS(X, Y) denotes the length of the longest common subsequence of sequences X and Y, where X represents the standard text, Y represents the model-generated text, m and n denote the lengths of X and Y, and β is a regulator. The main index is a weighted sum of the three aforementioned indexes.

During training, each cycle cuts the overall data into 620 pieces, and all trained models undergo 18,600 training sessions. A validation test is performed at the end of each cycle, and the results are shown in Figure 11. When the learning rate of the model is 1e − 5, the F1 value of the validation set is the highest, and the model achieves the best result.
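The ROUGE computations described above can be sketched in plain Python. This is a minimal illustration of rouge-n and rouge-L as defined in the text (reference-count denominator, LCS-based F with regulator β), not the evaluation code used in the experiments; the β value below is an arbitrary example.

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(ref, hyp, n):
    """Overlapping n-grams divided by the n-gram count of the standard (reference) text."""
    ref_ngrams = ngrams(ref, n)
    hyp_ngrams = set(ngrams(hyp, n))
    hit = sum(1 for g in ref_ngrams if g in hyp_ngrams)
    return hit / len(ref_ngrams)

def lcs_len(x, y):
    # Classic dynamic-programming longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x):
        for j, yj in enumerate(y):
            dp[i + 1][j + 1] = dp[i][j] + 1 if xi == yj else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(ref, hyp, beta=1.2):
    """LCS-based F-measure with regulator beta weighting recall over precision."""
    lcs = lcs_len(ref, hyp)
    r, p = lcs / len(ref), lcs / len(hyp)
    return (1 + beta ** 2) * r * p / (r + beta ** 2 * p) if r and p else 0.0

ref = "user shares travel plans".split()   # standard-generated text X
hyp = "user shares travel photos".split()  # model-generated text Y
r1 = rouge_n(ref, hyp, 1)                  # 3 of 4 reference unigrams overlap
```

The main index would then be computed as a weighted sum of rouge-1, rouge-2, and rouge-L, with weights per formula (25).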

Selection of Hyperparameters of the IPI Model.
The IPI model is constructed on the principle of the seq2seq model, using roformerBERT + UniLM. The major hyperparameters in the structure are the same: batch size, cycle number, and learning rate. The batch size and cycle number are 8 and 50, respectively. The learning rate plays a decisive role in the final effect of the model and is screened from 1e − 1 to 1e − 10. In the experiment, the loss value is 0 when the learning rate is greater than or equal to 1e − 3 or less than or equal to 1e − 7. Therefore, we demonstrate the training situation from 1e − 4 to 1e − 6, and the training results for each learning rate are presented in Figure 12. Given that the seq2seq model structure is used, the encoder is trained during training. To measure the effect of each text generation, we test it with the training set, and the quality of the generated text is measured using the ROUGE detection method. As shown in Figure 12, 1e − 4 achieves the best results on all four indicators, with the main index reaching 97.63%, rouge-1 reaching 98.36%, rouge-2 reaching 96.55%, and rouge-L reaching 98.36%.

Tables 3 and 4 present the prediction effects of the DPER model, which is evaluated using four indicators: ACC, F1, P, and R. From the overall performance on the test set, ACC reached 91.80%, F1 was 93.74%, P was 97.41%, and R was 90.33%. In terms of the recognition of each privacy item, the BI and EDU privacy entities are not recognized as effectively as the four other privacy items. The primary reason is that the sample space of the privacy entities marked BI and EDU is relatively large and their regularity is relatively complex, leading to imperfect feature information learned by the model in this respect.

Effect of the IPI Model.
A total of 1,200 interest test texts are collected to test the IPI model. Figure 13 shows the inference accuracy for each interest. The lowest accuracy of the model reaches 96% in interest inference, and some interest inferences reach 100%. This outcome shows that the IPI model designed in this study is feasible for IPI.

Comparison of BERT Models per Version.
With the extensive use of the BERT pretraining model in DL, various versions have also been produced. The 12-layer BERT, 6-layer BERT, 12-layer roformerBERT, and 6-layer roformerBERT models are compared. The model proposed here will eventually run on user clients, but some clients do not have sufficient memory; thus, a relatively small 6-layer model is also trained. The basic BERT model can only handle 512 characters, whereas the roformerBERT model extends the processable length via rotary encoding. Table 5 provides a comparison of each pretrained model within the privacy entity recognition model. The model built on the roformerBERT (12) pretrained model achieves the best effect: in the test, F1 reaches 95.74%, P reaches 98.21%, and R reaches 92.53%. The roformerBERT pretrained model also exerts a greater effect than the BERT pretrained model because roformerBERT uses rotary encoding and a data dictionary combining characters with words during token conversion. Figure 14 illustrates the recognition of each privacy entity for each version and shows that the roformerBERT pretrained model outperforms the BERT pretrained model overall. In some indicators, however, the BERT (12) pretrained model still performs better than the roformerBERT (6) pretrained model. A possible reason is that the unbalanced distribution of privacy entities in the sample data leads to differences in how well the model learns each privacy entity.
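Rotary position encoding, the mechanism that lets roformerBERT move beyond BERT's fixed 512-character window, can be sketched as follows. This is a minimal NumPy illustration assuming the standard base of 10000; the function name is ours, and the actual roformerBERT implementation differs in detail (it applies the rotation inside attention to queries and keys).

```python
import numpy as np


def rotary_position_embedding(x):
    """Apply rotary position encoding to x of shape (seq_len, dim), dim even.

    Each channel pair (2i, 2i+1) is rotated by pos * theta_i with
    theta_i = 10000^(-2i/dim).  Because a relative offset between two
    positions becomes a relative rotation, the encoding is not tied to
    a fixed maximum sequence length.
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0
    pos = np.arange(seq_len)[:, None]                 # (L, 1)
    theta = 10000.0 ** (-np.arange(0, dim, 2) / dim)  # (dim/2,)
    angles = pos * theta                              # (L, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```

Because each pair is only rotated, vector norms are preserved, and the dot product of two encoded vectors depends only on their relative offset.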
In the DPER model, we have shown that the roformerBERT pretrained model outperforms the basic BERT model. Therefore, this study only uses the 6-layer and 12-layer roformerBERT models in the IPI model. Specific comparisons are shown in Figure 15. The blue lines in the figure denote the privacy inference test performance of the 6-layer roformerBERT pretrained model, and the yellow lines that of the 12-layer roformerBERT pretrained model. Overall, the results of the two pretrained models for IPI are nearly the same: the main index reaches more than 97%, the rouge-1 index more than 98%, the rouge-2 index more than 96%, and the rouge-L index more than 98%. However, the 12-layer roformerBERT model fluctuates more during training, probably because its overall parameter count is relatively large while the amount of data we input is relatively small, resulting in large fluctuations during learning.

Comparison with Other Models.
The privacy entity recognition model in this study is designed on the basis of entity recognition principles, so this research makes a comparative analysis against several popular entity recognition models: the BI_LSTM-CRF model [64], BERT-CRF model [65], ALBERT-BI_LSTM-CRF model [66], EN2_BI_LSTM-CRF model [67], and ALBERT-MogAtt_BI_LSTM-CRF model [68]. P, R, and F1 are compared, and the comparison results are provided in Table 6 and Figure 16. Table 6 indicates that the performance indexes of our model are higher than those of all five baseline models. All these baselines use CRF for the final entity output. Although CRF achieves good results in many domains, the composition of privacy entities is complex, and the data contain many nested entities; thus, CRF output is unfavorable for privacy entities. The method developed in the current research deals with these problems effectively, achieving good results in privacy entity recognition. Figure 16 illustrates the recognition of each privacy entity per model. In terms of F1, the current model is about 10% higher than the BI_LSTM-CRF model, about 4% higher than the BERT-CRF model, about 8% higher than the ALBERT-BI_LSTM-CRF model, about 5% higher than the EN2_BI_LSTM-CRF model, and about 2% higher than the ALBERT-MogAtt_BI_LSTM-CRF model. This outcome indicates that the privacy entity recognition model proposed in the current research is feasible. This study also compares the proposed IPI model with the currently popular interest recognition models, namely, the char2vec + CNN and word2vec + CNN models [38], as shown in Figure 17. The figure indicates that the interest inference effect of the proposed model is comparable to those of the char2vec + CNN and word2vec + CNN models. Moreover, the inference effect of
each interest item exhibits its own advantages and disadvantages. However, our model can provide interpretability for a given interest inference, i.e., it can output the information that supports the inference result. Therefore, our IPI model is also better than existing models.

For each model, we design a large version and a small version to meet the deployment requirements of different hardware platforms, as summarized in Table 7. The large version uses a 12-layer roformerBERT model, while the small version uses a 6-layer roformerBERT model. As Table 7 shows, the DPER model has 124 M and 30 M parameters, with test times of 2.31 s and 1.17 s for the same sentence, respectively. The IPI model has 102 M and 19 M parameters, with test times of 6.72 s and 3.05 s for the same sentence, respectively. The IPI model requires a longer test time, primarily because generating text is time-consuming; however, it is still within the allowable range.

(2) Time Complexity of the roformerBERT Model. The roformerBERT model is composed of an embedding layer, a position encoding layer, an attention layer, a dense layer, an add layer, a norm layer, and a feedforward fully connected layer. The time complexity of the embedding layer is O(E in L,V * E out V,H).

Model Complexity
(5) Time Complexity of the IPI Model. The IPI model is designed in accordance with the seq2seq mode; thus, its time complexity is composed of the complexities of the encoder and decoder modules, which are given below.
4.8. Discussion. The DPER model is proposed to solve the problem that traditional social network self-disclosure privacy identification models can only provide the type of privacy leakage, not the corresponding location of the leakage. Table 8 shows the difference between our method and traditional self-disclosure privacy recognition models for social networks. As indicated in the table, our model enables users to understand more directly which words are revealing their privacy.
The DPER model uses NER to extract privacy, but nested privacy entities cannot be extracted by the traditional CRF-based NER method. Therefore, the GP algorithm is used to extract nested privacy. As an example, consider the following statement: "imperceptibly come to Hainan Qianfan Culture Media Co. Ltd. has been more than a year." This sentence has nested address privacy: "Hainan" is an address, while "Hainan Qianfan Culture Media Co. Ltd." is a company. Named entity identification based on CRF directly identifies the company entity "Hainan Qianfan Culture Media Co. Ltd." but cannot identify the address entity "Hainan". The specific model prediction results are shown in Table 9, where B_COM represents the beginning of a company entity, I_COM the middle or end of a company entity, and O a non-entity. Table 9 indicates that the CRF-based models do not predict the "Hainan" entity. Our model differs from the traditional BIO output in that it outputs a coordinate (i, j), where i is the start position of an entity and j its end position. In Table 9, (6, 10) represents a company entity spanning the sixth to the tenth positions of the sentence. In Table 9, BLC, BC, ABLC, EBLCF, and ABLMBLC refer to the BI_LSTM-CRF [64], BERT-CRF [65], ALBERT-BI_LSTM-CRF [66], En2_BI_LSTM-CRF [67], and ALBERT-MogAtt_BI_LSTM-CRF [68] models, respectively.
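The coordinate-style (i, j) output described above can be illustrated with a toy decoder. This sketch assumes a GlobalPointer-style score tensor and a decision threshold of 0; the names and shapes are our own, not the paper's implementation.

```python
def decode_spans(scores, threshold=0.0):
    """Decode entity spans from a GlobalPointer-style score tensor.

    `scores[t][i][j]` is the model's score that tokens i..j (inclusive)
    form an entity of type t.  Unlike CRF/BIO decoding, every (i, j)
    cell is scored independently, so nested spans -- e.g. an address
    inside a company name -- can both be emitted.
    """
    spans = []
    for t, mat in enumerate(scores):
        for i, row in enumerate(mat):
            for j in range(i, len(row)):  # only upper triangle: i <= j
                if row[j] > threshold:
                    spans.append((t, i, j))
    return spans
```

In the "Hainan Qianfan Culture Media Co. Ltd." example, the cell for the address span and the cell for the enclosing company span both clear the threshold, so both entities are returned, which BIO tagging cannot express.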
The accuracy of extracting interest privacy with the NER model is low, possibly because of the variety of interest expressions in text. Although many interest recognition models can identify interests, these algorithms cannot provide the interpretability that supports the identified interest. To solve this problem, we propose the IPI model for detecting interest. This model adopts the seq2seq design and the popular UniLM framework to transform the BERT model into a generative model that generates explanatory text supporting the interest present in the source text. As an example, consider the following statement: "I made braised pig trotters at home, which are easy to make, with a chewy and soft texture and a delicious spicy and savory taste. They are full of collagen, so delicious!" A traditional interest classification algorithm will only indicate an interest in food. The IPI model indicates an interest in food and also provides the corresponding text that supports this judgment. The specific output is shown in Table 10.
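The decoding side of such a seq2seq model relies on beam search; a minimal, framework-free sketch is shown below. The `step_fn` interface and all names are our own simplification: in the real IPI model, next-token log-probabilities come from the roformerBERT + UniLM network, and the copy operation is omitted here.

```python
def beam_search(step_fn, bos, eos, beam_size=3, max_len=20):
    """Minimal beam search over log-probabilities.

    `step_fn(prefix)` returns {token: log_prob} for the next token given
    the prefix.  Keeps the `beam_size` highest-scoring partial sequences
    at every step and returns the best sequence ending in `eos`.
    """
    beams = [([bos], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            if prefix[-1] == eos:          # completed hypothesis
                finished.append((prefix, score))
                continue
            for tok, lp in step_fn(prefix).items():
                candidates.append((prefix + [tok], score + lp))
        if not candidates:
            break
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    finished.extend(b for b in beams if b[0][-1] == eos)
    best = max(finished or beams, key=lambda c: c[1])
    return best[0]
```

Any callable that scores next tokens can stand in for the network, which makes the decoding loop easy to test in isolation.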
In summary, our proposed PA framework is fundamentally different from traditional privacy perception models. Our model can not only provide accurate types of privacy leakage but also offer a specific text description that supports each identified leakage.

Conclusion
This study proposes a PA framework for social networks that can automatically sense sensitive text information shared by users. It accurately locates which part of a text is leaking sensitive information and sends these privacy data as feedback to users to enhance their PA. The framework consists of two parts. The first part is the direct privacy module, which uses named entity recognition to extract direct privacy entities. In this module, we combine the roformerBERT model, the BI_LSTM model, and the GP algorithm to train the DPER model, which can not only identify private information in social text but also provide its location. On the test set, the model reaches 95.74%, 98.21%, and 92.53% in F1 score, P, and R, respectively. The second part is the indirect privacy module; some indirect privacy leaks are difficult to uncover. This module combines the roformerBERT model and the UniLM framework to construct an IPI model, and interpretable text information is added when training the model. The designed IPI model can not only identify which interest privacy is being leaked in social text but also indicate which information in the corresponding text supports that inference. The model in this module reaches the following indexes: main is 97.63%, rouge-1 is 98.36%, rouge-2 is 96.55%, and rouge-L is 98.36%.
The proposed model framework can be applied to social network scenarios, text desensitization, privacy measurement calculations, and other scenarios. We also develop an application that provides users with privacy-aware services. Our application adopts a lightweight model: the data provided by users are computed only locally, and no data collection is performed.
Although the model framework developed in this study achieves good results in social network PA, it still requires considerable improvement to preserve personal privacy data across an entire social network. Regarding the definition of privacy, this study adopts a sweeping definition; however, the subject of privacy is humans, and different individuals define the scope of privacy differently. Designing a personalized PA framework is a subject of our future research. Likewise, common privacy disclosure is the primary source of privacy disclosure in social networks, and recognizing common privacy disclosure is another subject for future research.

A. Derivation of the Formula of Loss Function
For the multi-label classification task, our goal is that the score of every target class is no less than that of every non-target class. Let P be the set of label classes, Q the set of non-label classes, and S_i the score of class i. The loss value is calculated using a cross-entropy-style function as shown in the following formula:

log(1 + Σ_{i∈Q, j∈P} e^{S_i − S_j}). (A.1)

We then introduce a threshold class 0 so that the label classes score greater than S_0 and the non-label classes score less than S_0. To satisfy S_i < S_0 and S_0 < S_j, the terms e^{S_i − S_0} and e^{S_0 − S_j} must be added to the loss calculation formula. Rewriting formula (A.1) gives the following formula:
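With the threshold score S_0 fixed at 0, formula (A.2) factorizes into the product (1 + Σ_{i∈Q} e^{S_i})(1 + Σ_{j∈P} e^{−S_j}), i.e. a sum of two log-sum-exp terms, which is how it is conveniently implemented. Below is an illustrative NumPy version; the function name is ours.

```python
import numpy as np


def multilabel_ce(scores, labels):
    """Loss (A.2) with the threshold class score S_0 fixed at 0.

    With S_0 = 0 the expression factorizes into
      log(1 + sum_{i in Q} e^{S_i}) + log(1 + sum_{j in P} e^{-S_j});
    multiplying the two factors out recovers the pairwise term
    e^{S_i - S_j} of formula (A.2) exactly.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    neg = scores[~labels]   # Q: non-label classes, pushed below 0
    pos = scores[labels]    # P: label classes, pushed above 0
    # log(1 + sum e^x) computed via a running logaddexp for stability
    neg_loss = np.logaddexp.reduce(np.concatenate(([0.0], neg)))
    pos_loss = np.logaddexp.reduce(np.concatenate(([0.0], -pos)))
    return neg_loss + pos_loss
```

The `logaddexp` reduction avoids overflow when scores are large, which a naive `log(1 + exp(...).sum())` would not.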

Figure 1: Overview of the automatic PA framework for text data in OSNs. First, the framework uses NLP technology to transform the features of the obtained text data, including word segmentation and tokenizer operations. Then, we use DL models for privacy entity sensing and inference: the DPER model for privacy entity sensing and the IPI model for IPI. Finally, we combine the calculated values of the two models and send them as feedback to the user.

Figure 2: An overview of the DPER model. The input to this model is text data, which are transformed into the required data format by a tokenization operation. Features are extracted using the roformerBERT model and the BI_LSTM model. Finally, the GP algorithm is used to predict the privacy entities. The parameter n indicates the length of the input sentence. The output S[i:j] indicates that a privacy entity appears from the ith to the jth position in the input text.

3.3. IPI Model. The IPI model is composed of the roformerBERT pretrained model and the UniLM model. The overall model diagram is presented in Figure 5. Formula (10) is a feature representation of the IPI model that uses the

Figure 5: An overview of the IPI model. This model follows the traditional seq2seq framework, which is divided into encoding and decoding operations. In the encoding operation, the roformerBERT model is used for feature extraction, and inference data are generated through the seq2seq LM in the UniLM framework. The beam search algorithm, copy operation, and softmax function are used in the decoding operation.

4.4.1. Effect of the DPER Model. This study constructs a test set of 560 pieces of data to evaluate the final DPER model. The test set contains 1,123 privacy entities, including 456 LOC, 188 BI, 204 EDU, 181 JOB, and 90 COM privacy entities. The model is evaluated in terms of ACC, F1, P, and R. Tables 3 and 4 present these aspects.

Figure 10: The F1 value changes during the training of each learning rate of the DPER model.

Figure 14: Performance of the various BERT models in recognizing each privacy entity.

4.7. Complexity Analysis
4.7.1. Number of Parameters. The model parameters and test times of our proposed DPER and IPI models are provided in Table 7.

(1) Time Complexity of the DPER Model. The DPER model is composed of the roformerBERT model, the BI_LSTM model, and the GP algorithm; thus, the overall time complexity of the model is the sum of those of these three components. The following discussion analyzes the overall model time complexity.

log(1 + Σ_{i∈Q, j∈P} e^{S_i − S_j} + Σ_{i∈Q} e^{S_i − S_0} + Σ_{j∈P} e^{S_0 − S_j}). (A.2)

Table 1 :
The main symbols in the model calculation process.

Table 2 :
Distribution of the number of statements in the privacy entity recognition dataset.
Figure 7: Distribution of the number of privacy items in the privacy recognition dataset.
Selection of Hyperparameters of the DPER Model. The DPER model adopts the roformerBERT + BI_LSTM + GP structure, in which the major hyperparameters are batch size, cycle number, and learning rate. Batch size and cycle number, set to 16 and 30 in this research, respectively, affect the training speed of the model. The learning rate is decisive for the final effect of the model, and this work reduces the learning rate from 1e − 1 to 1e − 10 in order-of-magnitude steps of 0.1. During training, we discover that gradient explosion occurs when the learning rate is greater than 1e − 3 or less than 1e − 7. Therefore, we conducted tests between 1e − 3 and 1e − 7, and the results of the learning rate training are presented in Figures 9 and 10. When the learning rate is 1e − 4, the optimal F1 of the training model is 96.72%; at 1e − 5, it is 98.83%; and at 1e − 6, it is 81.68%. Hence, the learning rate used in the experiment is 1e − 5.
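The order-of-magnitude screening procedure described above amounts to a small grid search over candidate learning rates; a schematic version follows. Here `train_fn` is a hypothetical stand-in for a full training run that returns the best validation F1 for a given learning rate.

```python
def screen_learning_rates(train_fn, rates):
    """Grid-screen learning rates and keep the one with the best F1.

    `train_fn(lr)` is assumed to train the model at learning rate `lr`
    and return its best validation F1; any callable with that contract
    works.  Returns the winning rate and the full result table.
    """
    results = {lr: train_fn(lr) for lr in rates}
    best = max(results, key=results.get)
    return best, results
```

With the F1 values reported in the text (96.72% at 1e − 4, 98.83% at 1e − 5, 81.68% at 1e − 6), the screen selects 1e − 5.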
Figure 9: The loss value changes during the training of each learning rate for the DPER model.

Table 3 :
Accuracy of the DPER model for the overall privacy entity and each privacy entity.

Table 4 :
F1, P, and R of the DPER model for the overall privacy entity and each privacy entity.

Table 5 :
Comparison of various pretrained models in privacy entity recognition models.
where E in L,V represents the input of the embedding layer, L the length of the input text, and V the dimension of the word dictionary; E out V,H represents the output of the embedding layer, and H the dimension of the word vectors output by the roformerBERT model. The time complexity of the position encoding layer is O(P in 1,L * P out L,H), where P in 1,L is the input and P out L,H the output of the position encoding layer. The time complexity of the attention layer is C * O(A in L,H * A out H,64), where C is the number of attention layers, A in L,H the input, and A out H,64 the output of the attention layer. The time complexity of the dense layer is O(C * L * H * H), where C is the number of dense layers. The time complexity of the add and norm layer is C * [O(L * H) + O(H * 2 * 2)]. The time complexity of the feedforward fully connected layer is C * O(L * H * I), where I is the number of hidden neurons in the feedforward fully connected layer.
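Summing the per-layer terms above gives a rough cost estimate for one encoder forward pass; a sketch follows. The function name is ours, the head dimension of 64 is folded into the attention term exactly as in the text, and constant factors are dropped, so this is an order-of-magnitude estimate only.

```python
def roformer_encoder_cost(L, H, V, C, I):
    """Rough forward-pass cost from the per-layer big-O terms in the text.

    L: text length, H: hidden size, V: vocabulary size,
    C: number of layers, I: feed-forward inner size.
    Each line mirrors one term of the complexity analysis.
    """
    embedding = L * V * H                 # O(E_in[L,V] * E_out[V,H])
    position  = L * H                     # O(P_in[1,L] * P_out[L,H])
    attention = C * (L * H * 64)          # C * O(A_in[L,H] * A_out[H,64])
    dense     = C * (L * H * H)           # O(C * L * H * H)
    add_norm  = C * (L * H + H * 2 * 2)   # C * [O(L*H) + O(H*2*2)]
    ffn       = C * (L * H * I)           # C * O(L * H * I)
    return embedding + position + attention + dense + add_norm + ffn
```

Plugging in the 6-layer versus 12-layer configurations shows why the small model roughly halves the per-sentence cost: every C-scaled term doubles with the layer count while the embedding term stays fixed.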

Table 6 :
Comparison with the other entity recognition models.
Figure 16: The recognition effect of each entity recognition model on each privacy entity.
L denotes the length of the text, and H represents the output word vector dimension of the roformerBERT model.
The time complexity of the MLM-dense layer is O(L * H * H), that of the MLM-norm layer is O(L * H) + O(H * 2 * 2), and the MLM-bias layer performs no practical operation and only provides data. The time complexity of the MLM-activation layer is O(M in L,H * M out H,V), where M in L,H represents the input and M out H,V the output of the MLM-activation layer.

Table 7 :
Comparative analysis of the model parameters and the test time.

Table 8 :
Comparison of the DPER model with the traditional self-disclosure privacy identification model of social networks.

Table 9 :
The DPER model's predicted result for the example sentence.

Table 10 :
The IPI model's predicted result for the example sentence.