CLDA: An Effective Topic Model for Mining User Interest Preference under Big Data Background


Introduction
Against today's big data background, accurately mining users' interest preferences in specific fields from large-scale data is an important part of enterprise activity planning. The emergence of social networks, represented by microblogs, has made a large number of users more willing to share their interests in various fields. Microblog platforms provide a large amount of user emotion data, which can be used to mine users' interest preferences in specific areas. Mining the abundant user data on microblog platforms can therefore effectively reveal user interests and bring substantial commercial value.
At present, most research on microblogs analyzes the relationships between users and communities [1], and few studies address microblog content. Traditional text mining algorithms are mainly designed for traditional corpora and do not take into account the special structural information contained in microblog texts, so they cannot model microblog text very well. Topic models are an effective method of text mining, but traditional topic models such as pLSA [2] and LDA [3] learn the latent topics of a corpus from the co-occurrence of words within documents. As a result, topic models often suffer from severe data sparsity when applied to short microblog texts. A popular and effective strategy for overcoming this bottleneck is to aggregate short texts into long texts based on user information, title categories, and so on [4,5]. However, these methods are heuristic and highly dependent on the data. In addition, the aggregated long texts are highly redundant, which reduces the accuracy of mining user interest preferences. This article builds on these previous studies. For text processing, we exploit the characteristics of microblogs by introducing time dynamics for each user's short texts and by extending short texts using user-generated short texts and information retrieval tools. To handle short texts and long texts together, we propose Combining Latent Dirichlet Allocation (CLDA), a new topic model that learns the latent topics of microblog short texts and long texts simultaneously. The data sparsity of short texts is avoided by using aggregated long texts to assist the learning of short texts. In turn, short texts are used to filter long texts to improve mining accuracy, so that long texts and short texts are effectively combined. We use Gibbs sampling for inference in our model. The experimental results show that this model is superior to many other advanced models in mining interest preferences in specific fields from a large amount of microblog user data.
The main contributions of this paper are as follows:
(1) A long text acquisition scheme aggregates microblog short texts per user and expands microblog short texts with an information retrieval tool.
(2) A dynamic time attribute and the Ebbinghaus forgetting curve are integrated into microblog short texts, which makes the mining of users' interest preferences more reasonable.
(3) A new topic model, CLDA, is proposed to learn the latent topics of microblog short text and long text at the same time. By using long text to assist the learning of short text, the data sparsity of short text is avoided, and short text is used to filter long text to improve mining accuracy.
(4) The experimental results demonstrate the superiority of the proposed method, which can be extended to recommendation in various fields.
The outline of this paper is as follows: related work is briefly reviewed in Section 2; Section 3 briefly introduces the LDA topic model; the proposed method is introduced in detail in Section 4; the experimental results on a real Sina Microblog data set are given in Section 5; and finally, in Section 6, we draw conclusions and outline future work.

Related Work
The predecessor of the probabilistic topic model can be traced back to LSA (Latent Semantic Analysis) [6]. LSA is based on a term space and represents documents by their latent semantics in a low-dimensional space, but it cannot solve the problem of polysemy. Hofmann proposed PLSA (Probabilistic Latent Semantic Analysis) [7,8] to address this defect of LSA, mainly by associating each dimension with a probability distribution over the dictionary. However, PLSA does not provide a probabilistic model at the document level, and the number of parameters to be estimated grows linearly with the size of the corpus, which easily leads to overfitting. LDA (Latent Dirichlet Allocation) [3] is a generative model that places a Dirichlet prior on the topic distribution to overcome the shortcomings of PLSA. The model can find the semantic structure of a text collection and mine the topics of the text.
Xiong et al. [9,10] used the existing user interest modeling method LSARS to represent users' interests and fused the LDA topic model with geospatial attributes to overcome data sparsity in user interest mining. Specifically, they first divided the geospatial attributes into subregions in which a person's interests can be inferred from a set of topics. Exploiting the interdependence between regions and topics, LSARS combines geographical clustering and LDA topic modeling into a single process. To further reduce the data sparsity of user behavior, LSARS also integrates the crowd's preferences [11]. However, these approaches that incorporate geolocation attributes are only applicable to mining local user interest preferences, and their performance in other settings is less than ideal. Our approach instead incorporates time dynamics and is independent of geographic location.
In short text processing, many strategies have been widely used in data mining tasks, especially query expansion with relevance feedback [12,13], semantic relatedness analysis [14,15], short text classification [16,17], and interest extraction [18,19]. However, short texts are often very sparse and often do not work well when processed directly. Tang et al. [20] proposed an end-to-end solution to short text sparseness that automatically learns how to extend the short text to optimize a given learning task. A novel deep memory network was proposed to automatically extract relevant information from a long list of documents and to reformulate the short text through a gating mechanism, thereby avoiding short text sparsity. The expanded text usually takes the form of an interpolation between the original short text and the retrieved documents before it is used for other tasks. These methods are only intended to solve the sparseness of short text data and ignore the noise contained in the retrieved documents, and the interpolation weights are set heuristically, so errors may accumulate and compromise the accuracy of the final result. In contrast, we use long texts to help short texts perform learning tasks and thus overcome the data sparsity of short texts. Short texts, in turn, filter the extended data sets to reduce noise interference and improve the accuracy of the final result.
Since the LDA topic model was proposed, it has received extensive attention from researchers, and many scholars have continuously improved it to achieve the desired topic mining effect. Rosen-Zvi et al. [21] proposed the Author-Topic Model (ATM), which aggregates user tags into a large document to analyze and model users' interests. Ramage et al. [22,23] proposed Labeled-LDA, which models the topics of microblog texts, associates topics with tags, and thereby learns text topics in a supervised manner. Weng et al. [4] proposed User-LDA, which merges all the text published by each Twitter user into one long text and then uses the standard LDA model to extract the user's interests from it. Zhao et al. [24] argued that microblog posts are relatively short and proposed Twitter-LDA, which assigns a single topic to all the words in a post. However, on a highly dynamic social platform such as microblog, new topics appear constantly and user interest preferences change constantly. Topic mining on aggregated long texts does not represent users' dynamic interest preferences well, and each short text of a user also contains interest information. Therefore, we propose CLDA, a new topic model that learns the latent topics of microblog short texts and long texts simultaneously and avoids the data sparsity of short texts by using aggregated long texts to assist the learning of short texts. In turn, short texts are used to filter long texts to improve mining accuracy, so that long texts and short texts are effectively combined.

Latent Dirichlet Allocation (LDA)
In topic discovery, public opinion analysis, text categorization, and related fields, the LDA topic model has become a common way to capture the distributional similarity between words (in vocabulary, semantics, or even syntax). The basic idea is to represent each document in a collection as a probability distribution over topics and to extract the topic distribution by analyzing the documents. Topic clustering or text classification can then be carried out on the basis of the topic distribution. LDA is also a typical bag-of-words model; that is, a document consists of a set of words, and there is no order relationship between the words. In addition, a document can contain multiple topics, and each word of the document is generated by one of these topics. The document generation process is described in Algorithm 1, which corresponds to the graphical model shown in Figure 1, where the variables $\theta$, $\phi$, and $z$ (the assignment of words to topics) are three sets of latent variables to infer; each column of $\theta$ gives the probability of each topic occurring in a document and is a nonnegative normalized vector. The hyperparameters $\alpha$ and $\beta$ are constants in the model and need to be set manually. $w_n$ represents the $n$th word generated; $z_n$ represents the selected topic; $p(\theta \mid \alpha)$ represents the probability distribution of the topic distribution $\theta$ given $\alpha$; $p(\phi \mid \beta)$ is similar. The arrows in the figure indicate the conditional dependencies between the variables, while the boxes refer to repeated sampling steps, with $N$ and $M$ indicating the number of samples. The box around $\phi$ means that a word distribution is drawn for each topic $k$ until $K$ topics have been generated.
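The generative process described above can be sketched in a few lines of Python. This is a minimal illustration of standard LDA with symmetric Dirichlet priors, not the authors' Algorithm 1; the function names, the fixed document length, and the integer word ids are assumptions made for the example.

```python
import random


def sample_dirichlet(alphas, rng):
    """Draw one sample from Dirichlet(alphas) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Draw an index with probability proportional to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    """Generate (word id, topic id) pairs following the LDA generative story.

    Assumption: symmetric priors and a fixed document length, for brevity.
    """
    rng = random.Random(seed)
    # One word distribution phi_k per topic, drawn from Dir(beta).
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # One topic distribution theta per document, drawn from Dir(alpha).
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)    # topic for this word position
            w = sample_categorical(phi[z], rng)   # word id drawn from that topic
            doc.append((w, z))
        corpus.append(doc)
    return phi, corpus
```

Calling `generate_corpus(3, 5, 2, 6, alpha=0.5, beta=0.5)` returns the per-topic word distributions and three documents of five (word, topic) pairs each; inference runs this story in reverse, recovering the hidden topics from the observed words.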
From the generative process of the LDA topic model, we can see that LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in large-scale document collections or corpora. It is an effective method for mining large text data. However, microblog posts are short and contain few representative words, so applying LDA directly to short texts such as microblog posts does not work well. Because the LDA topic model is a bag-of-words model that does not consider the order of words, it reduces the complexity of the problem and leaves room for improving the model.

Proposed Method
4.1. The Reason for the Proposed Method. The large amount of user data on microblogs is characterized by short length and few representative words, so learning a topic model directly from short microblog texts is much more difficult than learning one from traditional long texts. For this reason, many scholars have proposed training topic models on aggregated long texts in the same field and then using them to infer topics for short texts, which helps short text learning tasks [25,26]. However, on highly dynamic social platforms such as Weibo, new topics appear constantly and user preferences change constantly, so it is particularly important to track users' preferences more closely. In this section, we describe a way to better mine users' interest preferences by designing a new topic model, Combining Latent Dirichlet Allocation (CLDA). When learning topics from short texts, long texts can be used as auxiliary features to solve the data sparseness problem of short texts. When learning topics from long texts, short texts can be used to filter the long texts to improve the accuracy of mining user interest preferences. The model combines the advantages of short text and long text and can choose suitable hyperparameters $\alpha$ and $\beta$. We process all the long texts and short texts in the corpus and output a $K$-dimensional topic vector. Our long text aggregation strategy is to aggregate each user's short text content into that user's long text. To address the sparsity of short texts in topic modeling, we adopt a strategy that combines information retrieval technology with Wikipedia data. The main approach is to build a search engine over Wikipedia data, submit the keywords of each short text as a query, and use the returned results as a feature extension of the short text. In this way, the original short text set and the auxiliary long text set are constructed. Our inspiration for CLDA comes from the observation that long texts are easier to model than short texts [27].
4.2. Long Text Processing Method. Microblog posts are short and contain few representative words, so LDA modeling on short text has serious data sparsity problems. Therefore, many scholars use the original short text $q$ as a query to retrieve a large set of potentially relevant long documents $C_q$ from a large external collection $C$. These documents are used by the model as raw material for text extension. The goal of this step is to obtain relevant documents with a high recall rate. Existing technologies, such as inverted indexes used in information retrieval, locality-sensitive hashing for high-dimensional data points, and APIs of existing search engines, can be used to implement the process [26]. To ensure a high recall rate, the number of returned documents has to be set quite large, for example, tens or hundreds of documents, but the resulting long text then contains significant noise. We randomly draw long documents $d_i$ from the set $C_q$ returned by the query and aggregate all the original short texts of the user into a long text $L_s$, forming a group of long document vectors $\vec{L} = \{d_1, d_2, \ldots, d_n, L_s\}$ to assist the short texts.
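As a rough illustration of this retrieval step, the following sketch builds an inverted index over a tokenized external collection, ranks documents for a short-text query, and concatenates a user's short texts into one long text. The IDF-sum scoring and all function names are assumptions chosen for brevity; the paper does not specify its ranking function, and a production system would more likely rely on an existing search engine.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(corpus_tokens):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in enumerate(corpus_tokens):
        for term in set(tokens):
            index[term].add(doc_id)
    return index


def retrieve(query_tokens, corpus_tokens, index, top_k):
    """Return ids of the top_k documents ranked by summed IDF of shared terms.

    Assumption: a simple IDF-overlap score stands in for the unspecified
    ranking function.
    """
    n_docs = len(corpus_tokens)
    scores = Counter()
    for term in set(query_tokens):
        postings = index.get(term, ())
        if not postings:
            continue
        idf = math.log(1 + n_docs / len(postings))
        for doc_id in postings:
            scores[doc_id] += idf
    return [doc_id for doc_id, _ in scores.most_common(top_k)]


def aggregate_user_texts(short_texts):
    """Concatenate a user's tokenized short texts into one long text."""
    return [tok for text in short_texts for tok in text]
```

A larger `top_k` raises recall at the cost of admitting noisier documents, which is the trade-off the paragraph above describes.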

4.3. Short Text Processing Method. A user's interest preferences in a specific field can be derived from the user's historical behavior records and the keyword interest distribution of the user's microblog posts. However, a user's posts are published at different times, so the temporal attributes of the short texts reflect the degree of the user's interest preference in a particular field at a given point in time. This preference can be expressed by a set of weight vectors defined in terms of the following quantities: the total number of short texts published by the user; the topic keyword distribution of the user's microblog posts; $S_k$, the user's degree of interest preference for the $k$th topic in the specific field; the number of topics $K$; a constant; and the weight of interest preferences over time in a particular area of the user. This time weight follows the Ebbinghaus forgetting curve [28], a curve that describes how human memory changes over time. Some scholars have previously used it to model preferences and shown that the evolution of user preferences follows the same rule as the curve [29,30]: as time passes, the user's interest preference declines sharply at first and, after decaying to a certain extent, declines steadily. The time weight depends on the current time $t$ and the time $t_1$ at which the user posted the document. Because users' interest preferences change at different speeds, we add a dynamic activity parameter $T$ to establish a different Ebbinghaus forgetting curve for each user. The activity sliding window mechanism for each user is as follows:
(1) Set the initial value $T = 7$; that is, the minimum observation period of the active sliding window is 7 days.
(2) Calculate the total number of microblogs originally created and forwarded by the user within $[t - T, t]$, denoted as $n$.
(3) If the value of $n$ within the time sliding window is less than 30, the time window is expanded: $T = T + 2$.
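The sliding window steps above can be sketched as follows. The window logic follows the three steps directly (start at 7 days, require at least 30 posts, grow by 2 days); the `max_days` cap and the exponential form of `forgetting_weight` are assumptions added for illustration, since the exact forgetting-curve formula is not reproduced in this text.

```python
import math


def active_window_days(post_days, now, initial=7, min_posts=30, step=2, max_days=365):
    """Grow the window [now - T, now] until it contains at least min_posts posts.

    post_days holds posting times in days. max_days is an assumed cap so that
    the loop terminates for users with very few posts.
    """
    window = initial
    while window < max_days:
        count = sum(1 for d in post_days if now - window <= d <= now)
        if count >= min_posts:
            break
        window += step
    return window


def forgetting_weight(post_day, now, window):
    """Assumed decay exp(-(t - t1) / T); a stand-in, not the paper's formula."""
    return math.exp(-(now - post_day) / window)
```

With this form, a very active user keeps the short 7-day window and forgets old posts quickly, while a sparse poster receives a wider window and a flatter decay.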

4.4. Combining Latent Dirichlet Allocation (CLDA).
In the CLDA model, we rely on two key ideas for combining short and long texts.
(1) We process short texts and long texts in two different ways (as described in Sections 4.2 and 4.3) and establish a new topic model that handles both the auxiliary long text data and the target short text data. This approach captures the main topics in each of the two data sets separately, derives the interest preferences in each short text of a user and the interest preferences in the auxiliary long text, and filters out irrelevant or inconsistent topic interest preferences in the auxiliary data.
(2) We also use different generative procedures for the auxiliary long text and the target short text, so that the model mines user interest preferences in specific fields more accurately and uses topics to generate documents belonging to their field.
In order to better combine the long text with the short text, we use CUI to express the preference of a user. It captures the relationship between the user's preferences in all short texts and in the long text. CUI is defined in terms of the weight vector of each short text of a user and the interest value mined from the aggregated long texts. When the aggregated long texts reflect no user preference in a particular field, this value is 0; otherwise, it takes a fixed value. Two coefficients represent the weights of short text and long text, respectively.
We propose a new model for auxiliary long text data and target short text data, called Combining Latent Dirichlet Allocation (CLDA), which considers the relationship between short text and long text of microblog users based on LDA.
The CLDA generation process is shown in Algorithm 2, and the Bayesian network diagram is shown in Figure 2. One variable in the diagram represents the temporal effect of the short text, and its value is determined by the time weight. Another variable represents the relationship between long text and short text; it follows a binomial distribution whose parameter has a Dirichlet prior. Its main role is to let long text assist the short text learning task, and its value is decided by the CUI. The plates in the diagram indicate the number of samples in the short text and in the long text, respectively. The left side of Figure 2 is the topic generation process for short texts, the right side is the generation process for long texts, and the middle combines long texts and short texts. First, the short text and the long text on the left and right sides of Figure 2 each select a topic-word distribution $\phi$ for each topic from a Dirichlet distribution with their respective hyperparameters $\beta$. This process corresponds to step (1) of Algorithm 2. Second, when generating documents, the model selects the topic distribution $\theta$ using only the Dirichlet hyperparameter $\alpha$ of the short text if no long text is associated with the interest preference in the current user's specific domain.
If a long text is associated with the interest preference in the current user's specific domain, the topic distribution $\theta$ is selected using the Dirichlet hyperparameters $\alpha$ of both the short text and the long text. This process corresponds to step (2) of Algorithm 2. Finally, according to the topic distribution $\theta$, a topic is selected for each word position, and then a word is selected from that topic's word distribution. This is repeated until the long and short texts have their own documents, which are placed in the document collection. The joint probability of the final long text and short text is calculated through the CUI. This process corresponds to step (3) of Algorithm 2.
In the CLDA model, the topic distribution of microblog texts is defined in terms of a quantity that follows equation (6).
4.5. Model Inference. We used the Gibbs sampling method to derive the CLDA model. Gibbs sampling, one of the most widely used Markov Chain Monte Carlo (MCMC) methods, draws a sequence of samples that approximates a given multidimensional joint probability distribution (over two or more random variables).
At each step, a new topic assignment is sampled for one word from its conditional distribution. Based on the assignments of all the other latent variables, we can infer the topic of a word, where $z_i = k$ indicates that the $i$th word in the document is assigned to topic $k$; $z_{\neg i}$ represents all topic assignments except that of the $i$th word; $V$ is the total number of words in the dictionary; $n^{(v)}_{k,\neg i}$ is the number of times that term $v$ of the dictionary is assigned to topic $k$, excluding the $i$th word; $n^{(k)}_{m,\neg i}$ is the number of words in document $m$ assigned to topic $k$, excluding the $i$th word; and an indicator records whether this Weibo text is empty.
The two conditional probabilities, one over words and one over topic assignments, are then derived, which yields two equations, where $n^{(v)}_{k}$ represents the number of times the term $v$ in the dictionary is assigned to topic $k$; $n^{(k)}_{m}$ represents the number of words in document $m$ assigned to topic $k$; and the indicator again records whether this Weibo text is empty. Subscripted symbols denote the corresponding elements of a vector.
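To make the sampling step concrete, here is a minimal collapsed Gibbs sampler for standard LDA. It is a sketch under stated assumptions, not the CLDA sampler: it omits the long-text/short-text coupling and the empty-text indicator, uses symmetric priors, and all names are illustrative. The update it applies is the standard one, $p(z_i = k \mid z_{\neg i}, w) \propto \frac{n^{(v)}_{k,\neg i} + \beta}{n_{k,\neg i} + V\beta}\,(n^{(k)}_{m,\neg i} + \alpha)$.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, iters=50, seed=0):
    """Collapsed Gibbs sampling for standard LDA; docs are lists of word ids."""
    rng = random.Random(seed)
    n_mk = [[0] * num_topics for _ in docs]               # document-topic counts
    n_kv = [[0] * vocab_size for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                                # total words per topic
    z = []
    # Random initial topic assignment for every word.
    for m, doc in enumerate(docs):
        z.append([])
        for w in doc:
            k = rng.randrange(num_topics)
            z[m].append(k)
            n_mk[m][k] += 1
            n_kv[k][w] += 1
            n_k[k] += 1
    for _ in range(iters):
        for m, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment to obtain the "not i" counts.
                n_mk[m][k] -= 1
                n_kv[k][w] -= 1
                n_k[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (n_kv[t][w] + beta) / (n_k[t] + vocab_size * beta)
                    * (n_mk[m][t] + alpha)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights, k=1)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kv[k][w] += 1
                n_k[k] += 1
    # Posterior mean estimate of each document's topic distribution.
    theta = [
        [(c + alpha) / (len(doc) + num_topics * alpha) for c in n_mk[m]]
        for m, doc in enumerate(docs)
    ]
    return theta, z
```

Each sweep removes one word's assignment, resamples it from the conditional above, and restores the counts, so the counts always describe a complete assignment.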

5.1. Data Set.
The data set used in this article comes from Sina Microblog. We crawled 234,687 microblog texts posted by a total of 600 users in six areas from January 2017 to November 2017.
As a first step, we filtered Sina Weibo posts by language code and mainly retained Chinese posts. Subsequently, we performed some basic cleanup, such as replacing usernames and removing labels, URLs, numbers, and common symbols. Finally, we used the Ansj Chinese word segmentation tool for text segmentation, removing all punctuation marks and English tokens from the strings. We also built a stopword list to eliminate very common and very rare words.
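A cleanup pass of this kind might look like the sketch below. The regular expressions, the `USER` placeholder token, and the treatment of `#label#` hashtags are assumptions for illustration; the paper does not list its exact rules, and word segmentation and stopword filtering are not shown.

```python
import re

# Assumed patterns; real Weibo data may need broader rules.
URL_RE = re.compile(r"https?://[A-Za-z0-9./?=&_%\-]+")
MENTION_RE = re.compile(r"@[\w\-]+")
HASHTAG_RE = re.compile(r"#[^#\s]+#?")
DIGIT_RE = re.compile(r"\d+")
SYMBOL_RE = re.compile(r"[^\w\s]")


def clean_post(text, user_token="USER"):
    """Strip URLs, labels, numbers, and symbols; replace usernames with a token."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(f" {user_token} ", text)
    text = HASHTAG_RE.sub(" ", text)
    text = DIGIT_RE.sub(" ", text)
    text = SYMBOL_RE.sub(" ", text)
    # Collapse the whitespace left behind by the substitutions.
    return " ".join(text.split())
```

The URL pattern is restricted to ASCII characters so that Chinese text written directly after a link is preserved.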
We aggregated the cleaned data per user into 600 user long texts. After that, we processed the short texts as follows:
(i) When the number of words in a short text is less than 2, the short text is deleted.
(ii) When the number of words in a short text is between 2 and 50, we use a novel end-to-end extended feature information retrieval technology to lengthen the short text into a long text and retain the original short text.
(iii) When the number of words in a short text is greater than 50, the original short text is kept without any further operation.
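The three length rules above amount to a small routing function. In this sketch the boundaries 2 and 50 are treated as inclusive for the "extend" case, which is an assumption, and the returned action labels are hypothetical names.

```python
def route_short_text(tokens, min_words=2, max_words=50):
    """Decide how a tokenized short text is handled, based on its length.

    Assumption: "between 2 and 50" is read as inclusive on both ends.
    """
    n = len(tokens)
    if n < min_words:
        return "delete"   # too short to carry topic information
    if n <= max_words:
        return "extend"   # lengthen via retrieval, keep the original as well
    return "keep"         # already long enough, use as is
```

For example, a post segmented into 12 words is routed to "extend", so both the original text and its retrieved extension enter the corpus.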
Finally, the numbers of valid raw short texts and effective auxiliary long texts that we obtained, and the fields selected for them, are shown in Table 1.
LDA-S: the short text is processed according to the method presented in Section 4.4, and LDA is learned from the short text.
MB-LDA: a generative model based on LDA for topic mining on Weibo [31,32].
VSM: the vector space model simplifies the processing of textual content to vector operations in a vector space and expresses semantic similarity as spatial similarity. We use the VSM method to process the text data of each user $u_i$ to obtain the word weight vector $W_i = \{w_1, w_2, \ldots, w_n\}$, where $w_j$ is the weight of word $j$ in the text data of user $u_i$; we use TF-IDF to calculate the weight values. In the recommendation, the similarity between users is calculated with the conventional cosine of the angle between their vectors.
The micro-precision rate (Micro-P) is defined as in (12), the micro-recall rate (Micro-R) as in (13), and the micro-F1 value (Micro-F1) as in (14). Here TP is the number of texts correctly classified into a user's interest preference; FP is the number of texts that the model incorrectly classifies into a user's interest preference; FN is the number of texts belonging to a user's interest preference that the model classifies into other interest preferences. The totals are taken over all user interest preferences in a particular area. The higher the value of Micro-F1, the better the classification performance.
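The VSM baseline and the micro-averaged metrics can be illustrated with the short sketch below. It uses the standard textbook definitions (TF-IDF with a plain logarithmic IDF, cosine similarity, and micro-averaging over pooled TP/FP/FN counts); the function names and the exact TF-IDF variant are assumptions, since the paper's formulas (12) to (14) are not reproduced in this text.

```python
import math
from collections import Counter


def tfidf_vectors(user_tokens):
    """Build a sparse TF-IDF vector per user from {user: [tokens]}."""
    n_users = len(user_tokens)
    df = Counter()
    for tokens in user_tokens.values():
        df.update(set(tokens))
    vectors = {}
    for user, tokens in user_tokens.items():
        tf = Counter(tokens)
        vectors[user] = {
            w: (c / len(tokens)) * math.log(n_users / df[w]) for w, c in tf.items()
        }
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def micro_prf(counts):
    """Micro-averaged precision, recall, and F1 from per-class (tp, fp, fn)."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Micro-averaging pools the counts before dividing, so interest categories with many texts influence the score more than rare ones.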
In a recommendation system, recommendation is often based on the similarity between users. We take the top $t$ most similar users as the recommendation list for user $u_i$, forming the recommended set $R_i = \{u_1, u_2, \ldots, u_j, \ldots, u_t\}$. For each user $u_j$ in the recommended set, we determine whether $u_j$ is in the same specific area as $u_i$; if so, recommending $u_j$ to $u_i$ is considered correct. The recommendation accuracy of a single user $u_i$ is given by formula (15), whose indicator equals 1 when $u_i$ and $u_j$ belong to the same area and 0 when they do not. The recommendation accuracy for users in a given area is given by formula (16), which involves the total number of users in the area where user $u_i$ is located.
5.3. Result. Figure 3 shows the classification performance for each number of topics $K$ on the Sina microblog data set. We set the parameter $\alpha$ to $50/K$ and the parameter $\beta$ to 0.01. The abscissa is $K$, and we examine the effect on each model of changing $K$. The ordinate is the Micro-F1 value, which shows the performance of each model in capturing users' interest preferences. As shown in Figure 3, LDA-S achieves better results when the number of topics $K$ is small; at $K = 4$, its Micro-F1 reaches a maximum of 61.8%. As the number of topics $K$ increases, LDA, MB-LDA, LDA-L, and CLDA reach relatively high values and then decrease gradually. This result shows that the performance of each model suffers when the number of topics is too large or too small. The Micro-F1 values of MB-LDA, LDA, and LDA-L reach their maxima of 73.8%, 66.7%, and 70.8%, respectively, at $K$ values of 8 and 10. Because CLDA combines the advantages of both LDA-S and LDA-L, it achieves relatively good performance at both small and large $K$. At $K = 10$, the Micro-F1 value of CLDA reaches a maximum of 76.1%, which is higher than that of the other models.
Figure 4 shows the relationship between the Micro-F1 value of the CLDA model and the weight of the short text in the CUI when $K$ is set to 10. As Figure 4 shows, a short-text weight of 0.5 gives the best performance, which indicates that long text and short text carry equal weight in the CLDA model.
Since CLDA models can recommend users with similar interests, we use the user recommended accuracy values described in Section 5.2 to measure the quality of the model.The results of the comparison of accuracy values between CLDA and VSM in the education, entertainment, medical, and traffic fields are illustrated in Figures 5-8.The average results of the accuracy values in the various fields of CLDA and VSM are shown in Figure 9.
In Figures 5-9, we show the results for the top $t$ = 10, 20, and 30 recommended users. The number of topics in CLDA is set to the optimal value $K = 10$. On the whole, the recommendation performance of CLDA is better than that of VSM. In Figures 5 and 6 in particular, the recommendation results in the education and entertainment fields stand out, with accuracies of 75% and 72.1%, respectively, at $t = 10$.
However, Figures 5-9 also show that, on the Sina data set we collected, every model recommends best at $t = 10$, and the recommendation performance begins to decline as $t$ increases. CLDA is affected more strongly by this value than VSM, so CLDA is more sensitive to $t$ when making recommendations. This is something we will improve in the future.
In Figures 7 and 8, we also observe that the recommendation results in the medical and traffic fields deviate from those in the other fields. Analyzing the microblogs of users in these fields, we found that this may be because these fields are relatively unpopular and have relatively few users discussing them, so the microblogs published by these users are relatively broad, the content and topics involved are mixed, and the interference with topic mining is relatively large. Reducing the interference of this kind of data with user recommendation is a focus of our future work.

Conclusion and Future Work
In this paper, targeting the short texts of Weibo data, we propose a novel topic model based on LDA. The model learns the latent topics of short texts and long texts simultaneously, using aggregated long texts to assist the short text learning task and thereby avoid short text data sparsity. In turn, short texts are used to filter long texts to improve mining accuracy, so that long text and short text are used jointly and effectively. The experimental results show that our model outperforms many advanced models: it not only mines the topics of interest to users effectively but can also be applied to recommendation systems. In future work, we will continue to optimize the effectiveness and efficiency of the CLDA model and reduce the interference of uninformative Weibo posts with topic mining so that the model adapts to various fields. We will also try to incorporate more social network features and real-time microblog data processing.

Algorithm 2: Generation process of CLDA.
(1) For each topic $k \in \{1, \ldots, K\}$:
    Choose the long text multinomial distribution over terms, $\phi \sim \mathrm{Dir}(\beta_L)$.
    Choose the short text multinomial distribution over terms, $\phi \sim \mathrm{Dir}(\beta_S)$.
(2) For each document $m \in \{1, \ldots, M\}$:
    (a) If a long text is associated with an interest preference in a specific area of the current user, the multinomial topic distributions of the long text and the short text are chosen together, $\theta \sim \mathrm{Dir}(\alpha_L)$, $\theta \sim \mathrm{Dir}(\alpha_S)$. Otherwise, only the multinomial topic distributions of the short text documents $\theta_1, \theta_2, \ldots, \theta_n$ are chosen, $\theta \sim \mathrm{Dir}(\alpha_S)$.
(3) For each of the $N$ words $w_n$:
    Choose a topic $z_n \sim p(z \mid \theta)$;
    Choose a word $w_n \sim p(w \mid z, \beta_S, \beta_L)$.
Here $\theta_1, \theta_2, \ldots, \theta_n$ are the topic distributions of the respective microblog short texts of the related user documents.