Weibo Information Propagation Dissemination Based on User Behavior Using ELM

Predicting information dissemination on Weibo has been a hot topic in recent years. To study it, researchers usually extract features and apply machine learning algorithms to make the prediction, but existing approaches have some disadvantages. To address these deficiencies, we propose a new feature: the dependency between the geographical locations mentioned in a Weibo and the location of the user. We use ELM to predict the behavior of users, and we also propose an information dissemination prediction model. Experimental results show that the proposed feature is effective and that the proposed model can accurately predict the scale of information dissemination. The results also show that ELM significantly reduces the running time and performs better than the traditional SVM-based method.


Introduction
With the development of Web 2.0, social networks have become an indispensable part of people's lives. Large social networking sites such as Facebook and Twitter bring people a lot of enjoyment. Sina Weibo, one of China's largest online social networks, has more than 500 million registered users. Every day these users produce a large amount of social network data by continuously releasing and forwarding microblogs. Research on these data helps enterprises and governments discover the rules of users' network behavior and take corresponding measures. Thus, the study of Weibo has been a hot issue in recent years.
There are many research directions on Weibo, including sentiment analysis based on Weibo [1] and Weibo personalized recommendation [2]. One direction of high practical value is the study of the online behavior of users and the corresponding information propagation. Such study can help enterprises understand user behavior patterns, grasp user interest preferences, and recommend topics of interest, other users, and groups to the user. It can also help the government understand how far news spreads, judge the direction of public opinion and the reactions to it, and adjust the corresponding policies in time.
There is much research on user behavior in online social networks and on information dissemination. One common method is to extract user behavior characteristics and use a machine learning algorithm to classify and predict user behavior [3][4][5][6]. In general, researchers adopt the support vector machine (SVM) algorithm. The features they widely use are the influence of the user, the intimacy between users, the interest similarity of users, Weibo content importance, and so forth.
In daily life, people are more concerned about information related to their surroundings. This also extends to Weibo: if a Weibo involves a geographical location, the users near that location will pay more attention to it than users in other areas. Many social network applications use geographical position; for example, Lingad et al. [7] studied the extraction of disaster-related locations from microblogs, and Hosseini et al. [8] studied location-oriented phrase detection in microblogs. However, in the analysis of user online behavior and information dissemination, the dependency between the geographical locations involved in Weibos and the location of the user has not been considered.
Therefore, on the basis of summarizing previous work, we take the dependency between the geographical locations involved in Weibos and the location of the user as a new feature. Our experimental results show that we get a higher forecasting accuracy with the new feature than without it. Our experimental results also show that ELM gets higher accuracy than SVM on the same dataset.
The rest of this paper is organized as follows. Section 2 briefly introduces the related work on online social networks and ELM. Section 3 introduces the data and features we use to predict user behavior, as well as the information dissemination model. The experimental results are reported in Section 4. Finally, we present our conclusions and future work in Section 5.

Related Work
Online Social Network.
Due to the popularity of social networks, there are many studies of them. For example, Marques and Serrão [9] proposed using rights management systems to improve the content privacy of social network users; Quang et al. [10] found clusters of actors in a social network based on the topics of messages; Tseng and Chen [11] proposed an incremental SVM model to detect unwanted email; and so on.
Our main work in this paper is analyzing user behavior and information dissemination, on which there is a lot of related work. Song et al. [3] proposed 4 features to predict whether a user will forward a Weibo or ignore it: the authority of the user, the activity of the user, the preference of the user, and the social relations of the user. These four features reflect user behavior to a certain extent, but they consider neither the importance of Weibo content nor the dependency between the geographical locations involved in Weibos and the location of the user. Zaman et al. adopted a probabilistic collaborative filtering model [12,13]. They selected the user name, the number of users followed, and the number of words in the Weibo to predict the forward behavior of users. Although these features have some influence on user behavior and information dissemination, they are not the main factors affecting user behavior. Cao et al. [4] improved the prediction model by adding the Weibo content length, the Weibo importance, whether the user is an authenticated user, and some other features. The added features improved the prediction accuracy of user behavior and information dissemination, but they still did not consider the relationship between the place names mentioned in Weibos and the users. Some other works are also helpful. For example, the authors of [6] analyzed the flow of information in the blogosphere and built a prediction model of information transmission. Sina Weibo and traditional blogs have certain similarities, so we can draw lessons from the spread of information in blogs. Webberley et al. [14] studied the transmission delay and the depth and breadth of information dissemination on Twitter. Their preliminary study of user behavior patterns and forwarding rules has certain reference value.
Some researchers have studied the influence of mentioned locations on information dissemination. For example, Bandari et al. [15] put forward an algorithm to predict whether a news article will be popular on Twitter or trigger a heated discussion on social networking sites. Their paper uses four features: the article category, the degree of objectivity, the geographical names and person names mentioned in the article, and the source of the article. However, that study only gives the effect of popular places on information dissemination and does not take the dependency between the geographical names and the users into account.
In conclusion, we propose a new feature: the dependency between the geographical locations involved in Weibos and the location of the user.

ELM.
Extreme Learning Machine (ELM) was put forward by Huang at Nanyang Technological University in 2004 [16]. It is a simple and effective learning algorithm for single hidden layer feedforward networks (SLFNs). It randomly chooses the input weights and analytically determines the output weights. It provides good generalization ability and very fast learning speed. In "Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks," Huang showed that, under the same classification conditions, ELM is much faster than SVM. Following Professor Huang's previous studies [17,18], we summarize the ELM theory as follows.
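Given $N$ training samples $(x_j, t_j)$, an SLFN with $\tilde{N}$ hidden nodes and activation function $g(\cdot)$ produces the outputs $o_j$. ELM approximates the $N$ samples with zero error, so we have $\sum_{j=1}^{N} \|o_j - t_j\| = 0$; then, we get the following formula:

$$\sum_{i=1}^{\tilde{N}} \beta_i \, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N,$$

where $w_i$ and $b_i$ are the input weights and the bias of the $i$th hidden node and $\beta_i$ is its output weight. The above formula can be written as $H\beta = T$. Then, the process of ELM can be mathematically modeled as the following formula:

$$\min_{\beta} \|H\beta - T\|.$$

Here, $H$ can be expressed as

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}}.$$

Therefore, we get a solution for the parameter $\beta$ as

$$\hat{\beta} = H^{\dagger} T,$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $H$.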
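Based on the above analysis, this learning algorithm, which needs no iterative tuning, can be divided into three steps. The specific process of ELM is summarized as follows.
Step 1. Randomly assign the input weights and the hidden node biases.
Step 2. Calculate the hidden layer output matrix H.
Step 3. Calculate the output weights as $\beta = H^{\dagger} T$, where $H^{\dagger}$ is the Moore-Penrose generalized inverse of H and T is the target matrix.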
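The training procedure summarized above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the ELM source code used in the experiments: the sigmoid activation, the Gaussian random weights, the function names `elm_train` and `elm_predict`, and the toy XOR data are all assumptions made for the example.

```python
import numpy as np

def elm_train(X, T, n_hidden, seed=0):
    """Train a single hidden layer network with the three ELM steps."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign input weights and hidden node biases (assumed Gaussian).
    W = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    # Step 2: compute the hidden layer output matrix H (assumed sigmoid activation).
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # Step 3: output weights from the Moore-Penrose generalized inverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Apply the trained network: same hidden mapping, then the output weights."""
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Toy data (hypothetical): the XOR problem with 4 samples and 20 hidden nodes.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [0.0]])
W, b, beta = elm_train(X, T, n_hidden=20)
pred = elm_predict(X, W, b, beta)
```

Because there are more hidden nodes than samples in this toy case, the pseudo-inverse solution fits the four training targets essentially exactly; no iterative tuning of `W` or `b` takes place.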
Compared with SVM, ELM can be applied directly to many kinds of classification problems. In "Extreme Learning Machine for Regression and Multiclass Classification," Professor Huang showed that SVM obtains suboptimal solutions and requires higher computational complexity [19]. Therefore, ELM has advantages that SVM lacks and has broad application prospects.

User Behavior and Information Dissemination Prediction
In this paper, we analyze people's behavior and information dissemination on Weibo. First of all, we need to get the data from Sina Weibo. The behaviors of users in Sina Weibo are releasing, browsing, commenting, and forwarding. Release and forward behaviors are associated with information dissemination. However, the release behavior is decided by the users themselves and we cannot control it. So our main object of study is the forward behavior of users.
In this section, we introduce the data and features we use and present the information dissemination prediction model we propose. We begin with the dataset description.

Dataset Description.
To collect the Sina Weibo data, we first choose one user and get that user's fans list. Second, for each user in the fans list, we get that user's own fans list. In this way, we obtain a dataset of 96438 users. Sina Weibo users can be roughly divided into three categories: release-active users, forward-active users, and inactive users. If a user has no forward or release activity within 1 month, we consider the user inactive. Because inactive users make no contribution to user behavior and information dissemination prediction, we exclude them, which leaves 89377 users in the dataset. Then, we crawl all the Weibos that these users published between May 1, 2014, and May 31, 2014, and get 564835 Weibos. Among these, 114943 Weibos are related to geographical locations. Most Sina Weibos are written in Chinese, and the geographical locations in them are Chinese locations. So the small number of Weibos that contain foreign geographical locations are treated as having nothing to do with geographical location. We select data from the whole Weibo dataset to build the forward and ignore datasets. Because we cannot observe the ignore behavior directly, we need to define the ignore dataset first. The definition of the ignore dataset is as follows.
Definition 1 (ignore dataset). If user $u$ forwarded a Weibo published at time $t$, the Weibos which were published by the friends of the user within $[t - \Delta, t + \Delta]$ and were not forwarded by the user are the ignore samples. All the ignore samples constitute the ignore dataset.
Users ignore Weibos not only because they do not like them, but also because they are away and do not see them. Restricting the ignore samples to a window around a forwarding time, when the user was evidently online, limits the second case. So we selected 10 minutes, 30 minutes, 1 hour, 2 hours, and 12 hours as Δ. We also studied the influence of the different ignore datasets on the final accuracy. Algorithm 1 is used to find the ignore dataset.
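Definition 1 can be illustrated with a short sketch. It is not the paper's Algorithm 1 itself; the function name `build_ignore_set`, the data layout (Weibo identifiers paired with publish times), and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def build_ignore_set(friend_weibos, forwarded, delta):
    """Collect friends' Weibos published within delta of a forwarded Weibo
    that the user did not forward.

    friend_weibos: list of (weibo_id, publish_time) from the user's friends.
    forwarded: dict mapping weibo_id -> publish_time for Weibos the user forwarded.
    delta: timedelta giving the half-width of the time window.
    """
    ignored = set()
    for t in forwarded.values():
        for weibo_id, published in friend_weibos:
            if weibo_id not in forwarded and abs(published - t) <= delta:
                ignored.add(weibo_id)
    return ignored

# Hypothetical sample: the user forwarded Weibo "a", published at 12:00.
day = datetime(2014, 5, 1)
friend_weibos = [
    ("a", day.replace(hour=12, minute=0)),
    ("b", day.replace(hour=12, minute=20)),
    ("c", day.replace(hour=15, minute=0)),
    ("d", day.replace(hour=11, minute=45)),
]
forwarded = {"a": day.replace(hour=12)}
ignored_30 = build_ignore_set(friend_weibos, forwarded, timedelta(minutes=30))
ignored_10 = build_ignore_set(friend_weibos, forwarded, timedelta(minutes=10))
```

With a 30-minute window, Weibos "b" and "d" become ignore samples; with a 10-minute window, none do, which shows how the choice of Δ changes the size of the ignore dataset.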
To facilitate the extraction of location keywords, we established a province tree to identify place names. Figure 1 shows the structure of the province tree.
As shown in Figure 1, China is divided by position into east China, south China, central China, north China, northwest, southwest, and northeast. Each region contains some provinces, and each province contains a number of cities. With the province tree, we can identify which geographical location a keyword belongs to. We can also get the hierarchy to which the keyword belongs.
In the province tree, we only consider city names and disregard the names of blocks (districts) within a city. This is because, in China, different cities may contain blocks with the same name, so we cannot accurately determine which city a block belongs to.
Our study is based on the above data. In the next section, we introduce the features we use and the corresponding evaluation index.
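The lookup described above can be sketched as a nested dictionary with a reverse index from city to region and province. The tree below is a tiny hypothetical excerpt, not the tree of Figure 1, and the simple substring matching and the names `PROVINCE_TREE`, `build_city_index`, and `locate` are assumptions made for illustration.

```python
# Hypothetical excerpt of a province tree: region -> province -> cities.
PROVINCE_TREE = {
    "East China": {"Jiangsu": ["Nanjing", "Suzhou"], "Zhejiang": ["Hangzhou"]},
    "South China": {"Guangdong": ["Guangzhou", "Shenzhen"]},
}

def build_city_index(tree):
    """Map each city name to its (region, province) pair."""
    index = {}
    for region, provinces in tree.items():
        for province, cities in provinces.items():
            for city in cities:
                index[city] = (region, province)
    return index

def locate(text, index):
    """Return (city, region, province) for every known city name found in text."""
    return [(city, *index[city]) for city in index if city in text]

index = build_city_index(PROVINCE_TREE)
matches = locate("Heavy rain in Nanjing today", index)
```

Only city names are indexed, in line with the decision above to leave out block names that may repeat across cities.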

Feature Description.
In this section, we introduce the features we use. We first introduce the new feature we propose and then the other features.
Figure 1: The structure of the province tree.

The Dependency between the Weibos Involved Geographical Locations and Location of the User.
The geographical locations involved in Weibos have been used as a feature before. However, previous work only considers whether the location name is famous and does not connect it with the locations of users. As the government starts carrying out internet political communication on Weibo, this connection becomes more and more important. Information published by a local government is likely to attract attention in the local and surrounding areas, while users in more distant areas will pay less attention to it. We use the Peking University PKUVIS Weibo visual analysis tool [20] to analyze 150 Weibos, one of which is shown as follows: 我整个人都不好了! ("I feel completely awful!"). From this Weibo, we can extract the location name Nanjing. According to the province tree, it belongs to Jiangsu province. We guess that users in Jiangsu may pay more attention to this Weibo, while users far from Jiangsu may pay less attention. So we count the number of users in every province who forwarded this Weibo. From the province field of the data, we obtain the province of each of these users. Sina Weibo uses codes to represent provinces and cities. Table 1 shows the provinces and their corresponding codes. For convenience, in the following figures, we use the province codes in Table 1 to represent the provinces. Figure 2 shows the number of forwarding users in every province.
As shown in Figure 2, the local users in Jiangsu pay the most attention to this Weibo. Users in locations near Jiangsu (such as Anhui, Shanghai, Zhejiang, and Shandong) also pay much attention to it.
According to probability theory, for the other provinces and cities, the percentage of forwards should follow the same pattern as the percentage of registered users in each province. It is hard to get the percentage of registered users. But in Figure 2 we can see that economically developed provinces, such as Beijing and Guangdong, have higher forward numbers than some underdeveloped areas like Xinjiang and Ningxia. We guess this is because people in developed areas have more network resources and easier access to the website, so the number of users in developed areas may be larger than in underdeveloped ones. The other Weibos also follow this rule.
To represent the development of each province, we use the per capita GDP of each province in 2013. Forward number and per capita GDP are not of the same magnitude, so we normalized these data. Figure 3 shows the normalized forward number. Figure 4 shows the normalized per capita GDP.
From Figures 3 and 4 we can see that, apart from the geographical location mentioned in the Weibo, the forward quantity and the per capita GDP of the other provinces follow the same pattern. For example, in Beijing, Guangdong, and other regions, both figures have a local peak. Only in the province mentioned in the Weibo does this pattern break down, which further shows that the geographical location mentioned in a Weibo has a stronger influence on the users who are close to it. All the Weibos we tested lead to this conclusion. So we use the per capita GDP to represent the percentage of registered users, and the normalized per capita GDP then represents the possibility of a user forwarding. When the province is the geographical location involved in the Weibo, we add 0.5 to its normalized per capita GDP, which means that this geographical location plays a predominant role in the forwarding behavior. The final value represents the dependency between the geographical locations involved in Weibos and the locations of the users.
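The construction of this feature can be sketched as follows. The text does not state which normalization is used, so min-max scaling is an assumption here; the province labels `P1` to `P3` and their values are placeholders rather than real GDP figures, and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale a dict of values to the range [0, 1] (assumed normalization)."""
    lo, hi = min(values.values()), max(values.values())
    return {key: (value - lo) / (hi - lo) for key, value in values.items()}

def location_dependency(user_province, mentioned_province, norm_gdp, bonus=0.5):
    """Normalized per capita GDP of the user's province, plus 0.5 when that
    province is the one mentioned in the Weibo."""
    score = norm_gdp[user_province]
    if user_province == mentioned_province:
        score += bonus
    return score

# Placeholder per capita GDP values for three unnamed provinces.
per_capita_gdp = {"P1": 100.0, "P2": 40.0, "P3": 20.0}
norm_gdp = min_max_normalize(per_capita_gdp)

local_score = location_dependency("P2", "P2", norm_gdp)   # user lives in the mentioned province
remote_score = location_dependency("P3", "P2", norm_gdp)  # user lives elsewhere
```

In this toy case the user in the mentioned province gets 0.25 + 0.5 = 0.75, while the user in the least developed province gets 0.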
Besides the new feature we put forward, other features are widely applied in user behavior analysis and information dissemination prediction. The researchers in [3] selected 4 features to judge user forward behavior: the user's authority, the user's activity, the user's preference, and the user's social relations. However, although the user's authority is relevant to the user's forward behavior, the correlation is weak. The researchers in [4] selected 15 features, but some features are covered by others. For example, the number of a user's fans is already used when computing PageRank. Such redundant features are useless and should not be used in user forward behavior prediction.
Another study, "RT to Win! Predicting Message Propagation in Twitter" [21], divided features into two categories: 7 social features (number of followers, friends, statuses, and favorites, number of times the user was listed, whether the user is verified, and whether the user's language is English) and 7 tweet features (number of hashtags, mentions, URLs, and trending words, length of the tweet, novelty, and whether the tweet is a reply).

Mathematical Problems in Engineering
Summarizing the features in the above and other papers, we selected 5 features to forecast user forwarding behavior: the influence of the user; user release activity and forward activity; the intimacy between users; the interest similarity between user and content or between users; and Weibo content importance. These features are described in detail below.

The Influence of User.
PageRank is commonly used to compute the influence of a user [22]. The PageRank algorithm was originally used in search engines to measure the importance of a specific page relative to other pages. The PageRank formula is

$$\mathrm{pr}_u = \frac{1 - d}{N} + d \sum_{v \in \mathrm{Follower}(u)} \frac{\mathrm{pr}_v}{|\mathrm{Friend}(v)|}. \tag{6}$$

In this formula, $\mathrm{pr}_u$ represents the PageRank value of user $u$, $\mathrm{Follower}(u)$ represents the fans list of user $u$, $\mathrm{Friend}(v)$ represents the collection of users that user $v$ pays attention to, $d$ is the damping coefficient, and $N$ is the total number of users.
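A minimal sketch of this influence computation is given below. It assumes the standard PageRank form in which each fan $v$ of user $u$ passes on its own value divided by the number of users $v$ follows, plus a base term $(1 - d)/N$. The damping value 0.85, the fixed iteration count, the function name, and the three-user graph are assumptions, and users who follow nobody are given no special handling.

```python
def pagerank(followers, d=0.85, iters=50):
    """Iterate the PageRank update over a fans graph.

    followers[u] is the fans list of user u; every user must appear as a key.
    """
    users = list(followers)
    n = len(users)
    # Friend(v): the users that v pays attention to, derived from the fans lists.
    friends = {u: [] for u in users}
    for u, fans in followers.items():
        for v in fans:
            friends[v].append(u)
    pr = {u: 1.0 / n for u in users}
    for _ in range(iters):
        pr = {
            u: (1 - d) / n + d * sum(pr[v] / len(friends[v]) for v in followers[u])
            for u in users
        }
    return pr

# Hypothetical graph: b and c follow a, c follows b, nobody follows c.
followers = {"a": ["b", "c"], "b": ["c"], "c": []}
pr = pagerank(followers)
```

In this toy graph user `a`, who has the most fans, ends with the largest value, and user `c`, who has none, keeps only the base term.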

User Release Activity and Forward Activity.
Because users behave in different ways, user activity can be divided into two aspects: release activity and forward activity. The user release activity is the number of Weibos published over a period of time. We use formula (7) to compute it. The quantities in formula (7) are the number of Weibos published over a period of time, the total number of Weibos, and the unit time, which in general we set to 1 day.
The forward activity is the percentage that forwarded Weibos account for among all the Weibos published in one day. We use formula (8) to compute it from the number of Weibos forwarded on the $i$th day and the number of Weibos released on the $i$th day. The higher the forward activity, the more active the user. Users with high forward frequency play a bigger role in information dissemination.

The Intimacy between the Users.
Because forward behavior in Weibo best reflects the interaction between users, we compute the intimacy between two users as the percentage of Weibos published by the upstream user among the Weibos forwarded by the user. The formula we use is

$$\mathrm{Intimacy}(u, v) = \frac{N_{uv}}{N_u}. \tag{9}$$

In this formula, $N_{uv}$ represents the number of Weibos of user $v$ that appear among the Weibos forwarded by user $u$, and $N_u$ represents the total number of Weibos forwarded by user $u$.
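As a small illustration of this ratio, the sketch below counts how many of a user's forwards come from a given upstream user. The function name `intimacy`, the list-of-authors representation, and the sample data are hypothetical.

```python
def intimacy(forwarded_authors, upstream):
    """Share of the user's forwarded Weibos that were published by `upstream`.

    forwarded_authors: one entry per forwarded Weibo, naming its original author.
    """
    if not forwarded_authors:
        return 0.0  # a user with no forwards has no measurable intimacy
    return forwarded_authors.count(upstream) / len(forwarded_authors)

# Hypothetical history: the user forwarded four Weibos, two of them from user "v".
history = ["v", "w", "v", "x"]
score = intimacy(history, "v")
```

Here the intimacy with user `v` is 2 / 4 = 0.5.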

The Interest Similarity between User and Content or between Users.
Weibos reflect the interests of users. The larger the interest similarity between a user and the content, the greater the chance that the user forwards it. Likewise, the larger the interest similarity between a user and the upstream user, the greater the chance that the user forwards. So we need to compute the interest similarity. Because a user's interests change over time, we analyze only the Weibos released in the last few days. The interest space is extracted from the Weibos, and the comparison process is as follows.
(1) Collect user interest. We select a user $u$ and collect the Weibos that the user published in the last five days. These form the user interest space $I_u = \{m_i, 0 < i < n\}$, where $I_u$ is the interest space of user $u$ and $m_i$ is the $i$th Weibo of user $u$.
(2) Word segmentation. For Weibos in Chinese, we use ICTCLAS, the Chinese lexical analysis system of the Chinese Academy of Sciences, to do the word segmentation [22]. For Weibos in English, we split on spaces. We get the word-level interest space $W_u = \{w_j\}$, where $w_j$ is the $j$th word.
(3) Remove the stop words. We remove the stop words and get the new word-level interest space $W'_u = \{w_j\}$.
(5) For two users, we calculate the similarity of their word-level interest spaces. For a user and a piece of content, we calculate the similarity of the user's interest space and the word set of the content. We use the Jaccard formula to calculate the similarity [23]. The Jaccard formula is

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}, \tag{10}$$

where $A$ and $B$ are the two word sets being compared.

Weibo Content Importance.
Usually, if a Weibo contains significant events or popular information, its forward rate will be high. So the importance of Weibo content can help us analyze Weibo information dissemination. We calculate the importance of a Weibo based on the TF-IDF (term frequency-inverse document frequency) weighting used in text classification [24]. The idea of this algorithm is that the more frequently a word appears in a specific document, the more important the word is, and the less frequently it appears in other documents, the more important it is. We use formula (11) to calculate the importance:

$$\mathrm{TFIDF}(w, m) = n_w \times \log \frac{N}{N_w}. \tag{11}$$

In this formula, $w$ represents a word in the Weibo $m$, $n_w$ represents the number of times $w$ appears in $m$, $N$ represents the number of Weibos in the Weibo set $M$, and $N_w$ represents the number of Weibos in $M$ that contain $w$. The TF-IDF of Weibo $m$ is computed by adding the TF-IDF of all the words in $m$:

$$\mathrm{TFIDF}(m) = \sum_{w \in m} \mathrm{TFIDF}(w, m). \tag{12}$$
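Both measures can be sketched with the standard library alone. The Jaccard function follows the usual definition (size of the intersection over size of the union). The importance function assumes the plain TF-IDF form "word count times log of (number of Weibos / number of Weibos containing the word)", summed over the distinct words of the Weibo; the exact weighting used in the experiments may differ. Function names and the toy word lists are hypothetical.

```python
import math

def jaccard(a, b):
    """Jaccard similarity of two word collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def weibo_importance(words, corpus):
    """Sum of count(w) * log(N / N_w) over the distinct words of one Weibo.

    words: segmented words of the Weibo (stop words already removed).
    corpus: list of word lists, one per Weibo in the set.
    """
    n_docs = len(corpus)
    score = 0.0
    for w in set(words):
        n_w = sum(1 for doc in corpus if w in doc)
        if n_w:  # skip words that never occur in the set
            score += words.count(w) * math.log(n_docs / n_w)
    return score

# Toy data (hypothetical word lists).
corpus = [["rain", "nanjing"], ["rain", "food"], ["music"], ["rain"]]
similarity = jaccard(["rain", "nanjing", "food"], ["rain", "music", "food"])
importance = weibo_importance(["rain", "nanjing"], corpus)
```

In the toy set, the rare word "nanjing" contributes log(4/1) to the importance while the common word "rain" contributes only log(4/3), which matches the idea that words rare in other documents matter more.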

Information Dissemination Prediction Model.
Using the features in Section 3.2, we use ELM to forecast the user forward behavior. From the predicted forward behavior, we forecast the scale of information dissemination.
The forward behaviors in Weibo can be divided into 3 types: forwarding by direct fans, forwarding by indirect fans, and forwarding by non-fans. We count the percentage of each of the 3 forward behaviors for Weibos of different scales. Figure 5 shows the percentage of each of the 3 forward behaviors for scales from 100 to 1500.
We can see from Figure 5 that forward behaviors mainly come from direct fans and indirect fans. The percentage of non-fan users is almost 0, so we ignore the forward behaviors of non-fan users.
When we make the prediction, we start from the Weibo publisher. We traverse the publisher's fans list and predict whether each fan will forward the Weibo. If a fan forwards it, the forwarded number increases by 1, and we then traverse the fans list of that fan. We repeat this iteration until no more users forward the Weibo. The prediction model can be represented by a tree. Figure 6 is a simple example of a prediction model tree.
The gray point in Figure 6 is the publisher of the Weibo. The black points are the users who will forward the Weibo, and the white points are the users who will not. When we make the prediction, we start from the user $u_0$ and traverse its fans list. The fans list contains 3 users, $u_1$, $u_6$, and $u_7$, and $u_1$ is a forwarding point. Thus, the forwarded number increases by 1 and we traverse the fans list of $u_1$. The fans list of $u_1$ contains 2 points, $u_2$ and $u_3$; $u_2$ is not a forwarding point, but $u_3$ is. So the forwarded number increases by 1 and we traverse the fans list of $u_3$. The points in the fans list of $u_3$ are $u_4$ and $u_5$, and neither of them is a forwarding point. So we come to $u_6$. The handling of $u_6$ is similar, and in this way we finally get all the forwarding nodes. We use Algorithm 2 to build the information dissemination prediction model. In this algorithm, we assume that each user forwards the Weibo at most once and that the publisher will not forward the Weibo.
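The traversal described above can be sketched as follows. This is not the paper's Algorithm 2: it walks the fans lists level by level rather than branch by branch so that a jump limit is easy to apply, the `will_forward` callback merely stands in for the trained classifier, and the names, the `max_jumps` parameter, and the choice of forwarding users in the example are assumptions.

```python
def predict_scale(publisher, fans, will_forward, max_jumps=5):
    """Count predicted forwards by expanding fans lists from the publisher.

    fans: dict mapping a user to that user's fans list.
    will_forward(fan, upstream): stand-in for the behavior classifier.
    """
    forwarded = set()
    frontier = [publisher]
    for _ in range(max_jumps):
        next_frontier = []
        for user in frontier:
            for fan in fans.get(user, []):
                # Each user forwards at most once; the publisher never forwards.
                if fan != publisher and fan not in forwarded and will_forward(fan, user):
                    forwarded.add(fan)
                    next_frontier.append(fan)
        if not next_frontier:
            break  # nobody forwarded at this jump, so the spread stops
        frontier = next_frontier
    return len(forwarded)

# Small tree in the spirit of the example: u1 and u3 are assumed to forward.
fans = {"u0": ["u1", "u6", "u7"], "u1": ["u2", "u3"], "u3": ["u4", "u5"]}
forwarders = {"u1", "u3"}
scale = predict_scale("u0", fans, lambda fan, upstream: fan in forwarders)
first_jump_only = predict_scale("u0", fans, lambda fan, upstream: fan in forwarders, max_jumps=1)
```

Only the fans of users predicted to forward are expanded, so the cost depends on the predicted cascade rather than on the whole fans graph.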

Experiments and Results
In this section, we evaluate the prediction performance of ELM. In addition, we compare the results of ELM and SVM both with and without the new feature we proposed. We also test the proposed information propagation prediction model and report its performance.

Users Behavior Prediction.
From the data we crawled from Sina Weibo, we select 133190 forward records as the forward samples. Following Section 3.1, the numbers of ignore samples for each Δ are shown in Table 2.
We use ELM to forecast the forward or ignore behavior of users. The source code of ELM can be obtained from the website (ELM Source Codes: http://www.ntu.edu.sg/home/egbhuang/). We also compare the results of ELM and SVM. The LIBSVM tool is used in this paper, which can be obtained from the website (http://www.csie.ntu.edu.tw/~cjlin/libsvm/). To evaluate the forecast model, we choose evaluation indexes from information retrieval: accuracy, recall, and the F1 value. With 10-fold cross validation, we get the user forward behavior prediction results shown in Tables 3 and 4. Table 3 shows the performance using ELM and Table 4 shows the performance using SVM. We also draw figures to show the details: Figures 7 and 8 show comparison charts of using ELM and using SVM.
As can be seen from Figures 7 and 8, the prediction results with the feature of the dependency between the geographical locations involved in Weibos and the location of the user are better than those without it.
To give a more intuitive comparison, we also show the performance of ELM and SVM in one figure. Because Δ = 30 minutes gives the best performance for both algorithms, we only compare the performance in this case. Figure 9 shows the comparison between ELM and SVM.
We can see that, in both cases, the results obtained by ELM are higher than those obtained by SVM. This indicates that ELM performs better than SVM on this task. We can also see that the new feature brings better performance.
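For reference, retrieval-style evaluation indexes can be computed from true and predicted labels as sketched below. The sketch assumes that the "accuracy" index is the retrieval-style precision of the forward class, which the text does not state explicitly; the function name and the label vectors are hypothetical.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for the positive (forward) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical labels: 1 = forward, 0 = ignore.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high.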

Information Propagation Prediction.
Using the algorithm in Section 3.3 and the prediction results of ELM, we predict the scale of the information propagation. We choose 30000 original Weibos of 15375 users to verify our model. We count the average proportion of user forwards at every jump from the initial release user (jump: the shortest distance from a user to the initial release user). Figure 10 shows the average percentage of users' forwards at each jump. It can be seen from Figure 10 that after 5 jumps the percentage approaches 0. This shows that, in our dataset, almost all the forward behaviors happen within the first five jumps, which indicates that Weibo is a social network in which information spreads widely but not deeply.
Based on this observation, our information propagation prediction stops at the fifth jump, which avoids excessive iteration. Figure 11 shows the prediction accuracy at every jump. The accuracy of jump 1 is the accuracy on the first forward layer of the 30000 Weibos, and likewise for the other jumps.
We can see in Figure 11 that the accuracy of the first jump is the highest. Accuracy decreases as the jump count increases, because the prediction error accumulates from jump to jump. By jump 5, the error has accumulated to a considerable scale, so the accuracy becomes very low.
To determine the scale of information dissemination, we divide the scale by order of magnitude, $10^n$. If the predicted information dissemination scale is of the same order of magnitude as the actual scale, we say the prediction is right. We calculated the average accuracy of the predicted information dissemination scale over the 30000 Weibos. Figure 12 shows the average prediction accuracy of each Weibo.
As can be seen from Figure 12, for different Weibos from different users, the accuracy of our algorithm is around 70%. The errors arise because, for Weibos whose forward depth is close to or more than 5 jumps, the error of our model has accumulated to a considerable scale and lowers the accuracy.
For the selected data, our prediction results are very stable. This indicates that our algorithm is effective.
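The order-of-magnitude criterion can be made concrete with a short sketch. The function names and the sample (predicted, actual) pairs are hypothetical, and treating a scale of zero forwards as its own class is an assumption.

```python
def magnitude(scale):
    """Order of magnitude n with 10**n <= scale < 10**(n + 1); -1 for zero forwards."""
    return len(str(scale)) - 1 if scale > 0 else -1

def scale_accuracy(pairs):
    """Fraction of (predicted, actual) scales that share an order of magnitude."""
    hits = sum(1 for predicted, actual in pairs if magnitude(predicted) == magnitude(actual))
    return hits / len(pairs)

# Hypothetical predicted and actual forward counts for four Weibos.
pairs = [(120, 450), (95, 130), (1500, 1200), (8, 3)]
accuracy = scale_accuracy(pairs)
```

Here three of the four predictions fall in the same power-of-ten band as the actual scale, giving 0.75; the pair (95, 130) misses because 95 and 130 lie on opposite sides of 100.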

Conclusions
The online behavior of Weibo users and the analysis of information dissemination are hot issues nowadays. In this paper, we analyzed the features of Sina Weibo user behavior and predicted information transmission. We used 8 features to analyze user behavior: the dependency between the geographical locations involved in Weibos and the location of the user, the influence of the user, user release activity, user forward activity, the intimacy between users, the interest similarity between user and content, the interest similarity between users, and Weibo content importance. The first of these, the dependency between the geographical locations involved in Weibos and the locations of the users, is the new feature we proposed. We used ELM to predict whether users will forward or ignore a Weibo. Our experimental results show that the proposed feature is effective and that ELM gets better results than SVM. We also tested the performance for different values of Δ in the ignore dataset and found that the performance is best when Δ is 30 minutes, so we use the 30-minute ignore dataset to build the training set. Based on that, we proposed an information propagation prediction model and calculated the scale of the information propagation. The experimental results show that our model performs well.
The features and model proposed in this paper can help businesses and governments. They can use our model to predict the scale of information propagation before they publish a piece of information. If the predicted scale is small, they can use our features to adjust the text. The model and features therefore have high practical value.
However, there is still room for improvement. For example, when considering the size of information dissemination, we do not consider that users may forward their own Weibos or that a user may forward the same Weibo many times. We will take these cases into consideration in the future.

Figure 2: The number of users in every province.

Figure 3: The normalization of forwarding number.

Figure 7: Comparison charts of using ELM.

Figure 10: The average users' forward quantity proportion of each jump.

Figure 12: Average prediction accuracy of each user.
Algorithm 1. Inputs: the set of Weibos published by the friends of the user; the set of Weibos which the user forwarded. Output: the set of Weibos which the user ignored. (1) For any Weibo in the set, read its publish time;

Table 1: Province and its code.

Table 3: User forward behavior prediction results using ELM.