A Hotspot Information Extraction Hybrid Solution of Online Posts’ Textual Data

Online posts have gradually become a major carrier of network public opinion in social media, and social network hotspots are an important basis for the study of network public opinion. Therefore, it is significant to extract hotspots from the textual big data of online posts for monitoring Internet public opinion. However, current hotspot extraction methods focus on user features derived from textual big data that contains spam and low-quality content. Meanwhile, these methods seldom consider the time span of posts and the popularity of users. Accordingly, this article presents a hybrid solution for extracting hotspot information from online posts' textual data. Firstly, a filtering strategy is designed to obtain higher-quality textual data. Secondly, the topic hot degree is defined by considering the average number of replies and the popularity of the participant. Thirdly, an improved co-word analysis technology is used to search for posts on the same topic, and a bisecting k-means clustering algorithm using repliers' popularity and key posts is designed for studying and monitoring the hotspots of online posts in a valid big data environment. Finally, the proposed algorithms are verified in experiments by extracting the hotspots of online posts from the dataset. The results show that the data filtering strategy helps to obtain more valuable information and decreases the computing time. The results also demonstrate that, compared with traditional methods, the proposed solution helps to obtain hotspots and the hot degree can reflect the trend of online posts.


Introduction
With the rapid development of mobile communications and networks, the Internet increasingly integrates into our life. It is reported that there are now more than 4 billion Internet users around the world. Most Internet users spend an average of six hours surfing the Internet, and 3 billion people now use social media, such as Twitter, blogs, Bulletin Board System (BBS), and podcasts [1,2]. It is known that online posts have gradually become an important tool in social media for the exchange of information. An increasing amount of public opinion is now spread by social media, especially through BBS [3][4][5]. Since hotspots directly reflect public opinion, studying and monitoring the hotspots of social media becomes more important for public affairs.
Social media has become one of the most important and popular carriers and distributors of the current online public opinion [6,7]. Compared to the traditional public opinion channels, online posts have some unique features, such as a wider audience range, greater influence, faster propagation speed, and large amount of data [8][9][10]. For obtaining and monitoring public opinion hotspots, an increased number of studies focus on this field from different perspectives. In general, the current studies mainly use natural language processing, data mining technologies, machine learning, and other methods to monitor hotspots and explore propagation [11][12][13][14].
Currently, text data is still an important medium for information dissemination on social networks [15,16]. To study complex dynamics in social networks, the extraction of hotspots from massive textual data is one of the important steps. On the one hand, current hotspot extraction methods simply collect user-related features and are mostly based on textual big data with spam, irrelevant, and low-quality content. In social media, there is a large amount of spam information [17][18][19], such as paid posters and fake replies, as shown in Figure 1. Advertising posts and replies are a good example in BBS. User feature information derived from such invalid or incomplete data can be very different from the real one, especially for hotspots and public opinions. On the other hand, the main methods seldom take into account the time span of posts and the popularity of repliers. Firstly, the time span of posts is a significant factor in hotspot extraction, as hotspots of social networks are the collective action of users in a short time (for example, a collective response to an event in BBS). Secondly, it is reported that popular users (repliers and main posters) play a significant role in Internet public opinion [20]. However, few studies address these problems. Due to the complexity and features of social media such as BBS, monitoring of public opinion hotspots still faces the following challenges:
(1) How to obtain more valuable data by filtering a large amount of spam textual data
(2) How to find the key posts according to the association among multiple posts for the same topic
(3) How to search real hotspots by considering the valuable repliers and key posts
Accordingly, to solve these problems, a hybrid solution for extracting hotspot information from online posts' textual data is proposed based on the features of users in social networks. The solution contains three main steps. Firstly, a textual data filtering strategy is used to obtain a more valid dataset.
An improved co-word analysis technology is used to search for posts on the same topic. Secondly, a bisecting k-means clustering algorithm based on poster popularity and key posts is proposed to obtain the hotspots of online posts. Then, the hot degree is proposed to search for the real hotspots. The proposed methods are implemented in a real experiment, where the results demonstrate the effectiveness of the solution.
The rest of the paper is structured as follows. Section 2 discusses related studies on Internet public opinion and current challenges. Section 3 introduces hotspot monitoring and public opinion communication characteristics. Section 4 introduces the cluster hotspot monitoring based on PR values and bisecting k-means algorithms. Section 5 presents the results of an experiment using the proposed methods and our dataset. Section 6 concludes and discusses the paper.

Related Work
In this section, we present existing studies on the public opinion analysis of BBS and on monitoring hotspots. These studies are used as a basis for our work. We review related research from two aspects: network public opinion and hotspots.

Public Opinion.
In [21], natural language processing and machine learning techniques are used to interpret sentimental tendencies related to users' opinions and predict real events. In [22], a public opinion dynamics model for an online-offline social network context is provided and conditions to form a consensus in the proposed model are analyzed. In [23], the authors propose a method to recognize network public opinion leaders by using Markov logic networks, and a recognition system is designed and implemented. In [24], a cross-network public opinion spreading model is created in a combined social network environment. Two network nodes are assumed in this paper. In [25], the author constructs a dictionary monitoring sentiment computing model using text words and labels as the input parameters. In [26], a new method is provided for sentiment computing for news and events by constructing a word emotion association network. The authors provide a word emotion computation method to obtain initial words.
These studies mostly focus on public opinion based on the assumption that the dataset is always valid.

Hotspots. Zhao et al. [27] present a Social Sentiment Sensor (SSS) system on Sina Weibo to detect daily hotspots and analyze sentiment distributions related to these topics. Clusters of topics that describe the same issue are formed and ranked based on popularity to exploit the resulting hotspots. In [28], the authors use a clustering method to obtain candidate topics on BBS and the evolution theory to calculate the heat of candidate topics and obtain hotspots based on it. Hao and Hu [29] propose a method based on a baseline model to solve the topic drift problem of network BBS. Liu and Li [30] adopt text mining approaches based on a vector space model and k-means clustering to group Internet public opinion hotspots. Li [31] uses an emotion analysis technology to analyze the emotional polarity of network BBS Chinese texts and a k-means algorithm and the SVM to cluster the contents of posts considering each class as a hot topic. Chen et al. [32] design a similarity analysis algorithm of Internet public opinion based on information entropy, which can cluster and identify hotspots and crisis events.
The above studies provide a useful basis for this study, but there are some gaps that need to be filled, especially regarding the public opinion analysis of online posts. The unresolved issues are related to the validity and usefulness of data and the popularity of posters when used in the clustering of hotspots. Based on the above research, a data filtering strategy is introduced to improve the quality of data. Improved co-word analysis and bisecting k-means clustering algorithms are designed using time spans and popularity to obtain more accurate results.

Mathematical Model of Online Post
In this section, we first build a mathematical model of online posts based on their characteristics and then use it to study hotspots.

Scientific Programming
Let $S_{all}$ be the set of all posts during period $t$. Let $S_{valid}$ and $S_{invalid}$ be the sets of valid and invalid posts during period $t$. So, $S_{all}$ can be represented as follows:

$$S_{all} = S_{valid} \cup S_{invalid}. \tag{1}$$

Assume that there are $m$ valid posts, so $S_{valid}$ can be expressed as $S_{valid} = \{s_{valid}^{1}, s_{valid}^{2}, \ldots, s_{valid}^{m}\}$. The topics in period $t$ consist of the valid topics set $T = \{T_1, T_2, \ldots, T_n\}$ and the set $T_{similar}$, where $n$ is the size of the valid topics set $T$ and $T_{similar}$ is similar to $T$ on the topic set in period $t$. In other words, $T_{similar}$ and $T$ have similar keywords. It is known that all posts ($S_{all}$) during a period cover multiple topics; therefore, obtaining the posts relevant to one topic is the basis for extracting the hotspots. Each topic $T_i \in T$ corresponds to the keywords set $X_{all}(T_i)$, which also contains the similar keywords' set ($X_{all}(T_i) = X(T_i) + X_{similar}(T_i)$). The valid keywords set has $n_i$ valid keywords, and we can assume that its keyword set is $X(T_i) = \{x_1, x_2, \ldots, x_{n_i}\}$.

We use the notation $s \sim T_i$ to indicate that post $s$ ($s \in S_{all}$) is relevant to $T_i$. That is, $s$ contains the keywords of $T_i$. We can use the following equation to define this relationship between a post and a topic:

$$s \sim T_i \iff \exists\, x \in X_{all}(T_i) \text{ such that } s \text{ contains } x. \tag{3}$$

According to (3), the relevant posts are the posts whose contents contain the keywords of the topic. Therefore, we can obtain the relevant posts set $S_{all}(T_i)$ of topic $T_i$ as

$$S_{all}(T_i) = \{\, s \mid s \in S_{all},\ s \sim T_i \,\}. \tag{4}$$

Definition 1. The time span of a post $s$ is counted in days from its first reply to its last reply, where $time_{f_0}$ and $time_{f_{end}}$ represent the time of the first and last reply of post $s$ during a certain time, respectively, and $\mathrm{day}(\cdot)$ is the number of days disregarding hours and minutes.
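The relevance relation (3) and the relevant post set (4) can be sketched in a few lines of Python. This is only an illustration under the assumption that a post is relevant as soon as it contains at least one keyword of the topic; the `Post` class, its fields, and the sample data are hypothetical and not taken from the dataset.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Post:
    post_id: str  # hypothetical identifier
    text: str     # textual content of the post


def is_relevant(post: Post, keywords: set[str]) -> bool:
    """Relation (3), assuming one matching keyword is enough."""
    return any(kw in post.text for kw in keywords)


def relevant_posts(all_posts: list[Post], keywords: set[str]) -> list[Post]:
    """Equation (4): the subset of S_all that is relevant to one topic."""
    return [p for p in all_posts if is_relevant(p, keywords)]


posts = [
    Post("s1", "new computer game release this week"),
    Post("s2", "exam schedule for the spring term"),
    Post("s3", "cheap tickets, click here"),
]
# X_all(T_i) = X(T_i) + X_similar(T_i)
x_all = {"game", "release"} | {"gaming"}
print([p.post_id for p in relevant_posts(posts, x_all)])  # -> ['s1']
```

Plain substring matching is used here for brevity; word segmentation would be needed for Chinese BBS text.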
Definition 2. The reply number of a post is the total number of replies it receives, denoted $TR(s)$ for post $s$; $|F|$ denotes the total number of first replies.
Definition 3. Post participants ($p$) are the post creator and the repliers. Let $P(s_i)$ be the participation number of post $s_i$, which is counted over $P_{first}(s_i)$ and $P_{second}(s_i)$, the first repliers and second repliers, respectively.
The popularity of a participant $p$ is determined by $fr$, $N$, and $TSP$, where $fr$ is the frequency of the posts of participant $p$, $N$ is the total number of the discussion posts of $p$, $TSP$ is the total number of repliers to all discussion posts of $p$, and $a$ and $b$ are the coefficients corresponding to the frequency and the total number of repliers, respectively.
We use different values to denote different post values for a topic. The value of a post is mainly determined by two criteria: the average number of replies and the popularity of the participant. Accordingly, the value of a post is calculated by equation (10), where $\alpha$ and $\beta$ are the coefficients corresponding to the average number of replies and popularity, respectively. Based on the above discussion, the topic hot degree (HD) $\mathrm{hot}(T)$ is calculated by equation (11), where $S(T)$ is the valid post set about topic $T$ and $\vartheta$, $\theta$ are the coefficients of the total number of posts and post values, respectively.
Definition 6. A hotspot is the topic that attains the maximum value of the hot degree. For the topic set $T = \{T_1, T_2, \ldots\}$, the problem of hotspot search becomes

$$\arg\max_{T_i \in T} \mathrm{hot}(T_i), \tag{12}$$

where $S(T_i)$ is the valid post set of topic $T_i$ and the constraint conditions in equation (12) restrict the value scope of the post, topic, and keywords, respectively.

Definition 7. A hot post is the post with the maximum post value. The hot post $s_{max}$ of topic $T_i$ can be expressed as follows:

$$s_{max} = \arg\max_{s \in S(T_i)} \mathrm{value}(s). \tag{13}$$

It is clear that the main problem related to formulas (12) and (13) is to obtain valid posts, replies, and keywords.
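To make the post value, the topic hot degree, and the two arg max searches concrete, the sketch below assumes that both the post value and the hot degree are plain weighted sums of the quantities named in the text. That functional form is an assumption made for illustration, as are the `PostStats` class, the coefficient values, and the sample numbers; the popularity of the participant is taken as a precomputed input.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class PostStats:
    post_id: str
    avg_replies: float  # average number of replies of the post
    popularity: float   # popularity of the post's participant (precomputed)


def post_value(s: PostStats, alpha: float, beta: float) -> float:
    # ASSUMPTION: weighted sum of the two criteria named in the text.
    return alpha * s.avg_replies + beta * s.popularity


def hot_degree(posts: list[PostStats], alpha: float, beta: float,
               vartheta: float, theta: float) -> float:
    # ASSUMPTION: weighted sum of the post count and the total post value.
    total_value = sum(post_value(s, alpha, beta) for s in posts)
    return vartheta * len(posts) + theta * total_value


def hotspot(topics: dict[str, list[PostStats]], **coef: float) -> str:
    """Definition 6: the topic with the maximum hot degree."""
    return max(topics, key=lambda t: hot_degree(topics[t], **coef))


def hot_post(posts: list[PostStats], alpha: float, beta: float) -> PostStats:
    """Definition 7: the post with the maximum post value."""
    return max(posts, key=lambda s: post_value(s, alpha, beta))


topics = {
    "T1": [PostStats("s1", 4.0, 2.0), PostStats("s2", 1.0, 0.5)],
    "T2": [PostStats("s3", 2.0, 1.0)],
}
coef = dict(alpha=0.5, beta=0.5, vartheta=0.1, theta=1.0)
print(hotspot(topics, **coef))                   # -> T1
print(hot_post(topics["T1"], 0.5, 0.5).post_id)  # -> s1
```

With these sample numbers, the hot degrees are 3.95 for T1 and 1.6 for T2, so T1 is selected as the hotspot.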

Spam Data Filtering Mechanism and Improved Cluster Hotspot Monitoring Algorithm
To determine the hotspots of BBS, we use a filtering mechanism to obtain more valuable data and an improved cluster hotspot monitoring algorithm to find hotspots. We focus on text data filtering, extracting keywords, constructing the common word matrix, and searching the hotspots and hot posts. The main process involves the following steps: the identification of post spamming and fake replies to increase the post and reply values, the application of a TextRank-based keyword extraction algorithm to calculate the PageRank (PR) values of the candidate keywords, the determination of the keywords based on their PR values, and the construction of the co-word matrix for these keywords, as shown in Figure 3. As a result, we determine the hotspots by sorting the above results.
Rule 1 (degree of participation, DoP): the degree of participation of a post is based on the participant's reply number and the total reply number of the post, and it is constrained by equation (14), where $\delta$ is the DoP constant.
Rule 2 (minimum number of replies, MNR): the number of replies of a post is constrained by equation (15), where $\varepsilon$ is the MNR constant. We use Rules 1 and 2 to filter out paid or spam posts. It is known that, in a real BBS, there are many fake replies that are not related to the topic, such as advertising. Such replies must be deleted from the post set as well.
Rule 3 (lexical filtering): the predefined vocabulary set for topic $T_i$ is denoted as $A = \{a_1, a_2, \ldots\}$. If a reply text $Text_{reply}(f_{ij})$ does not contain an element of $A$, it is considered invalid and is deleted from the replies set. The rule for getting the valid reply set $F$ can be described as follows:

$$F = \{\, f_{ij} \mid f_{ij} \in F_{all},\ Text_{reply}(f_{ij}) \rhd A \,\},$$

where $Text_{reply}(f_{ij}) \rhd A$ means the reply text contains a predefined vocabulary element of $A$, $F_{all}$ is the set of all replies for the topic $T_i$, and $f_{ij}$ is the $j$th reply text of topic $T_i$.
Algorithm 1 shows the details of the post-text filtering for the topic $T_i$. In Algorithm 1, the text data filtering can be mainly divided into the following steps. Firstly, the relevant post set of $T_i$ is obtained with equations (3) and (4). Then, according to Rules 1 and 2, the degree of participation and the minimum number of replies are employed for deleting the invalid posts that violate the constraints of (14) and (15). Moreover, using Rule 3, the invalid replies are selected and removed from the reply set $F_{all}$. Finally, the valid post and reply sets ($S$, $F$) are returned.
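The three filtering rules can be sketched as follows. The exact forms of constraints (14) and (15) are not reproduced here, so the sketch makes two labeled assumptions: the degree of participation is read as the largest share of a post's replies written by a single participant (which must not exceed δ), and the MNR rule requires at least ε replies. The `Post` and `Reply` classes and the sample data are hypothetical.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Reply:
    author: str
    text: str


@dataclass
class Post:
    post_id: str
    replies: list[Reply] = field(default_factory=list)


def passes_rule1(post: Post, delta: float) -> bool:
    # ASSUMPTION: DoP = largest share of replies written by one participant.
    if not post.replies:
        return False
    top = Counter(r.author for r in post.replies).most_common(1)[0][1]
    return top / len(post.replies) <= delta


def passes_rule2(post: Post, epsilon: int) -> bool:
    # ASSUMPTION: a valid post has at least epsilon replies.
    return len(post.replies) >= epsilon


def valid_replies(post: Post, vocabulary: set[str]) -> list[Reply]:
    # Rule 3: keep replies that contain at least one element of A.
    return [r for r in post.replies if any(a in r.text for a in vocabulary)]


def filter_posts(posts: list[Post], delta: float, epsilon: int,
                 vocabulary: set[str]) -> list[Post]:
    """Apply Rules 1 and 2 to posts, then Rule 3 to the surviving replies."""
    kept = [p for p in posts if passes_rule1(p, delta) and passes_rule2(p, epsilon)]
    return [Post(p.post_id, valid_replies(p, vocabulary)) for p in kept]


posts = [
    Post("s1", [Reply("u1", "great game"), Reply("u2", "which level?"),
                Reply("u3", "buy cheap coins")]),
    Post("s2", [Reply("u9", "ad"), Reply("u9", "ad"), Reply("u9", "ad")]),
    Post("s3", [Reply("u4", "game over")]),
]
result = filter_posts(posts, delta=0.5, epsilon=2, vocabulary={"game", "level"})
print([(p.post_id, len(p.replies)) for p in result])  # -> [('s1', 2)]
```

In this toy run, s2 is dropped because one participant wrote all replies, s3 is dropped for having too few replies, and the advertising reply in s1 is removed by the lexical rule.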

TextRank-Based Keyword Extraction Algorithm.
Based on Section 4.1, we can obtain valid posts and replies by adopting the filtering mechanism. In this section, we further extract keywords based on their text ranks. The main steps are as follows.
Step 1: we divide the text of a reply into a word list. Then, we order the words in the list. Namely, $list(V) = [v_1, v_2, \ldots]$.
Step 2: after filtering the elements of $list(V)$ according to the following rules, we obtain the list of candidate keywords.
Step 3: we use the following synonym processing rule to build candidate keywords. Rule 4 (synonym processing): let $C$ be the synonym keywords set, $C = \{(c_i(main), C_i(syn))\}$, where $c_i(main)$ and $C_i(syn)$ are the main word and its synonym set, respectively. If a word is a synonym, we use the main keyword to replace it and then merge the same words and build the candidate keywords ($list(X)$).
Step 4 (build candidate keyword map): the candidate keyword map is $G = (i, list(i))$, where $i$ is a candidate keyword, $list(i)$ is the set of words co-existing with $i$ in the window, and, for a word $j$ in $list(i)$, the co-occurrence number between $i$ and $j$ is denoted as the weight $w_{ij}$.
Step 5 (iterate operation): we set the number of iterations ($L$) and, according to the PageRank algorithm [33,34], update the PR values as

$$PR(i) = (1 - d) + d \sum_{j \in list(i)} \frac{w_{ji}}{\sum_{l \in list(j)} w_{jl}}\, PR(j),$$

where $PR(i)$ refers to the PR value of keyword $i$, $j$ denotes the keywords co-existing with $i$, $l$ denotes the keywords co-existing with $j$, and $d$ is the damping coefficient.
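Steps 4 and 5 can be illustrated with a small, dependency-free sketch. It assumes the standard weighted TextRank update with damping coefficient d, a symmetric co-occurrence window, and an initial PR value of 1 for every keyword; the function names, the window size, and the toy word list are illustrative choices, and the filtering and synonym rules of Steps 2 and 3 are assumed to have been applied already.

```python
from __future__ import annotations

from collections import defaultdict


def cooccurrence_graph(words: list[str], window: int = 2) -> dict[str, dict[str, int]]:
    """Step 4: w[i][j] counts how often i and j co-occur within the window."""
    w: dict = defaultdict(lambda: defaultdict(int))
    for pos, i in enumerate(words):
        for j in words[pos + 1: pos + window + 1]:
            if i != j:
                w[i][j] += 1
                w[j][i] += 1
    return {i: dict(nbrs) for i, nbrs in w.items()}


def text_rank(w: dict[str, dict[str, int]], d: float = 0.85,
              iterations: int = 50) -> dict[str, float]:
    """Step 5: repeat the weighted PageRank update a fixed number of times."""
    pr = {i: 1.0 for i in w}
    out_weight = {j: sum(w[j].values()) for j in w}
    for _ in range(iterations):
        pr = {
            i: (1 - d) + d * sum(w[j][i] / out_weight[j] * pr[j] for j in w[i])
            for i in w
        }
    return pr


words = ["game", "update", "game", "server", "game", "update"]
graph = cooccurrence_graph(words, window=1)
scores = text_rank(graph)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)  # -> ['game', 'update', 'server']
```

The top-ranked words by PR value would then be taken as the keywords of the topic.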

Common Word Matrix for Obtaining the Same Hotspots and Hot Posts. Common word matrix: $n$ keywords are selected according to their PR values (Section 4.2). The keyword set is $W = \{w_1, w_2, \ldots, w_n\}$. Each entry of the co-word matrix corresponds to the semantic distance between two keywords. The semantic distance $dist(w_i, w_j)$ between two keywords $(w_i, w_j)$ is computed from $count(w_i, w_j)$, which represents the number of co-occurrence events between the keywords $(w_i, w_j)$. The smaller the semantic distance between two keywords is, the more likely the two keywords belong to the same hotspot. Therefore, the common word matrix ($CA$) can be represented as

$$CA = \big[\, dist(w_i, w_j) \,\big]_{n \times n}.$$

Searching the hotspots and hot posts: the common word matrix $CA$ is transformed into a point set.
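The construction of the common word matrix can be sketched as below. The sketch counts co-occurrence per post and, purely as an illustrative assumption, converts the count into a distance with 1 / (1 + count), so that frequently co-occurring keywords end up close together; the distance formula actually used in the paper may differ. The function names and sample documents are hypothetical.

```python
from __future__ import annotations

from itertools import combinations


def cooccurrence_counts(docs: list[set[str]],
                        keywords: list[str]) -> dict[tuple[str, str], int]:
    """count(w_i, w_j): number of documents containing both keywords."""
    return {
        (a, b): sum(1 for doc in docs if a in doc and b in doc)
        for a, b in combinations(keywords, 2)
    }


def coword_matrix(docs: list[set[str]], keywords: list[str]) -> list[list[float]]:
    """CA[i][j] = dist(w_i, w_j); keywords are assumed to be unique."""
    counts = cooccurrence_counts(docs, keywords)
    n = len(keywords)
    ca = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        c = counts[(keywords[i], keywords[j])]
        # ASSUMPTION: distance decreases as the co-occurrence count grows.
        ca[i][j] = ca[j][i] = 1.0 / (1.0 + c)
    return ca


docs = [{"game", "server", "lag"}, {"game", "server"}, {"exam", "score"}]
keywords = ["game", "server", "exam"]
ca = coword_matrix(docs, keywords)
for row in ca:
    print([round(v, 3) for v in row])
```

Here "game" and "server" co-occur twice and therefore get the smallest distance, while "exam" stays at the maximum distance of 1 from both.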
Then, all the points are treated as the first cluster, and the first cluster is divided into two parts. In each round, the cluster whose split minimizes the SSE (sum of squared errors) value is selected and divided into two new clusters. This loop continues until the number of clusters equals the predefined number $K$. The hotspots and keywords are obtained based on the above steps. Then, we can obtain the hot post using equation (10). By using the strategy explained in equation (11), we can get the topic hot degree for every topic. Finally, the hotspots and hot posts can be identified by sorting.
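The bisecting loop described above can be sketched without external libraries. The sketch takes the point set as given (how CA is mapped to points is not specified here), tries a plain 2-means split on every cluster, and keeps the split that yields the lowest total SSE. The deterministic initialization (first point and the point farthest from it) is an illustrative choice, not necessarily the one used in the paper.

```python
from __future__ import annotations


def mean(points):
    return tuple(sum(c) / len(points) for c in zip(*points))


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def sse(points):
    c = mean(points)
    return sum(sq_dist(p, c) for p in points)


def two_means(points, iterations=20):
    """Split one cluster in two with plain 2-means (deterministic start)."""
    c1 = points[0]
    c2 = max(points, key=lambda p: sq_dist(p, c1))
    left, right = list(points), []
    for _ in range(iterations):
        left = [p for p in points if sq_dist(p, c1) <= sq_dist(p, c2)]
        right = [p for p in points if sq_dist(p, c1) > sq_dist(p, c2)]
        if not left or not right:
            break
        c1, c2 = mean(left), mean(right)
    return left, right


def bisecting_kmeans(points, k):
    clusters = [list(points)]
    while len(clusters) < k:
        best = None
        for idx, cluster in enumerate(clusters):
            if len(cluster) < 2:
                continue
            left, right = two_means(cluster)
            if not left or not right:
                continue  # cluster cannot be split (e.g. identical points)
            others = sum(sse(c) for n, c in enumerate(clusters) if n != idx)
            total = others + sse(left) + sse(right)
            if best is None or total < best[0]:
                best = (total, idx, left, right)
        if best is None:
            break  # nothing left to split
        _, idx, left, right = best
        clusters[idx:idx + 1] = [left, right]
    return clusters


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (20.0, 0.0), (21.0, 0.0)]
clusters = bisecting_kmeans(points, 3)
print(sorted(len(c) for c in clusters))  # -> [2, 2, 3]
```

Each resulting cluster groups keywords that are close in semantic distance and is treated as one candidate hotspot.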

Experimental Results
Experiments were conducted to evaluate the performance of the proposed algorithm using a real dataset. The results of the experiment are used to analyze the proposed approach. This section covers the simulation parameters, setup, and results.

Dataset and Experiment Setup.
Dataset: the dataset is gathered from three online post websites (W1 (Baidu Tieba post): https://tieba.baidu.com; W2: https://bbs.tianya.cn; W3: http://www.xici.net). The three websites are among the most famous online BBS platforms in China and had more than 150 million active users in 2020. The post textual data was collected from these BBS and covers the whole year of 2018. Figure 4 shows a screenshot of an online post of W1, which is a classical online community post based on textual data.
To test the proposed strategies, we select three typical subjects ("Computer game" (S1), "Exam" (S2), and "Nanfang College" (S3)) from the above BBS websites. The data was obtained using a crawler. Meanwhile, data visualization software was designed for analyzing the textual data of the BBS, and the data filtering algorithm was used in the software, as shown in Figure 5. The dataset obtained has 16,373 posts and 100,197 replies from January 1, 2018, to December 31, 2018. The parameters of the experiment are shown in Table 1.
Algorithm 1: Post-text filtering for topic T_i.
Input: S_all, ε, A, δ
Output: valid post and reply sets (S, F)
Compute the relevant post set S_all(T_i) of T_i // according to equations (3) and (4)
for i = 1 : |S| // degree of participation filtering
    Calculate P(s_i)

Results' Analysis.
Valid posts and replies: the proposed data filtering mechanism is used to obtain valid data from the dataset. Figure 6 shows the results of the valid posts and replies of the above three subjects (S1, S2, S3). It is well known that filtering strategies can effectively delete spam posts and replies. Figure 6(a) compares the results of our filtering rules with the raw data for the posts of the different subjects.
Compared with the unfiltered raw data in S1, our mechanism decreases the number of posts by 109 and 513 by using Rules 1 and 2, respectively. Accordingly, by using our filtering rules, more than 13% of the posts are identified as invalid. Figure 6(b) shows the results of filtering the replies of online posts in the different subjects by using Rule 3. Similarly, the methods can effectively filter out invalid replies. In particular, Rule 3 filters out more than 30% of the replies as invalid. The results demonstrate that the proposed filtering mechanisms can decrease the number of invalid posts and replies. Also, the filter reduces the size of the datasets and improves the efficiency of searching for hotspots. Furthermore, the results show that the proposed data filtering algorithm affects posts and replies differently across subjects. In other words, the larger the scope of the subject, the larger the number of posts and replies, and subjects with a wide scope are more likely to contain spam posts and replies.
Precision: to verify the precision of the data filtering algorithm, a part (10%) of the raw data of subject S3 is selected. Then, these BBS data are filtered manually and by the proposed data filtering algorithm, respectively. Figure 7 shows the precision results for the posts and replies of subject S3 using the proposed method on the different BBS websites. The results show that the precision of filtering posts is more than 92%, and the precision of filtering replies is larger than 85%. The results demonstrate that the proposed data filtering algorithm achieves good precision in filtering spam posts and replies.
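The comparison between manual and algorithmic filtering can be expressed as a short computation. The sketch assumes the usual definition of precision, that is, the share of items flagged by the algorithm that the manual filtering also marks as spam; the function name and the sample identifiers are hypothetical.

```python
def precision(flagged_by_algorithm: set, flagged_manually: set) -> float:
    """Share of algorithm-flagged items that the manual filtering confirms."""
    if not flagged_by_algorithm:
        return 0.0
    confirmed = flagged_by_algorithm & flagged_manually
    return len(confirmed) / len(flagged_by_algorithm)


algorithm_spam = {"s1", "s2", "s3", "s4"}
manual_spam = {"s1", "s2", "s3", "s9"}
print(precision(algorithm_spam, manual_spam))  # -> 0.75
```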
Computing time: computing time is an important metric to evaluate the performance of the data filtering algorithm. Therefore, the computing time needed to collect the number of users in the different subjects is given for the raw and filtered data, as shown in Figure 8. S1 takes the most computing time among the three subjects, and the proposed filtering decreases the computing time by more than 15%. In other words, the filtered data needs less time to count the number of users than the raw dataset in all subjects. The results show that the presented strategy saves computing time by using the data filtering algorithm.
Hot degrees: we use hot degrees to search for hotspots and hot posts. When the hot degree of a post reaches 3, we consider it a hotspot. After calculating hot degrees and searching for hotspots, five hotspots were selected based on their hot degrees. Hot degrees of different topics can be obtained using the hot degree calculation method. Meanwhile, the maximum numbers of posts and repliers of the same five hotspots are calculated on the same dataset. The results of the different metrics are shown in Figure 9. Figure 9(a) provides the hot degrees of the different topics from the 90th to 105th days. The values of the hot degrees of the five topics are 3.6, 5.6, 5.6, 11.75, and 5.08. It is noticeable that topic 4 is the hottest topic during this period. Figures 9(b) and 9(c) show the total numbers of posts and repliers in the same time period. It can be seen that topic T1 is the hotspot using the different degree, and it has the highest values of the three metrics. Namely, the hot degree can reflect the hotspots of online posts.
The values of the hot degrees on subject S1 are obtained from the datasets of the different websites. Figure 10 shows the hot degree results from the 5th to 235th days related to topic 1. From Figure 10, it can be seen that topic 1 has three peaks at the 60th, 120th, and 190th days, and the hot degrees of the three websites follow the same trend during the monitored period. The proposed hot degree can directly reflect fluctuation trends. Specifically, the hot degree can demonstrate the trend in terms of repliers and users, as our strategies merge post users and replies. In summary, the proposed method can effectively solve the social media hotspot problem.

Conclusion
Online posts have become public platforms for expressing personal opinion, so monitoring them and searching for online hot topics have gained more significance. Considering the weight of different users, the extraction of hotspots from massive textual data containing spam has become one of the important bases for studying the public opinion of social networks. By collecting and analyzing text information on online posts, current hotspots can be obtained. This article adopts a data filtering mechanism, common words, and clustering technology for online hotspot search, using the time span, poster popularity, and PR values. Then, the hot degree is used to evaluate the hotspots of online posts based on the number of replies and the popularity of the participant. The proposed methods are implemented and applied to a BBS dataset. The results show that the proposed method can effectively filter out invalid data, compress datasets, save computing time, and improve performance. At the same time, the results demonstrate that the proposed method and hot degree can also reflect changes in the trend of the hotspots of online posts.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare no conflicts of interest.