Research on Multifeature-Based Superposter Identification in Online Learning Forums



Introduction
With the growth of online learning and remote education, discussions in online forums have become an increasingly effective way to facilitate learning. Through online messages exchanged with teachers, learners can solve problems and ease the emotional loneliness of learning. Previous research has shown that opinion leaders play an important role in online learning and have a positive effect on interactions [1,2]. Their posts may significantly help themselves and others to learn. To differentiate them from the opinion leaders of social networks, this paper terms them superposters in learning forums. At present, research on superposter identification in online learning forums is almost nonexistent, unlike research on opinion leaders in the traditional sense, where much work exists [3][4][5][6][7][8][9][10]. In the context of online learning forums, superposters are users who actively post high-quality information, which may help learners to solve problems and promote learning [11]. Because the discourse environments differ, the superposters of online learning forums differ from popular opinion leaders in social networks. Opinion leaders in social networks mainly spread information via the Internet and thereby influence the opinions of information receivers. According to these descriptions, superposters and opinion leaders have both similarities and differences. Both are active in interactions; superposters aim to boost cooperative study, whereas opinion leaders try to sway public opinion. How do we identify superposters among thousands of online learners? Previous similar research was based on social online communities and applied in the fields of society and economy, but little was based on the forums of online learning platforms and applied in the field of education [5].
Although Reppel [12] asserted an application in education, that research was mainly conducted on blogs, for identifying opinion leaders in online learning communities. Through analysis of authentic online learning forums and the characteristics of superposters, this paper obtained three points that distinguish superposters from ordinary learners, on which a model framework for superposter identification in online learning forums was constructed. The framework considers both the network interaction structure of learners and the discourse features of posts, so as to better identify superposters. Experiments showed that combining these features was effective in identifying superposters in online learning forums.
There are two main contributions of this work: one is a new framework for identifying superposters in learning forums, and the other is evidence, from experiments on a real online learning forum corpus, that the framework is useful for identifying superposters in online learning forums.
In the following, the related work will be reviewed in Section 2, the superposter identification framework will be detailed in Section 3, the experimental design and result will be analyzed in Section 4, and discussion and summary will be made in full in Section 5.

Related Work
Opinion leaders play a significant role in social networks. As a result, identifying opinion leaders in social networks has attracted great attention from researchers in fields such as sociology and business. This role includes participation in social politics [13], promotion and popularization of new products or services in business, and influence on the decisions made by other consumers [6,12]. According to the current literature, the main methods of identifying opinion leaders are as follows. (1) Identification based on network interaction structure: on the basis of the structure, in combination with users' social influence and the attributes of web links, these methods reflect users' centrality and prestige in social networks through web link addresses; examples are the well-known PageRank and HITS algorithms and social network analysis, which are used to identify opinion leaders [4,6,10,14]. With these methods, the network interaction structure is simulated with graph models to observe the importance of user nodes. This emphasizes the structure but fails to consider the comments of opinion leaders; moreover, when the number of network nodes grows along with the amount of information, the graph structure becomes so complicated that opinion leaders cannot be identified effectively [15]. (2) Identification combining network interaction and post contents: because of the limitations of depending solely on network interaction, much research identifies opinion leaders by combining post contents with network interaction. Based on social network analysis and user comments, Bodendorf and Kaiser explored the opinion leaders in online communities and the propagation trend of the public opinions they form [16]. Combining the features of network structure and user behavior with the emotional features of posts, and analyzing these multidimensional features, Cao et al. studied social network-based opinion leaders [17]. Li and Du constructed an opinion leader identification framework with blog contents, author attributes, reader attributes, and the network relationship between blog authors and readers to identify the opinion leaders committed to word-of-mouth marketing in online social blogs [5].
Although good results were achieved, the above research depended too much on influence or centrality, which made it impossible to reflect the quality of the contents published by opinion leaders and thus to identify opinion leaders accurately. Moreover, this research was based on social networks rather than on identification in the field of education. Using the features of expertise, novelty, influence, activity, longevity, and centrality, Li and Ma et al. built an indicator framework to identify opinion leaders [18]. Huang et al. identified superposters according to the quantity and quality of learners' posts in course forums [19]. Such work on opinion leader (superposter) identification in online learning forums is rare, and it fails to reflect the quality of superposters' posts and their role in cooperative study.
In the authors' view, superposters in learning forums are different from opinion leaders in social networks: they must have a certain cultural quality and cooperative study skills, which are not reflected in the above studies. Therefore, in consideration of the limitations of the abovementioned studies on opinion leader identification, this paper proposes a superposter identification framework based on language expression, content quality, and interaction structure, so as to identify superposters among the participating learners and learning supporters.

Superposter Identification Framework
The authors consider that superposters in online learning forums should (1) be active in posting and replying; (2) be excellent in language expression; and (3) post high-quality posts and have a good ability to learn, or be knowledgeable, or accurately reflect learning needs, or provide other assistance to online learners. These criteria not only reflect the importance of poster nodes in forum interaction but also indicate the authority of their posts. Based on these features, the paper proposes the framework shown in Table 1 for superposter identification in online learning forums (with Chinese as the working language). According to the definition given in this paper, for a superposter we expect language expression, content quality, and social network interaction to reflect, respectively, the language expression level of the learner, the quality of post contents, and the activity of interaction.

Social Network Interaction.
In social network analysis, degree centrality is an important index that measures the social interaction of individuals and a common index that evaluates their social status and prestige. It includes out-degree centrality and in-degree centrality. Out-degree centrality reflects the replies of a poster (learner or learning supporter) $a_i$ to others' posts, as expressed with the following formula: where $a_i$ is the $i$th learner of the learner set $A$; $N$ is the total number of learners (similarly hereinafter); NumReplyOut($a_i$) is the number of replies of $a_i$ to others, i.e., the number of out-links of node $a_i$ in the interactive network, which reflects the importance of the node position; NumReplyOut($a_i$) / $\sum_{j=1}^{N}$ NumReplyOut($a_j$) is the ratio of the number of $a_i$'s replies to others' posts to the total number of replies (excluding all self-replies), which reflects the degree of interaction in which $a_i$ participates. Traditional algorithms consider only NumReplyOut($a_i$)/($N$ - 1) and ignore the degree of interaction reflected by this ratio. In-degree centrality, also known as prestige [20], reflects the replies of other posters to the posts of $a_i$, as expressed with the following formula: where $a_i$ is the $i$th learner of set $A$; $N$ is the total number of learners; NumReplyIn($a_i$) is the number of in-links of node $a_i$ in the interactive network, which reflects the node prestige; and NumReplyIn($a_i$) / $\sum_{j=1}^{N}$ NumReplyIn($a_j$) is the ratio of the number of others' replies to the posts of $a_i$ to the total number of others' replies (excluding all self-replies), which reflects the centrality of $a_i$ in the interactive network but is rarely considered in traditional algorithms. Therefore, the index of social network interaction of $a_i$ is calculated as follows: where σ is a weighting parameter.
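The following is a minimal Python sketch of how the social network interaction index described above might be computed from a list of reply pairs. It rests on two assumptions that the text does not confirm: the two factors of each centrality (the count divided by N - 1 and the share of all non-self replies) are multiplied, and the index is the convex combination σ · out-degree + (1 - σ) · in-degree. The function name `sns_index` and the input layout are hypothetical.

```python
from collections import Counter


def sns_index(replies, users, sigma=0.5):
    """Score each user from (replier, target) pairs; self-replies are dropped.

    Assumption: each centrality is (count / (N - 1)) * (count / total replies),
    and the index is sigma * out-degree + (1 - sigma) * in-degree.
    """
    edges = [(src, dst) for src, dst in replies if src != dst]
    out_count = Counter(src for src, _ in edges)  # replies sent to others
    in_count = Counter(dst for _, dst in edges)   # replies received from others
    total = len(edges)
    n = len(users)

    scores = {}
    for user in users:
        if total == 0 or n < 2:
            scores[user] = 0.0
            continue
        dc_out = (out_count[user] / (n - 1)) * (out_count[user] / total)
        dc_in = (in_count[user] / (n - 1)) * (in_count[user] / total)
        scores[user] = sigma * dc_out + (1 - sigma) * dc_in
    return scores


example_users = ["teacher", "learner_a", "learner_b"]
example_replies = [
    ("teacher", "learner_a"),
    ("teacher", "learner_b"),
    ("learner_a", "teacher"),
    ("learner_a", "learner_a"),  # self-reply, ignored
]
example_scores = sns_index(example_replies, example_users, sigma=0.5)
```

With σ = 1 only the replies a user sends are counted, which corresponds to the unbalanced setting that the paper's discussion of σ warns about.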

Language Expression.
Much research on the identification of opinion leaders has failed to consider their language expression skill. However, whether in social networks or in online learning forums, an opinion leader or a superposter must express themselves fluently and possess a certain cultural quality. If a post consistently involves violent words or unclear expressions, then no matter how innovative or important it is, other users (learners in learning forums) may refuse to discuss it further. For this reason, this paper surveys language expression with three indexes: "word normalization," "term nonnormalization," and "language elegance." These are relatively easy to compute and may reflect the language expression skill of posters.

Word Normalization.
Word normalization surveys the frequency of Class I and Class II commonly used Chinese characters in posts and thus verifies the normalization of the words used by learners. When uncommon words are used in posts to appear intellectual, learners may find it difficult to achieve optimal learning, which limits the spread of information. To facilitate the survey, the index of the normalization of the words used by $a_i$ is defined as follows: where CH I Freq($a_i$) and CH II Freq($a_i$), respectively, are the frequencies of Class I and Class II commonly used Chinese characters in all posts of $a_i$; Total CH Freq($a_i$) is the total frequency of Chinese characters in all posts of $a_i$; and CH I Type($a_i$) and CH II Type($a_i$), respectively, are the numbers of types of Class I and Class II Chinese characters in all posts of $a_i$. The constants 2500 and 1000, respectively, are the numbers of types of Class I and Class II Chinese characters; 0.9 and 0.1 are weighting parameters set empirically.

Table 1 (partial): features of the superposter identification framework.
Feature | Purpose
... | To survey the similarity between the reply content and the thread title
Expertise of content (Ec I) | To survey the knowledge points involved in posts
Social network interaction (Sns Index) | ...
... | To survey the intermediate status of posters in network structure
In-degree centrality (DC I) | To survey the authority of posters in network structure

Term Nonnormalization.
Term nonnormalization surveys the use of uncivil words by learners (Internet users). Such usage involves impolite, violent, and vulgar words and some Internet slang in forum exchanges. Thus, to further analyze the normalization of the words used by surveying the use of uncivil words and Internet slang, the paper defines the index of term nonnormalization of $a_i$ as follows: where $C = 1.2$ is a constant reflecting that uncivil words are more improper than Internet slang, and UnC W Freq($a_i$), Net W Freq($a_i$), and TotalWFreq($a_i$), respectively, are the frequency of uncivil words, the frequency of Internet slang, and the total frequency of words in all posts of $a_i$.
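As a rough illustration of the quantities named above, the sketch below counts uncivil words and Internet slang in a tokenized post collection and combines them as (C · uncivil + slang) / total. This combination is an assumption made only for illustration: the text confirms that C = 1.2 weights uncivil words more heavily, but not the exact form of the formula. The function name and the word lists are hypothetical, and the input is assumed to be already segmented into words.

```python
def term_nonnormalization(tokens, uncivil_words, net_slang, c=1.2):
    """Weighted share of uncivil words and Internet slang among all tokens.

    Assumption: the index is (c * uncivil_count + slang_count) / total_count,
    with c = 1.2 so that uncivil words weigh more than slang.
    """
    if not tokens:
        return 0.0
    uncivil_count = sum(1 for token in tokens if token in uncivil_words)
    slang_count = sum(1 for token in tokens if token in net_slang)
    return (c * uncivil_count + slang_count) / len(tokens)


# Hypothetical word lists and a five-token example.
example_value = term_nonnormalization(
    ["this", "stupid", "formula", "lol", "fails"],
    uncivil_words={"stupid"},
    net_slang={"lol"},
)
```

A poster who uses neither uncivil words nor slang receives 0 under this reading, and a larger value indicates less normalized language.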

Language Elegance.
Language elegance surveys the use of fixed phrases (including fixed terms, phrases, and idioms) in posts. Although language derives from life, we cannot deny that "individualized teaching" (yin cai shi jiao, 因材施教, a Chinese idiom) is more concise, refined, and elegant than "adopting different teaching methods for different students." If such expressions are frequently used in a post, they reveal the vocabulary and language mastery of the poster. Accordingly, the paper assesses language expression ability on this basis. The language elegance of $a_i$ is calculated as follows: where CW1 Freq($a_i$), CW2 Freq($a_i$), Idioms Freq($a_i$), and Total W Freq($a_i$) are the frequency of commonly used Class I words, the frequency of commonly used Class II words, the frequency of idioms, and the total frequency of words in all posts of $a_i$, respectively; CW1Type($a_i$), CW2Type($a_i$), and Idioms Type($a_i$), respectively, are the numbers of types of Class I words, Class II words, and idioms used; Type 1 Num, Type 2 Num, and Type I Num, respectively, are the total numbers of types of Class I words, Class II words, and idioms; and $\sigma_1 = 0.25$ and $\sigma_2 = 0.35$ are constants, namely coefficients taken from locally optimal solutions obtained through repeated experiments and from empirical values. Therefore, the index of language expression of $a_i$ is calculated as follows: where $\vartheta_j$ ($j = 1, 2$) are weighting parameters.

Content Quality.
The content quality of posts directly affects the result of interaction in online learning forums. Therefore, surveying content quality is very important in identifying superposters. In surveying the quality of the contents posted by superposters, the paper focuses on three questions. That is, if $a_i$ is a superposter, his or her posts are expected to be helpful to others in the process of interaction, highly relevant to the topic (rather than spam or meaningless posts), and rich in professional knowledge points. Therefore, the paper evaluates content quality based on learning collaboration, correlation with the thread, and expertise of content.

Learning Collaboration.
Learning collaboration, which is mainly used to observe the role of posts and interaction activities in supporting participants' learning, surveys whether post contents help others to solve problems. The learning collaboration of $a_i$ is defined as follows: where HelpPostNum($a_i$), TotalPostNum($a_i$), and BeneficiaryNum($a_i$), respectively, are the number of helpful posts of $a_i$, the total number of posts of $a_i$, and the number of beneficiaries of the posts of $a_i$. The present difficulty is how to confirm whether a post of $a_i$ helps others to solve problems. There is related research on this question, including manual confirmation, which is time and labour consuming and unsuitable for massive corpora, and automatic confirmation, which identifies question-answer pairs in forums by combining rules with forum structure and achieves good results in extraction experiments [21]. In addition, helpfulness can also be confirmed by votes in forums [22]. Following the second (automatic) method, the paper confirms the data about helpful posts and beneficiaries by rules and statistics.

Correlation with the Thread.
Correlation with the thread title is the correlation of replies with the topic discussed in the main post. In discussions in online learning forums, learners often post irrelevant comments about question B in a thread where question A is discussed. A superposter should never or only rarely do so and should comment in accordance with the thread. Therefore, the correlation of the replies of $a_i$ with the topic is calculated as follows: where CorrNum($a_i$) and Total Reply Num($a_i$), respectively, are the number of the replies of $a_i$ considered relevant to the target topic (main post) and the total number of the replies of $a_i$. The correlation between a reply of $a_i$ and the target thread is calculated with the cosine value.
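The sketch below shows one way the ratio CorrNum / Total Reply Num could be computed with a bag-of-words cosine similarity. The vector representation, the relevance threshold of 0.2, and the function names are assumptions for illustration only; the paper states that the cosine value is used but does not give these details. Replies and thread titles are assumed to be already segmented into tokens (Chinese text would need a word segmenter first).

```python
import math
from collections import Counter


def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists using raw term counts."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(v * v for v in vec_a.values()))
    norm_b = math.sqrt(sum(v * v for v in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def thread_correlation(reply_thread_pairs, threshold=0.2):
    """Share of a poster's replies judged relevant to their thread.

    reply_thread_pairs: list of (reply_tokens, thread_tokens) for one poster.
    threshold: hypothetical cut-off above which a reply counts as relevant.
    """
    if not reply_thread_pairs:
        return 0.0
    relevant = sum(
        1 for reply, thread in reply_thread_pairs
        if cosine(reply, thread) >= threshold
    )
    return relevant / len(reply_thread_pairs)


thread = ["excel", "formula", "error"]
example_corr = thread_correlation([
    (["excel", "sum", "formula"], thread),  # on topic
    (["weather", "nice"], thread),          # off topic
])
```

Raw term counts keep the example short; a weighting scheme such as TF-IDF could be substituted without changing the ratio itself.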

Expertise of Content.
Expertise of content refers to the course knowledge points involved in a post, by which the index of expertise of content in the posts of $a_i$ in the $C_j$ forum can be calculated as follows: where KnowledgePointNum($a_i$, $C_j$), KnowledgePostNum($a_i$, $C_j$), and TotalPostNum($a_i$, $C_j$), respectively, are the frequency of the knowledge points of $C_j$ included in the posts of $a_i$, the number of posts that contain at least 1 knowledge point, and the total number of posts posted in the $C_j$ forum. TotalKnowledgePointNum($C_j$) is the total frequency of the knowledge points of $C_j$ in the forum. Each knowledge point is counted as appearing either 1 or 0 times; it is not counted repeatedly. Ec I($a_i$) is the index of expertise of content of the posts sent by $a_i$ in the $C_j$ ($j = 1, \ldots, K$) forum. Accordingly, the content quality of $a_i$ can be calculated as follows: where $\mu_i$ ($i = 1, 2$) are weighting parameters.

Superposter Index.
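With the MIN-MAX method, the paper normalizes the results of Le Index, Cq Index, and Sns Index. For example, Le Index can be normalized to the range from 0 to 100 with the following formula:

$$\mathrm{NLe\ Index}(a_i) = \frac{\mathrm{Le\ Index}(a_i) - \min_{j} \mathrm{Le\ Index}(a_j)}{\max_{j} \mathrm{Le\ Index}(a_j) - \min_{j} \mathrm{Le\ Index}(a_j)} \times 100. \tag{12}$$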
Similarly, we may obtain NCq Index and NSns Index. Finally, the superposter index (Super I) is calculated as follows: where α and β are weighting parameters that may be set according to the actual situation. The whole procedure is summarized as Algorithm 1 (ALG_Super_1).
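The normalization and combination steps can be sketched in Python as follows. The min-max scaling to the range 0 to 100 follows the description above. The combination rule is an assumption: because only two weights (α and β) are named for three features, the sketch gives the third feature the weight 1 - α - β. The function names are hypothetical.

```python
def min_max_0_100(values):
    """Min-max normalize a {user: score} mapping to the range 0 to 100."""
    lowest, highest = min(values.values()), max(values.values())
    if highest == lowest:
        # All users share one score, so no ranking information is available.
        return {user: 0.0 for user in values}
    span = highest - lowest
    return {user: 100.0 * (score - lowest) / span for user, score in values.items()}


def super_index(le, cq, sns, alpha, beta):
    """Combine the three normalized indexes into one score per user.

    Assumption: Super I = alpha * NLe + beta * NCq + (1 - alpha - beta) * NSns.
    """
    n_le, n_cq, n_sns = min_max_0_100(le), min_max_0_100(cq), min_max_0_100(sns)
    gamma = 1.0 - alpha - beta
    return {
        user: alpha * n_le[user] + beta * n_cq[user] + gamma * n_sns[user]
        for user in le
    }


example_le = {"a": 1.0, "b": 3.0, "c": 2.0}
example_cq = {"a": 0.2, "b": 0.4, "c": 0.2}
example_sns = {"a": 5.0, "b": 5.0, "c": 10.0}
example_super = super_index(example_le, example_cq, example_sns, alpha=0.3, beta=0.5)
ranking = sorted(example_super, key=example_super.get, reverse=True)
```

Sorting users by the combined score in descending order gives the ranked list from which a TOP M cut can be taken.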

Data Set.
The data in the paper were downloaded from the Q&A forum [23] of the online learning course Computer Application Foundation. The dataset includes 7494 subjects, 22369 posts, and 6747 participants (6712 learners and 35 teachers). Among the 35 teachers, 28 were found to meet the defined conditions of a superposter through sampling and analysis of the data about their posts (see Table 2). Therefore, these 28 teachers were regarded as superposters, to be identified with the method proposed in the paper. The remaining 7 teachers did not qualify as superposters: 4 of them posted only 2 posts each, and 3 posted only 1 post each. In addition, to count the number of knowledge points in posts, the paper constructs an online unified examination knowledge point set based on the Fundamentals of Computer Application [24].

Evaluation.
There are as yet no mature and recognized methods of assessing superposter identification. In this section, evaluation is made with the following indexes [11].
The average accuracy of TOP M (Avg P@M) is as follows:
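As an illustration only, the sketch below computes a TOP M average precision under one common reading: P@k is the share of true superposters among the k highest-ranked users, and Avg P@M is the mean of P@k for k = 1, ..., M. Whether this matches the definition in [11] is an assumption, and the function names are hypothetical.

```python
def precision_at(ranked_users, gold, k):
    """Share of the top-k ranked users that belong to the gold set."""
    if k <= 0:
        return 0.0
    hits = sum(1 for user in ranked_users[:k] if user in gold)
    return hits / k


def avg_precision_at(ranked_users, gold, m):
    """Mean of P@k over k = 1..m (assumed reading of Avg P@M)."""
    m = min(m, len(ranked_users))
    if m == 0:
        return 0.0
    return sum(precision_at(ranked_users, gold, k) for k in range(1, m + 1)) / m


# Toy ranking: t1 and t2 are the known superposters.
example_ranking = ["t1", "s1", "t2", "s2"]
example_gold = {"t1", "t2"}
example_avg = avg_precision_at(example_ranking, example_gold, 4)
```

In the paper's setting the gold set would be the 28 teachers regarded as superposters and the ranking would come from sorting users by their superposter index.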

Result Analysis.
With the three feature indexes of the model, namely language expression (L), content quality (C), and network interaction structure (S), the paper tests the effect of identifying the 28 superposters. Through repeated weighting tests on the data, the weighting parameters are set as in Table 3, and the results and statistical analysis are described in Tables 4 and 5, respectively. The comparison of the results of our algorithm with those of the PageRank algorithm (PR) is shown in Tables 5 and 6 and Figures 1-3. According to Table 5: ① The model indexes achieve good superposter identification results, and the model is very effective on the dataset. ② In terms of the identification result achieved by each single feature, content quality, which reaches an average accuracy of over 0.9, is the best. Language expression, which works much like an N-gram model, correctly identifies only 14 superposters among the TOP 28 and is considered good. Social network structure, a common index in social network analysis, correctly identifies 22 among the TOP 28 and is considered better. ③ Although a single feature is unable to perform well, combinations of features may achieve striking effects: with combinations such as LC, CS, and LCS, 24 superposters are correctly identified among the TOP 28. With the combinations LC and LCS, every user ranked in the TOP 15 is a superposter, which indicates that the feature designs are rational and effective. ④ The single feature L performs only moderately, but its combination with other features performs well, especially LC. It is unclear whether this means that the two models, both of which are content based, complement each other structurally. ⑤ Among the TOP 28, 14 superposters are identified by the L and LS models.
This shows that language expression depends greatly on the length of the text and is not sensitive to superposters who write short texts, although the design tried to avoid this case. Since the number of superposters identified by LS is smaller than that identified by S, it is unclear whether this means that L and S have something in common. However, L is a content-based result and S is a graph-based (network interaction) result, so they differ in structure. It is also unclear whether the result deteriorates because the two models, having different structures, are mutually exclusive. The same happens with the CS model, whose result is inferior to that of C alone, which we cannot explain in this study. Notably, within the TOP 51, all 28 superposters can be identified. From Table 5, the trend charts for the identification results and for the average identification results achieved by each feature in different cases can be obtained (see Figures 1 and 2). Compared with the PageRank algorithm, our algorithm (LCS) gives a better experimental result (see Tables 5 and 6 and Figures 1-3).

Discussion.
In social networks, it is widely believed that out-degree is as important as in-degree. In testing the weighting parameter of S, we found that when σ = 1, a locally good result was achieved and other values improved slightly. However, this setting is very unbalanced: it considers only the replies to others' posts and neglects others' replies to one's own posts, so it is not universal. By observing the data set, we found that the number of threads differs from the number of replies, especially for the 28 teachers. In consideration of the generality of the model, the paper therefore sets σ = 0.5. Other parameters are set according to the optimum effects achieved by a single feature in experiments.

Algorithm 1: ALG_Super_1.
Input: the post dataset of online learners, D; training parameter set, P.
(2) For each online learner $a_i$:
    (1) Compute Sns Index($a_i$) by using formula (4);
    (2) Compute Le Index($a_i$) by using formula (7);
    (3) Compute Cq Index($a_i$) by using formula (11).
According to the intermediate results based on content quality, we found that some data were lost, such as the numbers of beneficiaries and helpful posts, especially for learning supporters. For example, the posts and replies sent by teachers were related to the course and helpful to learners; therefore, all of the learners who participated in the interaction were beneficiaries, and the posts of the teachers should be considered helpful. However, the paper was unable to obtain such information accurately, which led to a loss of related data and affected the identification of the teachers as superposters (the relationship between recall and accuracy can be seen in Table 6 and Figure 3). Nevertheless, with the model (LCS), all of the 28 superposters can be identified within the TOP 51, whereas the PageRank algorithm achieves this only at TOP 571.

Conclusion
Through analysis of the data on the posts sent by learners in online learning forums, the paper proposed a superposter identification model based on characters, words, and network interaction structure. First, through analysis of the network interaction of users based on graph structure, the paper calculated the out-degree centrality and in-degree centrality of each user node in the network, which involve both the interaction breadth and depth of each node, so as to determine its activity and importance in the interactive network. Then, learners' language expression was included in the identification framework, including word normalization, term nonnormalization, and language elegance, by which the normalization of the words and terms used by learners and their basic mastery of language are judged. The third dimension of features, content quality, is the most important in online learning forums; it includes learning collaboration, correlation with the thread, and expertise of content. An online learning forum is designed to facilitate cooperation between learners and interaction about learning contents. Learning collaboration mainly considers whether a post is helpful to others in study; correlation with the thread verifies the correlation between a post and the topic discussed in the thread; expertise of content surveys whether course knowledge points are included in a post. Accordingly, each of the three indexes plays a targeted role in online learning forums.
Although there are some deficiencies in the design of the superposter identification model, such as the need for repeated experiments to set the weighting parameters manually, the proposed method achieved a good result in identifying the preset 28 superposters. Because the method is easy to implement and involves little computation, it is worth applying in practical online learning systems.

Data Availability
Research data related to learners' personal privacy cannot be shared.

Conflicts of Interest
The authors declare that they have no conflicts of interest.