Understanding the Dynamics of Knowledge Building Process in Online Knowledge-Sharing Platform: A Structural Analysis of Zhihu Tag Network

Through a structural analysis of 8 years of tag networks from an online knowledge-sharing platform, this study finds that, as the tag network grows quickly in scale, the growth trend of the number of edges indicates that the tag network follows the densification law. The clustering coefficient and the average shortest path of the network show that the rapid growth of network size does not bring about the compartmentalization of the knowledge network, and the degree distribution of the tag network follows a truncated power-law distribution. According to the structural characteristics of the tag network, this study proposes a tag network model based on the BA model. On top of preferential attachment, the triadic closure mechanism is employed to construct edges between old nodes, which addresses the limitation that the BA model only creates edges between old and new nodes. The results show that the simulation model matches the actual tag network structure well. The generation mechanism of the tag network model provides a reference for understanding the knowledge construction process of online knowledge-sharing platforms to a certain extent.


Introduction
With the success of crowdsourced platforms, such as Wikipedia, Stack Overflow, Quora, and GitHub, many researchers have been drawn toward understanding the dynamics of knowledge building on these platforms [1]. The online knowledge-sharing system is an "all ask all" knowledge-building system, where users spontaneously ask and answer questions, and most of the platforms let users add knowledge tags to questions in the process of asking.
Compared with the traditional expert-driven knowledge production mode, the users' self-organized knowledge construction process is more distributed and diversified, which generates different driving forces for the development of the knowledge domain. The investigation of the dynamic development of the knowledge network is important basic work for understanding the process of knowledge production and construction, and it will help in the exploration of the development trend of the knowledge domain, knowledge innovation, and other issues [2].
Although collaborative knowledge building platforms are the most popular knowledge production mode, limited research has been done from the perspective of knowledge tag networks. Being composed of a set of concepts and interrelationships, knowledge can be effectively represented in terms of a network or a graph [3]. Therefore, our research aimed to explain the online knowledge building process from the perspective of network analysis and provide a reasonable explanation of the mechanism of knowledge network generation.

Related Works
To explore the development mechanism and research trends of knowledge domains, numerous researchers construct keyword cooccurrence networks using keywords of articles as knowledge elements and then analyze the characteristics and evolution of the networks. As important components of text content, keywords can present the core idea of academic articles. The analysis of topological features and evolution of keyword cooccurrence networks helps in the realization of in-depth content analysis [4]. Zhu et al. [5] observed that keyword networks are small worlds based on the network clustering coefficient and the average distance among reachable nodes in a network, and used betweenness centrality to conduct preliminary studies on how to detect research hotspots of a discipline based on keyword networks. Yi and Choi [6] used the keywords of articles as an alternative knowledge element and studied the structural characteristics of keyword networks to better understand scientific knowledge. By observing snapshots of keyword networks over time, Behrouzi et al. [7] utilized link prediction methods to foresee the future structures of these networks, which helps in the prediction of future scientific research trends. Knowledge tags provide a concise overview of the important content and key points of a question. The relationship between knowledge tags and questions is equivalent to that between keywords and articles. With increased data availability on social media platforms, several research works have focused on tag networks. Feng et al. [8] conducted a structural analysis of a knowledge tag network based on the motif structure and observed that the core nodes of the tag network have a strong attraction to other nodes; thus, a large number of knowledge tags are distributed at the periphery of very few center tags. Zhang et al. [9] described the content characteristics of the knowledge frontier of online knowledge-sharing platforms based on the theory of collaborative knowledge construction and assessed the knowledge frontier inclusiveness of online knowledge-sharing platforms. Chen and Xing [10] proposed an approach to automatically mine the technology landscape from Stack Overflow question tags. Structured knowledge of technologies can emerge from the tagging practices of millions of online users considered together.
Although tag data in online platforms have been studied for a long time, a limited number of works aim to investigate the knowledge-building process through the evolution of tag networks. Many of the previous studies on the tag networks of question-and-answer (Q&A) platforms focus on tag recommendations [10,11], whereas the mechanisms of knowledge tag network generation are not explored. The collaborative knowledge building of the online knowledge-sharing system forms a Collaborative Authoring Environment for online community participation by multiple people through the "question-answer" form, which is an important way for the public to exchange, share, and collaborate on knowledge. Users can browse, discuss, and produce content freely and openly and can create new questions and tags by asking questions, enabling rapid growth of knowledge in the system. Collaborative knowledge building based on an online knowledge-sharing system is the process of coevolution of individual knowledge and group knowledge [13] and realizes knowledge building in the modern sense. The generation of the knowledge tag network is the result of users' collaborative knowledge production. By adding new knowledge tags to the system and adding associations between tags, users keep the knowledge network developing. The change in the knowledge tag network reflects the changes in the audience's knowledge concerns and the dynamic process of knowledge evolution. The investigation of the mechanism of such networks can provide references for understanding knowledge production and predicting knowledge development trends.
To explain the mechanisms of knowledge-tag network generation, the degree distribution can be an important reference property. If the distribution of the node degree follows a power law [14], then the graph is a scale-free network [15], which can be found in numerous complex phenomena in the real world. Barabasi and Albert proposed the BA model [16], and they suggested that the power-law distribution of degree is a consequence of two generic mechanisms: (i) continuous network expansion by the addition of new nodes and (ii) newly arriving nodes tending to connect with already well-connected nodes, known as "preferential attachment." Although the BA model is one of the most classic and widely applied of the many models available [17][18][19], it still has several limitations that make it inapplicable to numerous real-world networks. The actual network often has certain non-power-law characteristics, such as exponential truncation and small variable saturation [20]. A number of authors have subsequently published more extensive simulation results based on the BA model. The improvements of the BA model can be approximately divided into two directions. One is to add new information dimensions so that the model conforms to different realistic systems. The other is to slightly adjust the mechanism of connecting edges so that it conforms to the characteristics of diverse systems.
First, the simulation rules could be adjusted by introducing new variables or parameters into the model. For example, Bianconi and Barabási [21] proposed a fitness model that reflects the basic properties of most real systems, in which the nodes compete for links with other nodes; thus, a node can acquire links only at the expense of the other nodes. Xiang and Zhao proposed a modified BA model in which connecting decisions of new nodes are motivated by different proximities [22]. The simulation results showed that the degree distribution still follows power laws, and the peripheral nodes are less dependent on core actors in accessing external knowledge.
Second, unlike the BA model, which only adds edges when a new node arrives, numerous real-world networks both add new nodes and form new connections between existing nodes inside the network. Although the degree distributions of nodes in these networks also take on a power-law-like form, their generative patterns are more complex than those of the BA model. For example, by adjusting the connecting rule, [23] proposed models of developing and decaying networks with undirected links that show scaling behaviors. In addition to new links connecting new sites and old ones, links between old sites may appear or break [24]. This also involves extending the BA model by allowing the number of newly added links to be random, under some mild assumptions on its distribution law. The modified model can create new nodes with a high degree at any iteration, which seems capable of simulating the temporal behavior of real networks more realistically.
Following the second improvement approach, one well-known change in the edge connection mechanism is Holme and Kim's extension of the standard scale-free network model to include a "triad formation step" [25]. They formulated that when a new node v is added to the network, v connects to an old node m following the preferential attachment mechanism, and then, with probability P, v also connects to a neighbor of m. They found that, with the BA model and the triadic closure mechanism combined, this model possesses the same characteristics as standard scale-free networks, such as the power-law degree distribution and the small average geodesic length, but with high clustering at the same time.
Triadic closure is a natural mechanism to make new connections, especially in social networks [26]. Suppose that two people have a mutual friend in the social network; the likelihood of them becoming friends in the future increases.
This mechanism has been reported as the most common structural constraint [27]. It can explain many salient features of empirical social networks, including numerous closed triangles between acquaintances and fat-tailed degree distributions [28]. This mechanism produces dense edge connections and can be one of the reasons for network community structure. As a keyword network clustered by topic, the tag network also has a prominent community structure. In practice, the edge connections of the tag network may also be in line with the triadic closure mechanism. For example, when tag A is connected to tag B and tag C simultaneously, tag B and tag C are also more likely to be semantically associated.
In this research, we first describe the essential structural characteristics of the tag network, which is constructed from the data of an online knowledge-sharing platform. We then propose a tag network simulation model based on the BA model and the triadic closure mechanism. For complex and large-scale networks, the exploration of network characteristics and the simulation of the generative mechanisms are significant. The network analysis results reveal the evolutionary features of the knowledge network and help us understand the process of online knowledge building.

Data.
Zhihu is the largest online Q&A platform in China [9]. Like most Q&A platforms, Zhihu allows users to add multiple tags to their questions, similar to the keywords in an article (see Figure 1). Users can add tags that they have built themselves or select old tags that have already been built by other users. In the analysis of knowledge networks, the multiple tags appearing in the same question can be considered to have cooccurrence relationships [3].
This study used 74,761 tags contained in 1,520,254 questions from January 1, 2011, to December 31, 2018, in Zhihu. The number of tags and the cooccurrence relationships are cumulatively calculated every two months, and the line shows that the cooccurrence relationships have an obvious increasing trend in the last few months (see Figure 2).
To demonstrate the dynamic development process of the network, this study first splits the 8 years of data into 48 time periods based on a time window of 2 months. In each time period, the tags appearing in the same question were connected to build an undirected tag network. Finally, 48 network slices in total were constructed.
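The slicing step can be illustrated with a minimal sketch. The input format (a list of `(year, month, tags)` tuples) and the function name `build_slices` are assumptions made for illustration; they are not taken from the study's code.

```python
from collections import defaultdict
from itertools import combinations


def build_slices(questions, window_months=2, start_year=2011):
    """Group questions into time windows and build one undirected edge set per window.

    `questions` is an iterable of (year, month, tags) tuples (hypothetical format).
    Returns {window_index: set of frozenset({tag_a, tag_b})}.
    """
    slices = defaultdict(set)
    for year, month, tags in questions:
        # Zero-based index of the two-month window that contains this question.
        index = ((year - start_year) * 12 + (month - 1)) // window_months
        # Every pair of distinct tags on the same question co-occurs.
        for a, b in combinations(sorted(set(tags)), 2):
            slices[index].add(frozenset((a, b)))
    return dict(slices)


# Toy questions; the tag names are borrowed from the paper's examples.
questions = [
    (2011, 1, ["movie", "life"]),
    (2011, 2, ["movie", "law", "life"]),
    (2011, 3, ["psychology", "life"]),
]
slices = build_slices(questions)
```

With a two-month window starting in January 2011, a question from December 2018 falls into window index 47, which gives the 48 slices described above.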

Tag Network Characteristics.
First, this study calculates the network characteristics of the Zhihu tag network. We counted the number of nodes, the number of edges, the clustering coefficient, and the average shortest path of the largest connected component for each network slice (Table 1). As shown in Figures 3(a) and 3(b), the tag network size developed slowly in the first 10 time periods, and the number of nodes exhibited a significant upward trend around the 10th to 20th time periods. The number of nodes was relatively stable in the 30th to 40th time periods, followed by a very sharp increase in the 40th period. The number of nodes fell back in the last two time periods. These trends were consistent with the business strategy and development of Zhihu in China. From Figures 3(c) and 3(d), the tag network is a relatively dense network. Although the node size of the network is increasing, the network does not become compartmentalized as a result, and the average shortest path and clustering coefficient of the network remained at a relatively stable level despite a slow decline.
In addition, Leskovec et al. [29] observed that as a network grows, its diameter decreases over time, suggesting that the network "shrinks" or becomes denser, which challenged the existing belief that online social networks evolve with a constant average degree and a slowly growing diameter. The network diameter is the maximum node distance [30]. Numerous real networks have small diameters, indicating small worlds. However, diameter is not always the best metric, because it is difficult to compute, and it is prone to outlier effects [31]. Thus, the effective diameter of each network slice was calculated (Table 1). The effective diameter is the smallest natural number d such that the proportion of node pairs whose shortest path length is less than or equal to d reaches 0.9 [29]. As shown in Figure 4, the effective diameter of the network slices showed a slowly decreasing trend.
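As an illustration of the definition, the following sketch computes an integer effective diameter by breadth-first search over an adjacency dictionary. It is a simple reading of the 0.9 criterion, not the study's implementation; the function name and input format are assumptions, and the all-pairs search shown here is only practical for small graphs.

```python
from collections import deque


def effective_diameter(adjacency, quantile=0.9):
    """Smallest integer d such that at least `quantile` of reachable node pairs lie within d hops."""
    distance_counts = {}
    for source in adjacency:
        # Breadth-first search gives shortest path lengths in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adjacency[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        for target, d in dist.items():
            if target != source:
                distance_counts[d] = distance_counts.get(d, 0) + 1
    total = sum(distance_counts.values())
    if total == 0:
        return 0  # No connected pairs at all.
    covered = 0
    for d in sorted(distance_counts):
        covered += distance_counts[d]
        if covered >= quantile * total:
            return d


# Toy graphs: a five-node path and a triangle.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c", "e"], "e": ["d"]}
triangle = {1: [2, 3], 2: [1, 3], 3: [1, 2]}
```

Leskovec et al. [29] also describe a smoothed, interpolated variant; the integer version above follows the "natural number d" wording used here.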

Complexity
In the real world, most systems experience a slow decrease in diameter due to the rapid growth in the number of edges. The growth of the numbers of nodes and edges shows a power function relationship ($e(t) \propto n(t)^{\alpha}$). Leskovec et al. [29] called this phenomenon the densification law. Here, we constructed a complete network with 8 years of data and counted the numbers of nodes and edges of the network every two months. Figure 2 shows the cumulative numbers of nodes and edges of the network. Using the cumulative number of nodes at each moment as the horizontal coordinate and the cumulative number of edges as the vertical coordinate, the numbers of nodes and edges of the network almost approached a straight line in double logarithmic coordinates (see Figure 5). A linear regression model fitted to the scattered points yielded a slope of 1.66, an intercept of −1.92, and a goodness of fit of 0.98.
Figure 6 shows the overall degree distribution of the network with 74,761 nodes in the double logarithmic coordinate. The horizontal axis denotes the value of degree, and the vertical axis represents the frequency of degree. The degree distribution showed that, as a whole, the growth of the tag network may be roughly in line with the preferential attachment mechanism; that is, the nodes that newly joined the network had a higher probability of connecting to the large-degree nodes, which led to the eventual appearance of the "rich get richer" power-law distribution of the network. However, the degree distribution of the tag network deviated from the power-law distribution at the tail end, which reflects that after the node scale of the network grew to a certain extent, the growth of edges of large-degree nodes was close to saturation and became limited. We counted the degree distribution during the growth of the network on a bimonthly basis and screened out the top 100 nodes in degree rank. A total of 232 large-degree nodes were screened out at the 48 time points. Next, we screened out a total of 28 nodes that ranked in the top 100 from the beginning of the network until the 48th time point. As shown in Table 2, these nodes are broad and abstract concepts, such as "life," "movie," "law," "educate," and "psychology."
We fitted the tag network degree distribution with the powerlaw package in Python. Figure 7 shows that the degree distribution of the network was closer to a truncated power-law distribution (the red line) than to a standard power-law distribution (the blue line). The truncated power-law distribution is a common alternative to the asymptotic power-law distribution because it naturally captures finite-size effects [32]. Several measured social networks do not follow a power-law degree distribution [33] and are best fitted by an exponentially truncated power-law distribution. Clauset et al. [34] gave the basic truncated power-law functional form $f(k)$ (equation (1)) and the appropriate normalization constant $C$ (equation (2)) such that $\int_{k_{\min}}^{\infty} C f(k)\, dk = 1$ for the continuous case. The distribution is $P(k) = C f(k)$, where $k$ is the degree of a node. The fitting results showed that the degree distribution of the tag network fitted the truncated power-law distribution with $\alpha = 2.12$ and $\lambda = 0.0003$.
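For reference, the continuous power law with exponential cutoff is commonly written as below, following the form tabulated by Clauset et al. [34]. The notation ($k$, $k_{\min}$, the upper incomplete gamma function $\Gamma(\cdot,\cdot)$) is an assumption about how equations (1) and (2) were typeset.

```latex
f(k) = k^{-\alpha} e^{-\lambda k},
\qquad
C = \frac{\lambda^{1-\alpha}}{\Gamma\!\left(1-\alpha,\ \lambda k_{\min}\right)},
\qquad
P(k) = C\, f(k),
\quad
\int_{k_{\min}}^{\infty} C\, f(k)\, dk = 1
```

Under this form, the exponential factor starts to dominate around $k \approx 1/\lambda$. For the fitted value $\lambda = 0.0003$, that is roughly $k \approx 3{,}300$, the scale at which the tail would be expected to bend away from a pure power law.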

4.1. Model.
The BA model is a classic model in the field of complex networks, and its simple mechanism can explain the power-law phenomenon in real networks; many simulation studies of realistic mechanisms are based on the BA model [35,36]. The two basic mechanisms of the BA model are as follows: (1) new nodes are constantly added to the network, and (2) the newly added nodes are more inclined to connect with large-degree nodes. However, such a mechanism differs from the actual production process of knowledge tags. First, from the perspective of knowledge building, new knowledge will be continuously produced, and new associations will also be generated between existing old knowledge concepts in the knowledge space. Therefore, in a tag network, connections are generated not only when a new node joins but also between old nodes at any time. Regarding how to construct edges between old nodes, this study draws on the mechanism of triadic closure. For node A, if nodes B and C are both neighbors of A but B and C are not connected, it is likely that an edge will be generated between B and C at a subsequent moment. Second, although "preferential attachment" exists in the tag network, that is, large-degree nodes (such as the more commonly used concepts with broader semantics) are more easily connected, this advantage gradually weakens once the network scale increases to a certain extent. In view of the shortcomings of the BA model and the knowledge building process, this paper proposed a tag network model based on the BA model, aiming to present the generative mechanism of the tag network of online knowledge-sharing platforms.
Based on these features, the specific algorithm of model generation is as follows (see Figure 8):
Step 1: a single node without edges exists in the initial network.
Step 2: an action is selected between "add new node" and "add edge of old nodes" based on the current probability P. That is, "add new node" is chosen with probability P and "add edge of old nodes" with probability (1 − P). P is a function of the numbers of nodes and edges of the network ($P = f(n, e)$, where n is the number of nodes and e is the number of edges). The details of this function will be explained later.

Step 3: if the action "add new node" is selected, then a new node v is added, and a node m is selected from the current network G based on its degree: the larger the degree of a node, the more likely it is to be selected. An edge is then added between nodes v and m. Otherwise, if the action is "add edge of old nodes," the algorithm likewise selects a node m from the current G based on its degree. Then, a node mn that is not yet connected with m is selected from the second-order neighbors of m, and an edge is added between m and mn.
Step 4: steps 2 and 3 are repeated until the number of nodes reaches the target number N.
In this algorithm, the probability P determines whether the current time step will add new nodes or edges to the network. P is the probability of adding new nodes, and 1 − P is the probability of adding edges between existing nodes.
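The four steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the probability function is passed in as a hypothetical callable `new_node_prob(n, e)` because its form is not fixed here, the degree threshold is omitted, and degree-based selection uses `degree + 1` as the weight so that the initial isolated node can be chosen.

```python
import random


def simulate_tag_network(target_nodes, new_node_prob, seed=0):
    """Grow an undirected graph by node addition and triadic closure.

    `new_node_prob(n, e)` returns the probability P of adding a node given the
    current node count n and edge count e (hypothetical interface).
    Returns (neighbors, edge_count), where neighbors maps node -> set of nodes.
    """
    rng = random.Random(seed)
    neighbors = {0: set()}  # Step 1: one isolated node.
    edge_count = 0

    def pick_by_degree():
        # Degree-proportional choice; the +1 is an assumption, see the lead-in.
        nodes = list(neighbors)
        weights = [len(neighbors[n]) + 1 for n in nodes]
        return rng.choices(nodes, weights=weights, k=1)[0]

    while len(neighbors) < target_nodes:  # Step 4: repeat until N nodes.
        p = new_node_prob(len(neighbors), edge_count)  # Step 2.
        m = pick_by_degree()
        if rng.random() < p:
            # Step 3, "add new node": connect new node v to m.
            v = len(neighbors)
            neighbors[v] = {m}
            neighbors[m].add(v)
            edge_count += 1
        else:
            # Step 3, "add edge of old nodes": close a triangle around m.
            candidates = set()
            for u in neighbors[m]:
                candidates |= neighbors[u]
            candidates -= neighbors[m]
            candidates.discard(m)
            if candidates:
                mn = rng.choice(sorted(candidates))
                neighbors[m].add(mn)
                neighbors[mn].add(m)
                edge_count += 1
    return neighbors, edge_count


# A constant P = 0.5 is used purely as a placeholder.
graph, edges = simulate_tag_network(200, lambda n, e: 0.5, seed=1)
```

When no unconnected second-order neighbor exists, the sketch simply skips the step; the source does not say how that case is handled. The callable must return a positive probability often enough for the loop to terminate.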
Through observation of the growth of the numbers of nodes and edges in the actual network, we found that the growth of the number of edges was mainly affected by the numbers of existing nodes and edges in the current network. The probability P should therefore be correlated with the density of the current network. Accordingly, we constructed a probability $P_{t+1}$, the probability of adding a new node at step t + 1, as a function of the current network with parameters a and b (formula (2)). In this study, the values of parameters a and b in formula (2) are obtained by fitting the actual data. We create a loss function $f_{loss}$ that measures the difference between the number of edges in a network simulated with parameters a and b and the number of edges in the actual network of the same size. $f_{loss}$ is given in formula (3), where t is the index of each two-month period, $E_t$ is the actual number of edges at time t, $E'_t$ is the number of edges in the simulated network at time t, and $\log E_t$ is used as the denominator to balance out the impact of network size growth. The smaller the loss, the closer the simulated network is to the actual network in its connections, so the optimal values of a and b can be obtained by minimizing it. Here we limit the range of a to [1, 15] and the range of b to [−10, 10], traverse a with a step length of 0.001, and use the bisection method to find the optimal b for each a.
Furthermore, the degree of large-degree nodes in the tag network does not grow without bound. Therefore, when the degree of a concept reaches a certain threshold during model construction, its advantage in the degree-based preferential attachment mechanism needs to be weakened. Here, we assumed that when the degree of a node reaches the threshold H, then with probability $p_h$ the node does not gain degree when the current selection probability is calculated.
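The parameter search can be sketched in outline. Two assumptions are made here: the loss is taken to be the sum over t of the absolute edge-count difference divided by $\log E_t$, which matches the description but may not be the exact formula (3), and the bisection assumes that a signed error is monotone in b and changes sign on the search interval. The function names and the toy error function are hypothetical.

```python
import math


def edge_loss(actual_edges, simulated_edges):
    # Assumed form: sum over t of |E_t - E'_t| / log(E_t); requires E_t > 1.
    return sum(abs(e - s) / math.log(e)
               for e, s in zip(actual_edges, simulated_edges))


def bisect_b(signed_error, low=-10.0, high=10.0, tol=1e-6):
    """Find b in [low, high] where signed_error(b) crosses zero.

    Assumes signed_error is monotone in b and has opposite signs at the ends.
    """
    error_low = signed_error(low)
    while high - low > tol:
        mid = (low + high) / 2
        error_mid = signed_error(mid)
        if (error_mid > 0) == (error_low > 0):
            # Same sign as the lower end: the crossing lies above mid.
            low, error_low = mid, error_mid
        else:
            high = mid
    return (low + high) / 2


# Toy stand-in for "simulated minus actual edges" as a function of b.
best_b = bisect_b(lambda b: b + 0.19)
```

Traversing a over [1, 15] with a step of 0.001 gives about 14,000 outer iterations, each of which would run one such bisection over b and keep the pair (a, b) with the smallest loss.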
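One possible reading of the threshold rule is sketched below: when selection weights are computed, a node whose degree has reached H is excluded from that draw with probability $p_h$. This interpretation, the function name `attachment_weights`, and the toy degrees are assumptions; the source text leaves the exact rule open.

```python
import random


def attachment_weights(degrees, threshold, p_h, rng):
    """Selection weights for degree-based attachment with saturation.

    A node whose degree has reached `threshold` gets weight 0 with probability
    `p_h` in this draw; every other node keeps a weight equal to its degree.
    """
    weights = {}
    for node, degree in degrees.items():
        saturated = degree >= threshold and rng.random() < p_h
        weights[node] = 0 if saturated else degree
    return weights


rng = random.Random(42)
degrees = {"life": 2500, "movie": 1800, "law": 40}  # Toy degrees, not measured values.
weights = attachment_weights(degrees, threshold=2000, p_h=0.69, rng=rng)
```

The resulting weights could be passed to `random.choices` in place of raw degrees, so that saturated nodes are chosen less often than pure preferential attachment would imply.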

Simulation Results.
By constructing the model mechanism proposed in Section 4.1, this study simulated the network generation process in which the network grew from an initial 1 node to 74,761 nodes. In this model, we fixed the parameters that determine the probability P as $a = 5.51$ and $b = −0.19$. In addition, given the situation of the tag network, we applied the degree threshold values of $H = 2000$ and $p_h = 0.69$. Figure 9 shows the growth of the numbers of nodes and edges in the simulated network. The horizontal and vertical coordinates are the numbers of nodes and edges, respectively. As the time steps pass, the numbers of nodes and edges in the double logarithmic coordinate can be fitted by a straight line. The fitting slope was 1.68, the intercept was −2.01, and the goodness of fit was $R^2 = 0.96$. Thus, the power function relationship ($e(t) \propto n(t)^{\alpha}$) between the numbers of nodes and edges was maintained in the simulated network, and the fitting slope and intercept were relatively close to the fitting results of the actual network. Figure 10 displays the degree distribution of the simulated network, which is similar to the degree distribution image of the real network. In the double logarithmic coordinate, the degree distribution of the simulated network was a truncated power-law distribution with a heavy tail (see "Heterogeneous cooperative leadership structure emerging from random regular graphs," Chaos, vol. 29, 103103, 2019). When we fitted the degree distribution (see Figure 11), the degree distribution of the simulated network was closer to the truncated power-law distribution (red line). In addition, the fitting parameters $\alpha = 2.07$ and $\lambda = 0.0003$ were very close to the actual fitting parameters of the tag network (see Figure 12).
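The straight-line fits in double logarithmic coordinates, reported for both the actual and the simulated network, amount to an ordinary least squares regression of log edges on log nodes. The sketch below shows that calculation on synthetic data; the use of base-10 logarithms and the function name `loglog_fit` are assumptions, and the numbers are generated for illustration rather than taken from the Zhihu data.

```python
import math


def loglog_fit(nodes, edges):
    """Ordinary least squares of log10(edges) on log10(nodes).

    Returns (slope, intercept, r_squared).
    """
    xs = [math.log10(n) for n in nodes]
    ys = [math.log10(e) for e in edges]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Synthetic counts that follow e = 0.01 * n^1.66 exactly.
nodes = [100, 1000, 10000, 70000]
edges = [0.01 * n ** 1.66 for n in nodes]
slope, intercept, r2 = loglog_fit(nodes, edges)
```

The slope of the fitted line is the densification exponent $\alpha$ in $e(t) \propto n(t)^{\alpha}$; a slope above 1 means edges grow faster than nodes.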

Conclusion
Despite the enormous and recent interest in large-scale network data and the range of interesting patterns identified for static snapshots of graphs, relatively little work has been conducted on the properties of the time evolution of real graphs [29]. This paper introduced the knowledge tag network characteristics of an online knowledge-sharing platform and simulated its generation with a model based on the classic BA model.
First, the results showed that the tag network exhibits very rapid growth in scale, but it does not fragment as the number of nodes grows rapidly. On the contrary, the effective diameter and clustering coefficient of the network showed a slowly declining trend. The edges of the network were very dense, and the numbers of nodes and edges of the network showed a relationship close to a power function over time, which indicates that the tag network followed the densification law.
Second, the degree distribution of the tag network followed a truncated power-law distribution. The link mechanism in tag networks also followed the "rich get richer" preferential attachment mechanism. In the tag network, the edges among nodes imply that those knowledge concepts are semantically related, and the nodes with large degree are often generalized and broad concepts. Therefore, with the development of tag networks, the degree advantage in the edge addition stage will weaken, which explains why the degree distribution of the knowledge tag network was closer to the truncated power-law distribution than to the power-law distribution. The fitting results showed that the degree distribution of the tag network fitted the truncated power-law distribution with $\alpha = 2.12$ and $\lambda = 0.0003$. Then, this study proposed a network generation model applicable to tag networks. The model is based on the BA model with the addition of an edge linkage mechanism between old nodes, which is more consistent with the actual process of knowledge building and allows the model to generate a dense network structure. A truncated power-law fit of the node degree distribution of the simulated network was obtained with $\alpha = 2.07$ and $\lambda = 0.0003$, which were close to those of the real network degree distribution. Therefore, the simulation model proposed in this study can explain the growth mechanism of the real tag network to a certain extent.
Finally, this study investigated the mechanism of tag network generation in online knowledge platforms, and this work will help deepen our knowledge and understanding of the online knowledge building process. The model proposed in this study can be adapted to current online knowledge-sharing platforms by adjusting the model parameters. The model uses the probability P to balance edge addition against node addition. The parameters of P come from fitting actual data, which means that the function is highly extensible and can be fitted to the historical data of different platforms. In addition, in the process of searching for these parameters, the algorithm uses the idea of binary search to reduce the time complexity significantly while maintaining the accuracy of the results. Even large-scale network data can be processed in a relatively brief time. These characteristics make the model useful for adapting to different data platforms. It can be used to predict the future growth scale of tag networks based on the current state of the network and to provide a reasonable reference for future knowledge platform construction.
In the future, research on the generation mechanism of the tag network should be expanded along two dimensions. One is to broaden the research platforms and research objects. The other is to use the simulated network as the basic framework for research on network structure and network efficiency as shaped by the generation mechanism. The research platforms and research objects should not be limited to knowledge tags and keywords. Topics and many other kinds of text content also have research value. In addition, future research could pay attention to information communication effects or other problems combined with the network generation mechanism, which could give an in-depth understanding of the relationship between network structure and network function [37].

Data Availability
The data used to support the findings of this study are available from the corresponding author upon reasonable request.