Coword and Cluster Analysis for the Romance of the Three Kingdoms

The Romance of the Three Kingdoms (RTK) is a classical Chinese historical novel by Luo Guanzhong. This paper establishes a research framework for analyzing the novel using coword and cluster analysis. First, we segment the full text of the novel and extract the names of the historical figures in it. Based on coword analysis, a social network of these figures is constructed. We then calculate several network features and perform cluster analysis. In addition, a modified clustering method using edge betweenness is proposed to improve the clustering result. Finally, both quantitative and visualized results are presented to support our approach.


Introduction
The Romance of the Three Kingdoms, written by Luo Guanzhong, is generally considered to be one of the four great classical novels in Chinese literature. It describes the turbulent years from the end of the Han dynasty to the Three Kingdoms (Wei, Shu, and Wu) era in Chinese history. More than 1000 characters are vividly portrayed in the historical novel.
In this research, the text of the original novel is divided into sentences. According to coword analysis, two words that appear in the same document have a certain intrinsic relationship. We therefore count how often two names cooccur in a sentence. Each character name is treated as a node and each cooccurrence as a link, so that an undirected network can be established. Various network features are then computed to analyze the relationships among characters in the novel. Cluster analysis is employed to explore the hierarchical structure of RTK. Finally, an improved clustering algorithm that cuts high-betweenness edges is proposed, which achieves a higher F score than the standard approach.
This manuscript is organized as follows. Section 2 reviews related work. Data preparation is discussed in Section 3. Sections 4 and 5 present the network feature analysis, cluster analysis, experiments, and the analysis of results. Conclusions are drawn in Section 6.

Related Work
Early research on the RTK concentrated on qualitative analysis of aspects such as writing style, genealogy, and characters. Later, quantitative approaches were adopted to analyze the novel. Coword analysis is one important method of this kind; it was first devised by French scholars and introduced into the information science field by Callon [1]. According to the theory of coword analysis, two words that appear in the same sentence are closely connected, and more cooccurrences of the two words indicate a closer relationship between them. In this paper, we consider the cooccurrence of character names in a sentence of the RTK novel.
Numerous studies of literature analysis have been based on coword analysis. Ravikumar et al. [2] inspect 959 articles in scientometrics using the coword analysis approach and find that the topics of publications are shifting to new themes. For the medical literature, one study uses this tool to process publications over a span of thirty years [3]. Another work focuses on past themes and future trends in medical tourism research [4]. Employing coword analysis, some researchers attempt to identify the themes and trends of the main knowledge areas, including engineering, health, public administration, and management [5]. Moreover, a coword network has been established to analyze the relationships of characters in the Dream of the Red Chamber [6]. Wang et al. build a similar network for the Romance of the Three Kingdoms [7].
After a social network is created based on coword analysis, cluster analysis is carried out with a hierarchical clustering algorithm. Two types of algorithm are commonly used to build the hierarchy. The divisive approach treats all data as one cluster and performs successive splits; it is used in many studies [8]. By contrast, agglomerative hierarchical clustering is a bottom-up method with many variants [9]. It merges the two most similar clusters at each step. The agglomerative method is used in this work because it provides a visual expression of the clustering results.

Data Preparation
3.1. Building RTK Corpus and Preprocessing. Because many copies of the novel can be downloaded from the Internet, we selected a high-quality text document (https://72k.us/file/22215238-408791478) in Chinese characters and established the RTK corpus by cleaning the original data. Words containing errors were corrected, and erroneous punctuation was removed manually.
The raw text is preprocessed using the natural language processing toolkit ICTCLAS (http://ictclas.nlpir.org/). We acquired a name list of RTK characters from the Internet and added it to the dictionary of ICTCLAS. Lexical analysis is then executed to segment the Chinese sentences into words, among which the names of characters can be found.

3.2. Creation of Character Name Network. Based on coword analysis, an undirected network of character names can be created by counting the cooccurrences of two names in sentences. We treated a character's full name, courtesy name, and abbreviated name as one name. For example, "Cao Cao," "Cao Mengde," and "Mengde" all refer to the single person "Cao Cao." The final constructed network of character names has 1,133 nodes and 5,844 links. As depicted in Figure 1, the size of a node indicates how often the character name occurs in the novel, and the thickness of a link corresponds to how often the two characters appear together.
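The construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the alias table, the function name build_network, and the toy sentences are hypothetical, and we assume that each unordered pair of distinct characters is counted once per sentence, which the text does not state explicitly.

```python
from collections import Counter
from itertools import combinations

# Hypothetical alias table: every name variant maps to one canonical name.
ALIASES = {
    "Cao Cao": "Cao Cao",
    "Cao Mengde": "Cao Cao",
    "Mengde": "Cao Cao",
    "Liu Bei": "Liu Bei",
    "Xuande": "Liu Bei",
}

def build_network(segmented_sentences, aliases):
    """Return (node_counts, edge_counts) from sentences given as token lists."""
    node_counts = Counter()   # how often each character is mentioned
    edge_counts = Counter()   # how often each pair shares a sentence
    for tokens in segmented_sentences:
        names = [aliases[t] for t in tokens if t in aliases]
        node_counts.update(names)
        # Assumption: a pair counts once per sentence, however often it repeats.
        for a, b in combinations(sorted(set(names)), 2):
            edge_counts[(a, b)] += 1
    return node_counts, edge_counts

# Toy input standing in for the segmenter output.
sentences = [
    ["Mengde", "met", "Xuande"],
    ["Cao Cao", "and", "Liu Bei", "drank", "wine"],
    ["Cao Mengde", "laughed"],
]
nodes, edges = build_network(sentences, ALIASES)
```

In this toy corpus the node weight of "Cao Cao" is 3 and the link between "Cao Cao" and "Liu Bei" has weight 2, corresponding to node size and link thickness in a drawing of the network.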

Network Feature Analysis
4.1. Degree Distribution. As the degree of a node is the number of links adjacent to it, the degree distribution is the probability distribution of these degrees. A power index γ can be used to describe the curve if the network's degree distribution follows a power-law distribution.
Wireless Communications and Mobile Computing
For the network of RTK characters, the ten characters with the highest degree are Cao Cao, Liu Bei, Zhuge Liang, Sun Quan, Zhao Yun, Guan Yu, Yuan Shao, Sima Yi, Lv Bu, and Wei Yan. The average degree of the network is 10.31, and the degree distribution is illustrated in Figure 2. It appears to be a heavy-tailed distribution (see Figure 2(a)). As the data can be approximated with the linear function y = −1.2864x + 2.5654 on a log-log scale in Figure 2(b), we conclude that the degree distribution follows a power-law distribution.
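A log-log fit of this kind can be reproduced with ordinary least squares. The sketch below is illustrative only: the function names are hypothetical, base-10 logarithms are an assumption, and the y values are taken as fractions of nodes, whereas the source does not state whether Figure 2(b) plots counts or probabilities (the choice shifts only the intercept, not the slope).

```python
import math
from collections import Counter

def degree_distribution(edges):
    """Map each degree k to the fraction of nodes that have degree k."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    n = len(degree)
    histogram = Counter(degree.values())
    return {k: count / n for k, count in histogram.items()}

def loglog_fit(dist):
    """Least-squares line through (log10 k, log10 P(k)); returns (slope, intercept)."""
    xs = [math.log10(k) for k in dist]
    ys = [math.log10(p) for p in dist.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Toy star graph: one hub of degree 3 and three leaves of degree 1.
star = degree_distribution([("a", "b"), ("a", "c"), ("a", "d")])

# Synthetic distribution that lies exactly on a line of slope -2.
slope, intercept = loglog_fit({1: 0.1, 10: 0.001, 100: 0.00001})
```

The negated slope of the fitted line plays the role of the power index γ.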

4.2. Average Shortest-Path Length. The shortest path between two nodes is a path on which the number of links is minimized. Accordingly, the length of the shortest path is the number of links that the path contains. The sum of all shortest-path lengths divided by the number of node pairs is the average shortest-path length. The average shortest-path length of the RTK network is 3.1743. Hence, one character can be connected to another in about three steps on average, which means any two characters are separated by roughly "three degrees." The length of the longest shortest path in the network is called the diameter. In this paper, the RTK network's diameter is 9. One path of this length runs from Liu Ai to Zhang Shang: Liu Ai, Wang Li, Dong Zhao, Cao Hong, Cao Cao, Sima Yan, Yang Hu, Du Yu, Lu Jing, and Zhang Shang. The distribution of the shortest-path length between any two characters is illustrated in Figure 3. According to the figure, 47.63% of the shortest paths in the RTK network have length 3, and about 92.15% have a length between 2 and 4.
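For an unweighted network, both quantities can be obtained by running a breadth-first search from every node. The following sketch assumes the network is given as an adjacency dictionary and averages over reachable pairs only; the function names and the toy graph are illustrative.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def path_statistics(adj):
    """Return (average shortest-path length, diameter) over reachable pairs."""
    total, pairs, diameter = 0, 0, 0
    for s in adj:
        for t, d in bfs_distances(adj, s).items():
            if t != s:
                total += d      # each unordered pair is visited twice,
                pairs += 1      # so the ratio total / pairs is unaffected
                diameter = max(diameter, d)
    return total / pairs, diameter

# Toy chain a - b - c - d.
chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
avg_length, diam = path_statistics(chain)
```

For the four-node chain the six pairwise distances sum to 10, so the average is 10/6 and the diameter is 3.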

4.3. Clustering Coefficient. A clustering coefficient [10,11] measures the extent to which a network's nodes tend to cluster together. The clustering coefficient of node x is given by
$$C_x = \frac{E_x}{\frac{1}{2}k_x(k_x - 1)},$$
where $E_x$ is the number of existing links among the neighbors of node x. As $k_x$ is the degree of node x, $\frac{1}{2}k_x(k_x - 1)$ represents the number of potential links among node x's neighbors. The average of $C_x$ over all nodes is the clustering coefficient of the whole network.
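The definition translates directly into code. The sketch below assumes an adjacency dictionary of neighbor sets and, as a convention not stated in the source, assigns a coefficient of 0 to nodes with fewer than two neighbors.

```python
from itertools import combinations

def local_clustering(adj, x):
    """Existing links among x's neighbors divided by the possible k(k-1)/2 links."""
    k = len(adj[x])
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    existing = sum(1 for u, v in combinations(adj[x], 2) if v in adj[u])
    return 2 * existing / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local coefficients over all nodes."""
    return sum(local_clustering(adj, x) for x in adj) / len(adj)

# Toy graph: triangle a-b-c plus a pendant node d attached to a.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
c_a = local_clustering(toy, "a")
c_avg = average_clustering(toy)
```

Node a has three neighbors with one link among them, giving 1/3; the network average is (1/3 + 1 + 1 + 0)/4 = 7/12.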
A random network is produced by an Erdős-Rényi (ER) model using the same number of nodes and links as the RTK network. The comparison between the random network and the RTK network is shown in Table 1. The RTK network is a small-world network because it has a larger clustering coefficient as well as a smaller average shortest-path length compared with a random network.
Figure 3: Distribution of shortest-path length.
We choose the characters who clearly belong to the three groups of Wei, Shu, and Wu and calculate the network features of the three kingdoms, respectively. The results are summarized in Table 2.
The character relationship networks within the three groups have high clustering coefficients and small average shortest-path lengths. Consequently, all three subnetworks are "small-world" networks. From Shu to Wu and Wei, the density and clustering coefficient of the subnetworks decrease in sequence, except for the clustering coefficient of Wu. By contrast, the average shortest-path length and diameter increase successively. This reflects a decrease in the closeness of the connections within the groups. In other words, the connections among characters in Wei are less close than those in Wu and Shu.

4.4. Density. The density of a network is the proportion of all possible links that are actual connections. It can be calculated by formula (3), where N and E are the numbers of nodes and links:
$$D = \frac{2E}{N(N - 1)} \tag{3}$$
The value is a fraction between 0 and 1. As the density of the RTK network is 0.0091, it is a sparse network.
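As a quick arithmetic check, the reported density can be recomputed from the node and link counts given in Section 3. The sketch assumes that the density is the usual undirected ratio 2E/(N(N − 1)) and that the average degree is 2E/N; the function names are illustrative.

```python
def density(n_nodes, n_links):
    """Fraction of the N(N-1)/2 possible undirected links that are present."""
    return 2 * n_links / (n_nodes * (n_nodes - 1))

def average_degree(n_nodes, n_links):
    """Each link contributes to the degree of two nodes."""
    return 2 * n_links / n_nodes

# Counts reported for the RTK network: 1,133 nodes and 5,844 links.
rtk_density = density(1133, 5844)
rtk_avg_degree = average_degree(1133, 5844)
```

With these counts the density rounds to 0.0091 and the average degree is approximately 10.316, in agreement with the reported values of 0.0091 and 10.31.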
4.5. Centrality. Centrality measures the importance of nodes and includes degree centrality, betweenness centrality, and closeness centrality. Degree centrality is based on the degree of a node: a high-degree node is a local center within the network. Betweenness centrality expresses the extent to which a node falls on the shortest paths between other pairs of nodes. A node with a high betweenness is capable of controlling the interactions between two nonadjacent nodes [5]. Closeness centrality is based on the average shortest distance from a node to every other node. It evaluates how close a node is to all the other nodes [3].
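Degree and closeness centrality can be sketched as follows. The normalizations used here (degree divided by N − 1, and the number of reachable nodes divided by the sum of distances to them) are common conventions and are assumptions of this sketch; the source does not specify which normalization produced the values in Table 3.

```python
from collections import deque

def degree_centrality(adj):
    """Degree of each node divided by the maximum possible degree N - 1."""
    n = len(adj)
    return {v: len(adj[v]) / (n - 1) for v in adj}

def closeness_centrality(adj):
    """Inverse of the average shortest distance from each node to the nodes it reaches."""
    result = {}
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total = sum(dist.values())
        reached = len(dist) - 1
        result[s] = reached / total if total else 0.0
    return result

# Toy star graph: hub a connected to b, c, and d.
star = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a"}, "d": {"a"}}
deg = degree_centrality(star)
clo = closeness_centrality(star)
```

In the star graph the hub scores 1.0 on both measures, while each leaf has degree centrality 1/3 and closeness 3/5.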
Three centralities of characters in the RTK network are calculated, respectively. Table 3 gives the top ten characters of the highest centrality. The value of centrality is listed in parentheses. From Table 3, we can find eight names listed in three centralities: Cao Cao, Liu Bei, Zhuge Liang, Sun

Cooccurrence and Similarity Matrix. The cooccurrence matrix measures how frequently two characters appear together. A cooccurrence matrix of the main characters in the RTK network is presented in Table 4. It is a symmetric matrix, and the entries on the diagonal show how often each character appears in the text. The raw cooccurrence of two characters cannot be used as a similarity because it is strongly affected by the frequency of each character. We therefore normalize the cooccurrence matrix using the Ochiai coefficient [12] and obtain the similarity matrix. The Ochiai coefficient is defined by
$$K(A, B) = \frac{n(A \cap B)}{\sqrt{n(A)\, n(B)}},$$
where A and B are sets, $n(A)$ is the number of elements in A, and $n(A \cap B)$ is the number of cooccurrences. The similarity matrix calculated with the Ochiai coefficient is given in Table 5.
Algorithm. An agglomerative hierarchical clustering algorithm using the Ochiai similarity matrix is implemented to perform the cluster analysis. It is a bottom-up approach. Initially, each node is treated as a single cluster. At each step, the two clusters with the largest Ochiai similarity are combined into a new, bigger cluster. The algorithm stops when it reaches a preset threshold or when only one cluster is left. The similarity between two clusters is defined as the average similarity between each of their nodes.
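The normalization and the merging loop can be sketched together. This is an illustrative sketch rather than the authors' code: the function names and the toy counts are hypothetical, and the stopping threshold is interpreted here as a minimum average similarity (min_similarity), which is only one possible reading of the stopping rule.

```python
import math
from itertools import combinations

def ochiai(cooc, freq, a, b):
    """Ochiai coefficient: cooccurrences over the geometric mean of frequencies."""
    return cooc.get(frozenset((a, b)), 0) / math.sqrt(freq[a] * freq[b])

def agglomerate(names, sim, min_similarity=0.0):
    """Average-linkage agglomerative clustering; returns (clusters, merges)."""
    clusters = [frozenset([n]) for n in names]
    merges = []

    def linkage(c1, c2):
        # Average similarity over all cross-cluster pairs of nodes.
        return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > 1:
        c1, c2 = max(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        score = linkage(c1, c2)
        if score < min_similarity:  # assumed reading of the stopping threshold
            break
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
        merges.append((c1, c2, score))
    return clusters, merges

# Hypothetical counts for four characters A, B, C, and D.
freq = {"A": 4, "B": 4, "C": 9, "D": 9}
cooc = {frozenset("AB"): 3, frozenset("CD"): 6, frozenset("AC"): 1}

def similarity(a, b):
    return ochiai(cooc, freq, a, b)

clusters, merges = agglomerate(["A", "B", "C", "D"], similarity, min_similarity=0.1)
```

In the toy data A and B merge first (similarity 0.75), then C and D (about 0.67); the two resulting clusters have an average similarity of about 0.04, which is below the threshold, so merging stops with two clusters. The recorded merges are what a dendrogram would be drawn from.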

Evaluation. The P-IP scores [13] are adopted to measure the clustering result. There are m character names and n clusters. Suppose $C_{ij}$ is the number of character names marked with label j for character name i, where $j = \arg\max_{k = 1, 2, \ldots, n} \{C_{ik}\}$. The precision and recall of character name i are computed from these counts, and the F score is calculated from the precision and recall. The overall precision, recall, and F score are the averages of the corresponding values. Moreover, the gold standard is built by marking each character name with a specific kingdom tag. For example, Cao Cao is tagged with "Wei" and Liu Bei is tagged with "Shu." Finally, 308 character names with definite kingdom tags are retained for cluster analysis.

Clustering Result. The result of hierarchical clustering is presented in Table 6. The F score achieves its best value of 79.89% when the number of clusters k is 15.

Improved Clustering Algorithm. In the RTK network, some characters play a vital role in the interconnections between different kingdoms, like "Lu Su" between Wu and Shu and "Huang Gai" between Wu and Wei. These characters have a high betweenness according to the definition of betweenness (see Section 4.5). Further, node betweenness can be extended to "edge betweenness" [14]. A link with a high edge betweenness is often a bridge between different clusters (see the red link in Figure 4). Therefore, removing these high-betweenness links by setting their similarity to 0 reduces the intercluster similarity and eventually improves the clustering result. The removal operation can be introduced as a preprocessing step before the cluster analysis.
The improved clustering algorithm using edge betweenness is executed, and the result is displayed in Table 7. When the number of removals is zero, the result is the baseline of the original algorithm. With a suitable number of removals, the F score reaches a peak of 80.87%. Nevertheless, removing too many links destroys the structure of the whole network and makes the F score decline dramatically (see Figure 5).
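The preprocessing step can be sketched as follows, under stated assumptions: edge betweenness is computed on the unweighted network with a Brandes-style accumulation, it is computed once rather than recomputed after each removal, and the similarity matrix is stored as a dictionary keyed by unordered pairs. The function names and the toy graph are illustrative, and the source does not specify these details.

```python
from collections import deque

def edge_betweenness(adj):
    """Number of shortest paths crossing each link (unweighted, undirected)."""
    score = {frozenset((u, v)): 0.0 for u in adj for v in adj[u]}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        dist = {s: 0}
        sigma = {v: 0 for v in adj}
        sigma[s] = 1
        preds = {v: [] for v in adj}
        order = []
        queue = deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        # Accumulate path shares from the farthest nodes back toward s.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for u in preds[w]:
                share = sigma[u] / sigma[w] * (1.0 + delta[w])
                score[frozenset((u, w))] += share
                delta[u] += share
    # Every unordered pair of endpoints was counted from both sides.
    return {edge: value / 2.0 for edge, value in score.items()}

def cut_bridges(similarity, adj, removals):
    """Copy of similarity with the highest-betweenness links set to 0."""
    ranked = sorted(edge_betweenness(adj).items(), key=lambda kv: kv[1], reverse=True)
    pruned = dict(similarity)
    for edge, _ in ranked[:removals]:
        pruned[edge] = 0.0
    return pruned

# Toy graph: triangles a-b-c and d-e-f joined by the bridge c-d.
toy = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
eb = edge_betweenness(toy)
sims = {edge: 0.5 for edge in eb}  # hypothetical uniform similarities
pruned = cut_bridges(sims, toy, removals=1)
```

In the toy graph all nine shortest paths between the two triangles cross the link c-d, so it has the highest edge betweenness and is the one whose similarity is set to 0. The pruned similarities would then be passed to the clustering step in place of the original matrix.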

Analysis. Data visualization is also used to display the characteristics of historical figures in the RTK network. As hierarchical clustering can be depicted as a tree-based dendrogram, we visualize the character relationships in the RTK novel from Chapter 43 to Chapter 50, a period describing "the battle of Red Cliffs" (see Figure 6).
As can be seen from Figure 6, the dendrogram can be manually divided into six parts. H1 and H3 are groups containing characters from "Wu," like Sun Quan and Sun Ce. H2 encompasses the main characters from "Shu" and "Wu" in the battle of Red Cliffs: Liu Bei, Guan Yu, Zhuge Liang, Zhou Yu, Lu Su, etc. However, it also contains two exceptions, Cao Cao and Cheng Yu, because they are highly connected with the other main characters in the battle of Red Cliffs. Further, H1, H3, and H2 merge into a bigger cluster in the hierarchical clustering because these characters are from the alliance of "Wu" and "Shu" against Cao's army.
On the other hand, H5 is composed of characters from the large group "Wei," including Xiahou Dun, Xiahou Yuan, Cao Ren, and Cao Hong. H6 includes a few characters from "Shu" or "Wu." H4 is not a cluster, and it contains a number of characters from different kingdoms.

Conclusions
This paper developed a general framework for analyzing the character relationships in a novel. The Romance of the Three Kingdoms is taken as the object of analysis. First, the raw text of the RTK novel is processed with NLP tools, and character names are recognized by lexical analysis. Then, a character name network is created based on coword analysis. After the network is built, several network features are calculated, such as degree distribution, average shortest-path length, and clustering coefficient. In addition, cluster analysis is conducted, which helps to better understand the hierarchical structure of the characters in the RTK novel. A modified clustering algorithm using edge betweenness is proposed to improve the clustering result. Finally, the results are visualized to analyze the hierarchical clustering.
The proposed method has some limitations, since coword analysis does not necessarily reflect the true meaning of a character relationship. However, our approach can study the main characters quantitatively and examine character relationships from another perspective. Hence, it is a valuable research direction. Subsequent work will study the meaning of pronouns because they represent different characters in different situations. Further, place names and institutions will be taken into consideration in the future.

Data Availability
The original dataset used in this work is available from the corresponding author on request.