A Study on Differences between Simplified and Traditional Chinese Based on Complex Network Analysis of the Word Co-Occurrence Networks

Currently, most work comparing simplified and traditional Chinese focuses only on the character or lexical level and does not take global differences into consideration. To address this problem, this paper proposes to use complex network analysis of word co-occurrence networks, which has been successfully applied in language analysis research and can capture global characteristics, to explore the differences between simplified and traditional Chinese. Specifically, we first constructed word co-occurrence networks for simplified and traditional Chinese using selected news corpora. Then, complex network analysis methods, including network statistics analysis, kernel lexicon comparison, and motif analysis, were applied to gain a global understanding of these networks. After that, the networks were compared based on the properties obtained. The comparison yields three interesting results. First, the co-occurrence networks of simplified Chinese and traditional Chinese are both small-world and scale-free networks; however, given the same corpus size, the co-occurrence networks of traditional Chinese tend to have more nodes, which may be due to the large number of one-to-many character/word mappings from simplified Chinese to traditional Chinese. Second, since traditional Chinese retains more ancient Chinese words and uses fewer weak verbs, the traditional Chinese kernel lexicons have more entries than the simplified Chinese kernel lexicons. Third, motif analysis shows no difference between the simplified Chinese network and the corresponding traditional Chinese network, which indicates that simplified and traditional Chinese are semantically consistent.


Introduction
Chinese is usually written in two forms: simplified Chinese (mainly used in Mainland China and Singapore) and traditional Chinese (mainly used in Hong Kong, Macao, and Taiwan). Although simplified Chinese is derived from traditional Chinese, the two systems are quite different on various levels, such as character set, encoding method, orthography, vocabulary, and semantics, which create barriers to communication between different areas where Chinese is spoken.
This linguistic phenomenon is due to the independent development of these two homologous systems over the past half century, and they will continue to evolve in their respective cultural environments. However, in the past few decades, with the increase in exchange activities among the four cross-strait regions, the problem of conversion between simplified and traditional Chinese, as well as the comparison of their differences, has attracted the attention of more and more researchers [1][2][3][4]. In short, the comparison between simplified and traditional Chinese has important reference value for the study of language evolution. So far, research comparing these two forms of Chinese still focuses on the character or lexical level [1,3,5]. For example, Fei [6] made a systematic comparison of the similarities and differences between the characters currently used in simplified and traditional Chinese; Li [7] made an in-depth analysis of the reasons for the differences in the forms of simplified and traditional Chinese characters from the aspects of politics, history and culture, and the principles of character selection; Liu [8] conducted a comprehensive analysis mainly from the perspective of eliminating the differences in form; Jiang [9] compared and analyzed simplified and traditional Chinese vocabulary mainly from two aspects: words with the same form but different meanings, and words with different forms but the same meaning; Li and Qiu [10] discussed the causes, types, and processing methods of differences in dictionaries across the Taiwan Strait.
Language is a typical hierarchical system with a highly complex network structure, and complex network analysis methods have the advantage of revealing the laws of language as a whole. Hence, in this paper, we apply complex network analysis methods to explore the differences between the simplified and traditional Chinese character systems from a holistic perspective. Specifically, following the construction method of the word co-occurrence network, this paper constructs simplified Chinese and traditional Chinese word co-occurrence networks with different numbers of nodes and different corpus sizes and then studies the complex characteristics of these networks. Through the resulting simplified and traditional Chinese core dictionaries (kernel lexicons), we explore the differences between the two systems. In addition, this paper uses motifs, which reflect language semantics, to analyze the semantic differences between simplified and traditional Chinese. The rest of this paper is organized as follows. Section 2 introduces the related work. Section 3 gives a brief introduction to some basic concepts related to complex network analysis. Then, in Section 4, we construct networks with different text scales and study the corresponding characteristics of complex networks, e.g., cumulative degree distribution, clustering coefficient, kernel lexicon, and motif analysis. Finally, Section 5 concludes the paper.

Related Work
At present, the comparison and analysis of the differences between simplified and traditional Chinese mainly remain at the level of character shapes or words. The main reason why readers find it difficult to read unfamiliar written materials in simplified or traditional characters is the difference in glyphs. Studies have shown that the number of characters that can actually be compared in the simplified and traditional Chinese character lists is 4,786 [6]. Among the characters used in mainland China and Taiwan, 41% (1,947 characters) have the same glyph, 24% (1,170 characters) have similar glyphs, and 35% (1,669 characters) have different glyphs. Simplified and traditional Chinese share the same ancestor and developed from the same ancient Chinese. Therefore, the differences between them need to be compared and analyzed systematically and comprehensively from the perspective of the language as a whole, so as to explore the current state and laws of development of the two written forms. However, current comparative work on simplified and traditional Chinese has achieved notable results only at the level of character form and word, while other language levels (such as semantics and syntax) have not been addressed.
As a typical hierarchical system, language exhibits a highly complex network structure at all levels (phonetics, morphology, syntax, and semantics) [25]. Much research has been carried out on the complex characteristics of language networks at different levels, including lexical or vocabulary networks, word or character co-occurrence networks, syntactic networks, and semantic networks. These studies are important for identifying and understanding the topological structure of language. Research on Chinese networks mainly includes the following. In terms of morphology or vocabulary networks, Li et al. [13] used Chinese characters as nodes, linked two characters whenever they can form a word, constructed a Chinese phrase network, and studied its dynamic characteristics. In terms of syntactic networks, Liu [20] used a syntactically annotated treebank to connect words that have syntactic relations, established a Chinese syntactic dependency network, and explored its complex network characteristics. In terms of semantic networks, on which research for Chinese is still relatively scarce, Liu et al. [24] constructed a small semantic network to explore the complex characteristics of the Chinese semantic network. In terms of word co-occurrence networks, Cancho and Solé [14] used an English corpus to construct an English word co-occurrence network and found that it has small-world and scale-free features. Liu and Sun [15] used the same construction method to build a simplified Chinese word co-occurrence network; their experiment showed that it has complex network characteristics similar to those of the English word co-occurrence network.
Other works [12,26,27] used different construction strategies to build Chinese word co-occurrence networks and English word co-occurrence networks from Chinese and English corpora of different genres (prose, novels, popular science articles, and news reports).

Foundations
In this section, some basic concepts are introduced. Section 3.1 describes the basic definitions of complex networks. Then, Section 3.2 describes small-world networks and scale-free networks. Finally, Section 3.3 gives a brief introduction to motif analysis.
Computational Intelligence and Neuroscience

Basic Definitions.
In general, a network G can be denoted as a 2-tuple (V, E), where V is the set of vertices and E is the set of edges. In a language network, a vertex $v_i$ ($1 \le i \le |V|$) may represent a radical, character, or word, and an edge $e_{ij}$ ($1 \le i, j \le |V|$) characterizes the relationship between $v_i$ and $v_j$. Given a network, conventional indicators such as average path length, clustering coefficient, degree distribution, and cumulative degree distribution are used to specify its statistical characteristics. These indicators are defined as follows.
Average Path Length (d): the average distance between two reachable vertices:
$$d = \frac{2}{N(N-1)} \sum_{i<j} d_{ij},$$
where N is the number of vertices in the network and $d_{ij}$ is the distance between vertex $v_i$ and vertex $v_j$, i.e., the number of edges in the shortest path linking them.
Clustering Coefficient (C): the extent to which the neighbours of a vertex are themselves connected. The clustering coefficient of vertex i is defined as follows [23]:
$$C_i = \frac{2E_i}{k_i(k_i-1)},$$
where $k_i$ is the degree of vertex i and $E_i$ is the number of edges among the vertices in the nearest neighbourhood of vertex i. Moreover, the clustering coefficient of the whole network is the average of all individual $C_i$:
$$C = \frac{1}{N} \sum_{i=1}^{N} C_i.$$
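As an illustration of the two indicators, the following minimal Python sketch computes the average path length and the clustering coefficient of a small undirected graph stored as an adjacency dictionary. The function names and the toy graph are hypothetical, and the convention that a vertex with fewer than two neighbours contributes a coefficient of 0 is an assumption rather than something stated in the paper.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path distance over all reachable pairs of distinct vertices."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives shortest distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1  # reachable vertices other than the source
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average of C_i = 2 * E_i / (k_i * (k_i - 1)) over all vertices."""
    coefficients = []
    for vertex, neighbours in adj.items():
        k = len(neighbours)
        if k < 2:
            coefficients.append(0.0)  # assumed convention for degree < 2
            continue
        nbrs = list(neighbours)
        # E_i: number of edges among the neighbours of the vertex.
        links = sum(
            1
            for a in range(k)
            for b in range(a + 1, k)
            if nbrs[b] in adj[nbrs[a]]
        )
        coefficients.append(2 * links / (k * (k - 1)))
    return sum(coefficients) / len(coefficients)


# Toy graph: a triangle a-b-c with a pendant vertex d attached to c.
toy = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
print(average_path_length(toy))
print(clustering_coefficient(toy))
```

For the toy graph, the six vertex pairs have distances summing to 8, and the individual coefficients are 1, 1, 1/3, and 0.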

Small-World Networks and Scale-Free Networks.
A complex network is called a small-world network if the average number of edges lying between any two vertices is very small while the clustering coefficient remains large. Specifically, let $d_{ER}$ and $C_{ER}$ denote the average shortest path length and clustering coefficient of a comparable ER random network; a small-world network satisfies $d \approx d_{ER}$ but $C \gg C_{ER}$ [28]. The degree distribution describes how vertices are distributed by degree, where the percentage of vertices whose degree is k is denoted P(k). A network is called scale-free if its degree distribution fits a power law well,
$$P(k) \sim k^{-\gamma},$$
and the exponent $\gamma$ lies between 2 and 3 [29].
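The quantities in this subsection can be sketched in a few lines of Python. The helper names below are hypothetical. The ER baseline uses the common textbook approximations $d_{ER} \approx \ln N / \ln \langle k \rangle$ and $C_{ER} \approx \langle k \rangle / N$, which are an assumption here and not necessarily how the reference values in this paper were obtained.

```python
import math
from collections import Counter


def er_baseline(num_vertices, num_edges):
    """Approximate d_ER and C_ER for an ER random graph of the same size.

    Assumes an average degree greater than 1 so that the logarithm is positive.
    """
    avg_degree = 2 * num_edges / num_vertices
    d_er = math.log(num_vertices) / math.log(avg_degree)
    c_er = avg_degree / num_vertices
    return d_er, c_er


def degree_distribution(degrees):
    """P(k): fraction of vertices whose degree is exactly k."""
    n = len(degrees)
    return {k: count / n for k, count in sorted(Counter(degrees).items())}


def cumulative_degree_distribution(degrees):
    """Fraction of vertices whose degree is at least k, for each observed k."""
    n = len(degrees)
    counts = Counter(degrees)
    cumulative, running = {}, 0
    for k in sorted(counts, reverse=True):
        running += counts[k]
        cumulative[k] = running / n
    return dict(sorted(cumulative.items()))


# Illustrative numbers only, not taken from the paper's tables.
d_er, c_er = er_baseline(num_vertices=1000, num_edges=5000)
print(d_er, c_er)
print(degree_distribution([1, 1, 2, 3]))
print(cumulative_degree_distribution([1, 1, 2, 3]))
```

A measured network would then be judged small-world if its own d is close to `d_er` while its C is much larger than `c_er`.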

Motif Analysis.
Motif, a subgraph constructed from a few edges and vertices, was first used in biology [30]. For a complex network, a motif is a subnetwork containing a small number of nodes and edges. Biemann et al. [31] first applied motif analysis to linguistic networks, exploring the difference between natural language text and text generated by an n-gram language model in terms of semantic characteristics.
Motif analysis operates at an intermediate level of a network: it counts the motifs formed by n nodes in order to compare networks. For undirected co-occurrence networks, n is usually at least 3. However, 3-node motifs are triples, which are already fully covered when calculating the clustering coefficient. Therefore, we use 4-node motif analysis to compare the semantic differences of co-occurrence networks. All six kinds of undirected 4-node motifs are shown in Figure 1.
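To make the six undirected 4-node motifs concrete, the sketch below counts them by brute force on a small graph. It enumerates every set of four vertices, takes the induced subgraph, and classifies it by its sorted degree sequence, which is enough to distinguish the six connected 4-node graphs. The motif names are hypothetical labels (the ordering in Figure 1 is not assumed), and exhaustive enumeration is only practical for tiny graphs; it is not the procedure used for the networks in this paper.

```python
from collections import Counter
from itertools import combinations

# Sorted degree sequence of each connected 4-node induced subgraph.
MOTIF_BY_DEGREES = {
    (1, 1, 2, 2): "path",
    (1, 1, 1, 3): "star",
    (2, 2, 2, 2): "cycle",
    (1, 2, 2, 3): "tailed_triangle",
    (2, 2, 3, 3): "diamond",
    (3, 3, 3, 3): "clique",
}


def count_four_node_motifs(adj):
    """Count induced connected 4-node subgraphs by type (brute force)."""
    counts = Counter()
    for nodes in combinations(sorted(adj), 4):
        node_set = set(nodes)
        # Degree of each vertex inside the induced subgraph.
        degrees = tuple(sorted(len(adj[v] & node_set) for v in nodes))
        motif = MOTIF_BY_DEGREES.get(degrees)
        if motif is not None:  # disconnected subgraphs match no entry
            counts[motif] += 1
    return counts


# Toy graph 1: triangle a-b-c with a pendant vertex d attached to c.
paw = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
# Toy graph 2: a simple path a-b-c-d-e.
path5 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}

print(count_four_node_motifs(paw))
print(count_four_node_motifs(path5))
```

Disconnected 4-node subgraphs either contain a vertex of degree 0 or consist of two disjoint edges, so their degree sequences never match one of the six entries.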

Experimental Comparisons
This section addresses the experimental comparisons between simplified and traditional Chinese based on methods from complex network science. Section 4.1 describes the dataset used as well as the construction of the word co-occurrence networks. Then, Sections 4.2-4.4 describe the comparisons on small-world and scale-free properties, kernel lexicons, and motif analysis, respectively.

Dataset and Network Construction.
In this experiment, texts from Chinese GigaWord Third Edition (LDC2007T38) (https://catalog.ldc.upenn.edu/LDC2007T38) are used as the experimental materials, of which the simplified Chinese texts are from "Xinhua News Agency" (hereinafter referred to as XIN) and the traditional Chinese texts are from "Central News Agency" (hereinafter referred to as CNA).
Based on the datasets, word co-occurrence networks are built according to the method proposed by [32]. Concretely, words in the texts are regarded as nodes in the networks, and any two nodes are connected if the distance of the corresponding words is not greater than 2.
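The construction rule can be sketched as follows. This is a minimal illustration, assuming that the text has already been segmented into words and split into sentences, that edges are undirected, and that a word is never linked to itself. The function name and the sample sentence are hypothetical; the exact procedure of [32] may differ in such details.

```python
from collections import defaultdict


def build_cooccurrence_network(sentences, max_distance=2):
    """Link two words if they occur within max_distance positions of each other."""
    adj = defaultdict(set)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            adj[word]  # register the word as a node even if it gets no edge
            last = min(i + max_distance, len(tokens) - 1)
            for j in range(i + 1, last + 1):
                other = tokens[j]
                if other != word:  # assumed: no self-loops
                    adj[word].add(other)
                    adj[other].add(word)
    return dict(adj)


# One pre-segmented sentence (illustrative only).
sample = [["我", "喜欢", "自然", "语言"]]
network = build_cooccurrence_network(sample)
num_edges = sum(len(neighbours) for neighbours in network.values()) // 2
print(network)
print(num_edges)
```

With a maximum distance of 2, "我" is linked to "喜欢" and "自然" but not to "语言", which is three positions away.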
After the networks are constructed, their statistical properties are observed and compared. Note that only networks built from similar text scales are compared, which avoids the influence of text scale. In this experiment, three text scales are used, and the statistics of all the networks are shown in Table 1. For the co-occurrence networks of simplified and traditional Chinese words under the same corpus scale, we designed three sets of experiments. The scale of the corpus used in these three sets increases from an initial 7 million words to 10 million words and then to 15 million words.

Small-World and Scale-Free.
Given the built networks, we use a complex network analysis tool, Pajek, to calculate the statistical properties of the networks. Table 2 shows the results.

From Table 2, we can see that all the networks satisfy $d \approx d_{ER}$ and $C \gg C_{ER}$, which means that all the networks are small-world networks. However, it can also be observed that the average degrees of the traditional Chinese networks are about 5 higher than those of the corresponding simplified Chinese networks. A possible reason is the many-to-one mappings between traditional Chinese and simplified Chinese, i.e., different words in traditional Chinese have the same form in simplified Chinese. For example, the two traditional Chinese words "編制 (biān zhì)" and "編製 (biān zhì)" have the same form "编制 (biān zhì)" in simplified Chinese. These many-to-one mappings lead to larger numbers of nodes and edges and to larger average degrees in the traditional Chinese networks. Moreover, we plot the cumulative degree distributions of all the networks, as well as their fitting curves, in Figure 2. It is clear that both traditional and simplified Chinese networks fit the power law well. In addition, the power-law exponents of all the networks lie between 2 and 3, indicating that all of the networks are scale-free.
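A simple way to estimate a power-law exponent from a cumulative degree distribution is a least-squares line fit in log-log space, sketched below. This is an assumption about the fitting procedure, which the paper does not specify, and the function name is hypothetical. Note that if the degree distribution follows $P(k) \sim k^{-\gamma}$, the cumulative distribution decays roughly as $k^{-(\gamma-1)}$, so the slope measured on a cumulative curve has to be interpreted accordingly.

```python
import math


def fit_loglog_slope(points):
    """Least-squares fit of log(p) against log(k); returns the negated slope.

    points: iterable of (k, p) pairs with k > 0 and p > 0.
    """
    pairs = list(points)
    xs = [math.log(k) for k, _ in pairs]
    ys = [math.log(p) for _, p in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -(sxy / sxx)


# Synthetic points that follow p = k ** -1.5 exactly (not data from the paper).
synthetic = [(k, k ** -1.5) for k in (1, 2, 4, 8, 16)]
exponent = fit_loglog_slope(synthetic)
print(exponent)
```

On real degree data, a log-log regression is sensitive to the noisy tail, so the result should be read as a rough estimate.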

Kernel Lexicons.
By observing the cumulative degree distribution curves in Figure 2, we can see that the scattered points can be fitted by two lines with different slopes, and that the whole data set is divided into two parts at the crossover point. The more frequently a word is used in daily life, the more semantic meanings it may contain [33]. Moreover, the frequency f of a given word is related to its degree k. Following [15], we may obtain a kernel dictionary by sorting words according to their degrees and selecting those with higher degrees. Concretely, the capacity of a kernel lexicon is calculated as the product
$$N \cdot P(k_{cross}),$$
where N denotes the number of nodes, i.e., the number of words, $k_{cross}$ is the degree at the crossover point, and $P(k_{cross})$ denotes the percentage of words whose degrees are not less than $k_{cross}$. Table 3 shows the sizes of the constructed kernel lexicons. From Table 3, we can see that the sizes are all on the order of $10^3$, which is consistent with the claim made in [15,34]. However, we observed that the traditional Chinese kernel lexicons are much larger than the simplified Chinese ones. Concretely, the traditional Chinese kernel lexicons contain about 900 more words on average than the simplified Chinese ones.
To find out the possible reasons, we further analyze the part-of-speech tags and the lengths of the words in the kernel lexicons. The results are listed in Tables 4 and 5, respectively.
From Table 4, we found that both forms of Chinese have a large proportion of entity words (nouns and verbs), whose orders are roughly the same. The percentage of verbs in traditional Chinese is generally greater than that in simplified Chinese, indicating that verb weakening is an important development process in simplified Chinese.
From Table 5, we learned that the kernel lexicons extracted from the traditional Chinese corpora contain more one-character words than the ones extracted from the simplified Chinese corpora. This implies that traditional Chinese maintains some features of classical Chinese, while simplified Chinese does not.
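The selection of a kernel lexicon can be sketched as follows, assuming the crossover degree has already been read off the fitted curves. The function names, the sample words, and their degrees are hypothetical and serve only to illustrate the rule that the kernel contains the words whose degree is not less than the crossover degree.

```python
def kernel_lexicon(word_degrees, k_cross):
    """Words whose degree is at least k_cross, sorted by descending degree."""
    selected = [w for w, k in word_degrees.items() if k >= k_cross]
    return sorted(selected, key=lambda w: (-word_degrees[w], w))


def kernel_size(num_nodes, fraction_at_least_k_cross):
    """Kernel capacity as number of nodes times the cumulative fraction."""
    return round(num_nodes * fraction_at_least_k_cross)


# Made-up degrees for five words (not data from the paper).
degrees = {"的": 120, "是": 80, "在": 60, "网络": 5, "语言": 3}
kernel = kernel_lexicon(degrees, k_cross=60)
fraction = len(kernel) / len(degrees)
print(kernel)
print(kernel_size(len(degrees), fraction))
```

Both routes give the same size here: three words have a degree of at least 60, and 5 nodes times a cumulative fraction of 0.6 is also 3.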

Motif Verification.
Following [31], we performed motif analysis on each of the networks constructed in Section 4.1. The results are shown in Table 6. There is no difference between the simplified Chinese networks and the corresponding traditional Chinese networks, except that the traditional Chinese networks tend to have more motifs than the simplified Chinese ones, which is due to their larger numbers of nodes and edges. This shows that simplified and traditional Chinese are consistent at the semantic level.

Example Comparison.
We found that the differing words fall mainly into the following parts of speech: nouns, verbs, time words, gerunds, adverbs, numerals, and geographical nouns, as shown in Table 4. Among them, nouns, verbs, gerunds, and adverbs vary with the corpus. However, there are also some words that are unique to or frequently used in specific areas for regional and political reasons, such as "总统", "中华民国", "卫生署", "社会主义", and "农民工"; time words, numerals, and geographical nouns also differ in usage habits or frequency of use because of different regional cultures, such as "二零零五年", "2005年", "二十五", "25", "高雄县", and "长江".
This shows that many ancient Chinese words still appear in the written language of the traditional Chinese character system with a higher frequency, i.e., the written language of the traditional Chinese character system retains more classical Chinese characteristics.
In summary, the core dictionaries of the simplified and traditional Chinese character systems have a certain degree of versatility. However, in the process of language development, there have been some differences due to regional usage habits, environment, politics, and the generation of new words. In addition, in the development of the traditional Chinese character system, its written language still retains certain characteristics of classical Chinese.

Conclusion
In this paper, we proposed using complex network analysis to explore differences between simplified Chinese and traditional Chinese. To the best of our knowledge, this is the first work to use complex network-based approaches to compare simplified and traditional Chinese.
Through the comparisons, we obtained three interesting results. Firstly, the co-occurrence networks for both simplified and traditional Chinese are small-world and scale-free networks.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this article.

Authors' Contributions
All authors contributed equally to this paper. Zhongqiang Jiang designed the main experiments, wrote the paper, improved the English expression, and corrected typos and grammatical errors. Dongmei Zhao checked all the symbols, formulas, and algorithms and added some additional explanations during the revision process. Jiangbin Zheng wrote the first draft, proposed the idea, participated in experimental discussion, and designed the overall paper structure. Yidong Chen provided guidance and helped to revise the paper structure during the revision process.