Chinese Personal Name Disambiguation Based on Clustering

Personal name disambiguation is a significant issue in natural language processing and underlies many automatic information processing tasks. This research explores Chinese personal name disambiguation based on clustering. First, preprocessing transforms the raw corpus into a standardized format. Then, lexical analysis performs Chinese word segmentation, part-of-speech tagging, and named entity recognition. Furthermore, we extract features that better disambiguate Chinese personal names, and we create rules for identifying target personal names to improve the experimental results. Several feature weighting methods are implemented, namely Boolean weight, absolute frequency weight, tf-idf weight, and entropy weight. For clustering, agglomerative hierarchical clustering is selected after comparison with other clustering methods. Finally, a labeling approach selects the feature words that represent each cluster. The experiment achieves good results on five groups of Chinese personal names.


Introduction
The ambiguity of named entities is a prevalent phenomenon in natural language. Personal names in texts and web pages are often highly ambiguous, especially in Chinese datasets. For example, the Chinese personal name "Gao Jun (高军)" has 51 entries in the Baidu Encyclopedia. Eliminating the ambiguity of such personal names benefits many tasks, such as information retrieval and data summarization. For instance, when a personal name is searched on the Internet, the search engine returns documents about different person entities with the same name. It is therefore necessary to divide the documents into clusters automatically and to extract the key information of each cluster. This research focuses on this important task and attempts to solve it with unsupervised approaches.
Chinese personal name disambiguation involves distinguishing between people who share an ambiguous name in a Chinese corpus. Initially, HTML documents in the raw corpus are converted into plain text. Then, lexical analysis of the documents is performed, including segmentation, part-of-speech tagging (POS tagging), and named entity recognition (NER). Feature selection is carried out on the result of the lexical analysis. To improve the accuracy of personal name recognition, some rules of personal name extension are proposed for the target names to be disambiguated. For instance, the family name and given name of a target name may be separated by segmentation errors. We merge them into one complete personal name to reduce the number of discarded documents. Next, an agglomerative hierarchical clustering algorithm is adopted to group the documents containing the same personal name into clusters. Finally, each cluster is labeled by scoring the weight of each feature word in the cluster. The feature words chosen as the cluster label represent the person entity with significant information.
The rest of this article is arranged in the following parts. Related work of this task is introduced in Section 2. Research framework and methodology are elaborated in Section 3. Section 4 gives the experimental results and some discussions. Conclusion and future work are discussed in Section 5.

Related Work
The personal name disambiguation task is similar to word sense disambiguation (WSD). Both pursue the goal of resolving ambiguity in natural language understanding. Nevertheless, there is a major difference between the two tasks: the number of person entities sharing an ambiguous personal name is usually unknown in name disambiguation, whereas the number of senses is known in advance in WSD. Hence, personal name disambiguation is often implemented with unsupervised clustering.
There are many research directions in personal name disambiguation. Song et al. [1] exploited two topic-based models to extract features from the corpus and achieved good results for personal name disambiguation. Zhao et al. [2] used a personal ontology for feature extraction and similarity calculation on two real datasets, where the highest similarity is selected for disambiguation. Xu et al. [3] used a network embedding-based technique to disambiguate author names, in which networks are built from papers that share a target ambiguous author name. Yu and Yang [4] addressed the task under inadequate data sources, using a feature learning method and affinity propagation clustering. Kim et al. [5] combined global features with structure features for author name disambiguation. Global features, extracted from attributes of the dataset, formed the textual vector representation, and negative samples were employed to train a global model. Protasiewicz and Dadas [6] produced a hybrid framework combining a rule-based method with agglomerative hierarchical clustering. Rules were generated from expert knowledge, analysis, and so forth. A C index function was also proposed to determine the best threshold for stopping the hierarchical clustering algorithm. Du et al. [7] applied spectral clustering to recognize ambiguous names in large-scale scientific literature datasets, and proposed a distributed approach based on the Spark framework to handle such datasets. Pooja et al. [8] concentrated on the namesake issue in author name disambiguation. They presented the ATGEP method, which combines graph theory with an edge pruning operation.
Chinese personal name disambiguation has also been studied by a number of researchers. Chen et al. [9] proposed a feature weighting scheme that calculates pointwise mutual information between the personal name and each feature word. A trade-off indicator was designed to measure the quality of clusters and to stop the hierarchical clustering. Li and Wang [10] developed a multistage clustering algorithm based on entity knowledge, which can be used for Chinese named entity recognition and disambiguation. Ke et al. [11] handled author name disambiguation under insufficient information and missing data. Their algorithm devised a novel combination of indicators and incorporated back propagation neural networks.

Framework and Methodology
3.1. Research Framework. The research framework for disambiguating personal names is depicted in Figure 1. The framework has four main parts: preprocessing, feature selection, clustering, and labeling. The detailed procedure is presented in the following steps.

Here, df denotes the number of documents containing a given term. A word with df = 1 is ignored because it cannot help discriminate between documents.
Due to the weakness of the Chinese word segmentation tool, the personal name in a document may not be correctly identified. A series of rules for name extension is devised according to the results of word segmentation, as shown in Table 1. The part-of-speech tags "nr," "nr1," "nr2," "nrf," and "ng" represent "personal name," "Chinese surname," "Chinese given name," "transcribed personal name," and "noun morpheme," respectively, and "w" denotes "punctuation mark." The extension of personal names improves the accuracy of target name recognition.
Feature selection can also be performed either on the whole document (Document) or on the paragraphs (Paragraph) that contain the target personal name. The two schemes yield different results.
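The idea of merging a split surname and given name can be sketched as follows. This is a minimal illustration and not the rule set of Table 1: the function name `merge_split_names`, the (word, tag) token format, the example tags on non-name tokens, and the single rule shown (an "nr1" token immediately followed by an "nr2" token is merged when their concatenation equals the target name) are assumptions made for the example.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)


def merge_split_names(tokens: List[Token], target: str) -> List[Token]:
    """Merge adjacent surname ("nr1") and given-name ("nr2") tokens.

    Hypothetical rule for illustration: two adjacent tokens tagged "nr1"
    and "nr2" are merged into one "nr" token when their concatenation
    equals the target personal name.
    """
    merged: List[Token] = []
    i = 0
    while i < len(tokens):
        word, tag = tokens[i]
        if i + 1 < len(tokens):
            next_word, next_tag = tokens[i + 1]
            if tag == "nr1" and next_tag == "nr2" and word + next_word == target:
                merged.append((target, "nr"))
                i += 2  # skip both halves of the name
                continue
        merged.append((word, tag))
        i += 1
    return merged


# Segmenter output in which the target name "高军" was split in two.
tokens = [("高", "nr1"), ("军", "nr2"), ("是", "v"), ("教授", "n"), ("。", "w")]
print(merge_split_names(tokens, "高军"))
```

With a rule of this kind, a document whose only mention of the target name was split by the segmenter is kept for clustering instead of being discarded.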
(i) Frequency Weights. This weighting scheme gives each word an absolute frequency, which is the number of occurrences of word i in document j.
A word that appears in only a few documents is likely to be a better discriminator than one that occurs in most or all documents. Inverse document frequency (idf) gives greater weight to words that appear in fewer documents. The tf-idf weight assigned to word i in document j can be calculated by formula (3).
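A minimal sketch of tf-idf weighting is given below. It assumes the common form w_ij = tf_ij × log(N / df_i) with the natural logarithm; the exact form and logarithm base of formula (3) may differ, and the function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_weights(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: weight} dict per tokenized document.

    Assumed form: w_ij = tf_ij * log(N / df_i), natural logarithm.
    """
    n_docs = len(docs)
    # df_i: number of documents that contain word i at least once.
    df = Counter(word for doc in docs for word in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)  # f_ij: absolute frequency of word i in document j
        total = len(doc)       # total number of words in document j
        weights.append({
            word: (f / total) * math.log(n_docs / df[word])
            for word, f in counts.items()
        })
    return weights


docs = [
    ["professor", "university", "physics"],
    ["professor", "hospital", "surgeon", "hospital"],
    ["athlete", "team", "coach"],
]
for doc_weights in tfidf_weights(docs):
    print(doc_weights)
```

In this toy collection "professor" occurs in two of the three documents and therefore receives a smaller idf factor than words that occur in only one document.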
In formula (3), $tf_{ij}$ is $f_{ij}$ divided by the total number of words in the document, $N$ is the number of documents in the entire collection, and $df_i$ is the number of documents containing word $i$.

(i) Entropy Weights. The entropy weight method uses the concept of entropy to measure the distribution of word $i$ in document $j$, so its basic idea is similar to that of idf. It is defined by formula (4).

3.4. Clustering Algorithm

3.4.1. Hierarchical Clustering. Hierarchical clustering can be divided into two types: divisive and agglomerative. This paper chooses the latter because the complexity of the divisive clustering algorithm is relatively high and not practical for this task. Agglomerative hierarchical clustering is a bottom-up method [12]. It treats each document containing the target personal name as a separate cluster in the beginning. At each step, the algorithm merges the two most similar clusters into a larger one, until the maximum similarity between clusters falls below a preset threshold or only one cluster is left. As the similarity measure, this paper calculates the cosine of the angle between the vectors $x = (x_1, x_2, \cdots, x_n)$ and $y = (y_1, y_2, \cdots, y_n)$ [13]. It can be written as formula (5).

$$\cos(x, y) = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}} \tag{5}$$

... repeatedly assigned to different clusters according to the closest centroid. Then, the centroid of each cluster is recomputed. The iteration stops when a convergence criterion is satisfied or after a fixed number of iterations.
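The agglomerative procedure with cosine similarity described in Section 3.4.1 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the text does not say how the similarity between two multi-document clusters is computed, so average pairwise cosine (average linkage) is assumed here, and the names `cosine`, `cluster_similarity`, and `agglomerative` are hypothetical. Merging stops once no pair of clusters is at least as similar as the threshold.

```python
import math
from itertools import combinations
from typing import Dict, List

Vector = Dict[str, float]  # sparse feature vector: {feature word: weight}


def cosine(x: Vector, y: Vector) -> float:
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * y.get(k, 0.0) for k, w in x.items())
    norm_x = math.sqrt(sum(w * w for w in x.values()))
    norm_y = math.sqrt(sum(w * w for w in y.values()))
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.0
    return dot / (norm_x * norm_y)


def cluster_similarity(a: List[int], b: List[int], docs: List[Vector]) -> float:
    """Average pairwise cosine between two clusters (assumed linkage)."""
    total = sum(cosine(docs[i], docs[j]) for i in a for j in b)
    return total / (len(a) * len(b))


def agglomerative(docs: List[Vector], threshold: float) -> List[List[int]]:
    """Merge the two most similar clusters until none reaches the threshold."""
    clusters = [[i] for i in range(len(docs))]  # one cluster per document
    while len(clusters) > 1:
        best_sim, best_pair = float("-inf"), None
        for p, q in combinations(range(len(clusters)), 2):
            sim = cluster_similarity(clusters[p], clusters[q], docs)
            if sim > best_sim:
                best_sim, best_pair = sim, (p, q)
        if best_sim < threshold:
            break  # remaining clusters are too dissimilar to merge
        p, q = best_pair  # p < q, so deleting q keeps index p valid
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return clusters


docs = [
    {"professor": 1.0, "physics": 1.0},
    {"professor": 1.0, "physics": 0.5, "university": 0.5},
    {"surgeon": 1.0, "hospital": 1.0},
    {"hospital": 1.0, "surgeon": 0.5},
]
print(agglomerative(docs, threshold=0.3))
```

The sketch recomputes all pairwise similarities at every step, which is simple but slow; it is meant only to show the merge loop and the stopping rule.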

Spectral Clustering.
Spectral clustering is a type of graph-based clustering. It utilizes the eigenvalues or spectrum of the similarity matrix to achieve the goal of dimensionality reduction. Documents can be assigned to different clusters based on the lower-dimensional representation. There are three basic stages in spectral clustering, including preprocessing, decomposition, and grouping.

GMM Clustering.
Gaussian mixture model (GMM) clustering, also known as expectation-maximization (EM) clustering, uses an optimization strategy to cluster unlabeled documents. GMM assumes that the data are generated by a mixture of Gaussian distributions and tries to find the mixture of multidimensional Gaussian probability distributions that best models the dataset.

Labeling Approach.
To summarize the person information of each cluster produced by the clustering algorithm, a labeling step is necessary. A simple way of creating a label is to choose a group of representative feature words by ranking the weights of all feature words in the cluster.
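Ranking feature words to build a cluster label can be sketched as follows. The function name `label_cluster` and the use of summed per-document weights as the ranking score are assumptions for illustration; any other scoring function could be substituted for the sum.

```python
from collections import defaultdict
from typing import Dict, List


def label_cluster(cluster_docs: List[Dict[str, float]], k: int) -> List[str]:
    """Return the k feature words with the largest total weight in a cluster."""
    totals: Dict[str, float] = defaultdict(float)
    for doc in cluster_docs:
        for word, weight in doc.items():
            totals[word] += weight
    # Sort by descending weight; break ties alphabetically for determinism.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


cluster = [
    {"professor": 2.0, "physics": 1.0, "university": 0.5},
    {"professor": 1.0, "physics": 2.0, "laboratory": 0.5},
]
print(label_cluster(cluster, k=2))
```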
The labeling algorithm [9] combines mutual information (MI) with tf to score the weights. For each feature word $x_i$ in cluster $C_k$, the score is calculated by formula (6). $MI(x_i, \text{name})$ measures the mutual information between the feature word and the personal name, and $tf(x_i, C_k)$ counts the occurrences of $x_i$ in cluster $C_k$. A label of $k$ words is obtained by taking the top $k$ feature words by score.

Chinese word segmentation, part-of-speech tagging, and named entity recognition are performed on the corpus. Two segmentation and tagging toolkits are used: ICTCLAS (http://ictclas.nlpir.org/) and LTP (http://ltp.ai). Because the performance of word segmentation strongly affects the accuracy of personal name recognition, we compared the number of discarded documents for the two toolkits (see Table 2). The gold standard gives the true number of discarded documents. The results show that the personal name recognition of ICTCLAS is more precise than that of LTP.

Evaluation.
Purity and inverse purity [14,15] are taken as precision and recall for evaluating the clustering results. Suppose S is the cluster set to be evaluated and R is the manually labeled category set. Purity and inverse purity are defined by formulas (9) and (10).
The F score calculates the harmonic mean of precision and recall, which is defined by formula (11). The overall F score of five personal names is the average of these values.
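The evaluation measures can be sketched as follows. The sketch assumes the standard definitions from the clustering evaluation literature: purity averages, weighted by cluster size, the best overlap of each cluster with any category; inverse purity swaps the roles of clusters and categories; and the F score is their harmonic mean. The exact forms of formulas (9) to (11) may differ, and the function names and toy data are hypothetical. Every document is assumed to belong to exactly one cluster and one category.

```python
from typing import List, Set


def purity(clusters: List[Set[int]], categories: List[Set[int]]) -> float:
    """Size-weighted average of each cluster's best overlap with a category."""
    n = sum(len(c) for c in clusters)
    return sum(max(len(c & r) for r in categories) for c in clusters) / n


def inverse_purity(clusters: List[Set[int]], categories: List[Set[int]]) -> float:
    """Purity with the roles of clusters and categories exchanged."""
    return purity(categories, clusters)


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


S = [{0, 1, 2}, {3, 4}, {5}]   # system clusters (document ids)
R = [{0, 1}, {2, 3, 4, 5}]     # manually labeled categories
p = purity(S, R)
ip = inverse_purity(S, R)
print(p, ip, f_score(p, ip))
```

In this toy case the system produces too many clusters, so purity (5/6) is higher than inverse purity (2/3), which is the typical trade-off controlled by the clustering threshold.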

Result and Discussion
The Chinese personal name is recognized by the lexical analysis tool ICTCLAS in this research. We classify all documents containing the same personal name into one ...

As Table 3 shows, the hierarchical clustering algorithm outperforms the other methods. Additional experiments with other features showed similar results and confirmed the advantage of hierarchical clustering.
After choosing the clustering algorithm, different combinations of features are adopted for comparison. Table 4 summarizes the clustering results with tf-idf weights.
Named entities (NE) are taken as the baseline in this experiment. The F score increases significantly when other noun features are included (NE is a subset of the nouns). However, adding verb features lowers the F score compared with using noun features only. Even though some verbs help to identify a person, most verb features contribute little to disambiguating the personal name. A large number of unrelated verbs introduces noise into the features, leading to poor experimental results.
Owing to the inaccuracy of word segmentation tools, personal names in documents cannot always be identified. When the recognition of an ambiguous name fails, documents that should be clustered are discarded. Therefore, personal name extension is introduced by setting some rules to identify target names. The results presented in Table 4 suggest that name extension dramatically improves the F score of the clustering results.
The scope of feature selection can be either the whole document or each paragraph containing the ambiguous name. According to Table 4, feature selection over the whole document yields better results than over paragraphs, with the exception of the noun feature.
Results of the four feature weighting schemes, each with its highest F score, are shown in Table 5. The corresponding features are listed in parentheses in the first column. As can be seen ...

Figure 2: Distribution of F score at different thresholds.

The detailed result for every personal name is given in Table 6 for the setting in which the average F score over all names reaches its best value.
The clustering algorithm stops when the similarity between the two most similar clusters falls below a certain threshold. The relationship between the threshold and the F score is illustrated in Figure 2. The influence of the threshold on the results depends on the feature set. We choose feature 6 (N + NameEx) with document-level feature selection to run the clustering. The threshold is determined by enumeration in steps of 0.01. The F score reaches its highest value of 90.15% when the tf-idf weight is selected.
Basic information about a person is given by the labeling process. For instance, the clusters of Gao Jun (高军) are labeled with meaningful words in Table 7. The created labels are representative words that summarize the characteristics of a person.
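The enumeration of thresholds can be sketched as a simple grid search. The function `best_threshold`, the search range from 0 to 1 (reasonable for cosine similarity with non-negative weights), and the toy scoring function are assumptions for illustration; in practice `evaluate` would run the clustering at the given threshold and return the resulting F score.

```python
from typing import Callable, Tuple


def best_threshold(evaluate: Callable[[float], float],
                   step: float = 0.01) -> Tuple[float, float]:
    """Try thresholds 0, step, 2*step, ..., 1 and return (threshold, score)."""
    n_steps = round(1.0 / step)
    best_t, best_f = 0.0, float("-inf")
    for i in range(n_steps + 1):
        t = i / n_steps  # derive from the index to avoid accumulated float drift
        f = evaluate(t)
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f


def toy_f(t: float) -> float:
    """Toy stand-in for 'cluster at threshold t, then compute the F score'."""
    return 1.0 - abs(t - 0.25)


t, f = best_threshold(toy_f)
print(round(t, 2), round(f, 4))
```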

Conclusions
This paper studied the task of Chinese personal name disambiguation based on an unsupervised method. The open dataset contains five ambiguous names with a gold standard. We used lexical analysis toolkits to perform segmentation and POS tagging. Eight groups of features are selected and combined with four feature weighting methods. To improve the precision of personal name recognition, name extension is proposed. The extension of personal names significantly improves the final clustering results. In addition, the agglomerative hierarchical clustering algorithm is chosen from four methods for disambiguating names. The threshold of hierarchical clustering is also tested for different feature weights. Finally, labels are constructed for the clusters of each target name by scoring the weights of feature words in the clusters.
The final experimental results demonstrate the effectiveness of the proposed approach. Nonetheless, the framework has some limitations. The rules of personal name extension are tailored to the current dataset, so extra rules may be needed for other corpora to increase the precision of detecting Chinese personal names. In addition, we will develop automatic feature selection algorithms as well as new weight calculation methods in future work. More sophisticated clustering and supervised document classification methods will also be taken into consideration.

Data Availability
The original dataset used in this work is available from the corresponding author on request.

Conflicts of Interest
The authors declare no conflicts of interest.