CAREA: Cotraining Attribute and Relation Embeddings for Cross-Lingual Entity Alignment in Knowledge Graphs

Knowledge graphs (KGs) are one of the most widely used techniques of knowledge organization and have been extensively applied in many fields related to artificial intelligence, for example, web search and recommendation. Entity alignment provides a useful tool for integrating multilingual KGs automatically. However, most existing studies exploit only entity relationships and ignore the abundant information in entity attributes. This paper investigates cross-lingual entity alignment and proposes an iterative cotraining approach (CAREA) that trains a pair of independent models. The two models extract the attribute and the relation features of multilingual KGs, respectively. In each iteration, the two models alternately predict a new set of potentially aligned entity pairs. The method further filters these predictions through a dynamic threshold to strengthen the supervision of the two models. Experimental results on three real-world datasets demonstrate the effectiveness and superiority of the proposed method: CAREA improves performance by an absolute margin of at least 3.9% across all experiment datasets. The code is available at https://github.com/ChenBaiyang/CAREA.


Introduction
Knowledge graphs (KGs), which store machine-readable representations of factual knowledge, are becoming the basis for many applications such as web search (Google and Bing), recommendation (Amazon and eBay), and social networks (Facebook and LinkedIn). Multilingual KGs (e.g., DBpedia [1], YAGO [2], and ConceptNet [3]) are constructed in separate languages from various data sources and contain a wealth of complementary facts. Identifying equivalent entities across multilingual KGs helps bridge language gaps and improves the user experience of downstream cross-language applications. Hence, aligning the entities in multilingual KGs, known as the problem of cross-lingual entity alignment, has recently attracted increasing research attention.
Most existing entity alignment methods rely entirely on graph structures, while the abundant attribute information in KGs remains unexplored. The attributes of an entity expressed in different languages often share substantial semantic information, providing a potentially valid view of the entities in multilingual KGs. However, it is nontrivial to capture and exploit such information for cross-lingual entity alignment. First, attribute information can be quite diverse across different KGs, most likely because different applications have distinct attribute concerns during construction. Second, the semantic association of attributes cannot be modeled directly since the languages expressing the entities differ. Moreover, exploiting relationships and attributes simultaneously across multilingual KGs remains an open challenge in the area of knowledge graphs.
Cotraining is a popular machine learning method in which two complementary models exploit a large number of unlabeled examples to bootstrap each other's performance iteratively [4,5]. Cotraining can be readily applied to multilingual tasks since the data in these tasks have two or more views (i.e., subsets of features). It is also applicable to entity alignment across multilingual KGs, as the entity attributes and the graph structure naturally form two independent views of a KG. In the cotraining framework, each model is trained on one of the two views, under the assumption that either view is sufficient to make a prediction. In each iteration, the cotraining algorithm selects high-confidence samples ranked by each model to form new auto-labeled samples and then uses both the labeled data and the additional auto-labeled data to update the other model. This paper introduces a cotraining-based approach, CAREA, to learn embeddings from two independent views of knowledge (relationships and attributes) in multilingual KGs. CAREA iteratively trains two component models, the attribute-based model f_attr and the structure-based model f_struc. f_attr extracts attribute features from attribute occurrence frequencies and value data types and employs a Multilayer Perceptron (MLP) to transform both KGs into a unified vector space. f_struc adopts a graph attention mechanism to capture the multirelational characteristics of KGs. During each iteration of the cotraining process, the two models alternately predict a set of new potentially aligned entity pairs to strengthen the supervision of cross-lingual learning. Such collaborative predictions gradually improve the performance of each model. To improve the accuracy of the predictions, we further evaluate the predicted entity pairs through a dynamic threshold.
Experimental results on three real-world datasets demonstrate the effectiveness and superiority of our proposed method CAREA. The rest of this paper is organized as follows. Section 2 summarizes related work. Section 3 formally defines the problem. Section 4 introduces the proposed approach. Section 5 presents the experimental results. Finally, we conclude in Section 6.

KG Embedding.
Embedding-based entity analysis approaches, which project entities into low-dimensional embedding spaces, have demonstrated their effectiveness in modeling the semantic information of KGs. The KG embedding model TransE [6] interprets a relation as a translation from one entity to another. Such translation-based embedding models have shown their feasibility and were later improved by several subsequent studies, such as TransH [7], TransR [8], and TransD [9].
TransH and TransR extend TransE's modeling of relations from one-to-one to multimapping relations. TransD uses dynamic matrices, rather than fixed ones, to transfer entities and relations. R-GCN [10] similarly incorporates relation type information by setting a transformation matrix for each relation. Other authors avoid translation approaches for KG embedding altogether [11-15]. A notable example is the study of Nathani et al. [15], which extended the graph attention mechanism (GAT) to capture entity and relation features in the multihop neighborhood of a given entity. Some research exploits additional information in KGs to improve embedding performance. For example, reverse triples and relational paths are combined in PTransE [16], and categorical attributes such as gender and hobby are introduced in KR-EAR [17]. In addition, some works explore entity types, local structures, and global patterns in KG embedding [10, 18-20].

Entity Alignment.
Entity alignment aims to automatically determine whether entities from different KGs refer to the same real-world object. Traditional entity alignment methods exploit various features of KGs, such as the semantics of OWL properties [21], compatible neighbors and attribute values of entities [22], and relation structures [23].
Many recent studies have applied embedding methods to the alignment problem in KGs. MTransE [24] deploys three mechanisms, axis calibration, translation vectors, and linear transformations, to learn multilingual KG embeddings. The improved model IPTransE [25] combines the advantages of TransE and PTransE to embed a single KG and then adds an iterative, parameter-sharing step for embedding multiple KGs. BootEA [26] improves on JAPE [27] with a bootstrapping strategy that iteratively labels potential entity alignments, so that the constructed training data can be used to learn alignment-oriented embeddings. MuGNN [28] learns alignment-oriented KG embeddings through a multichannel mechanism that encodes KGs via KG completion and entity pruning. NAEA [29] merges neighborhood subgraph-level information of entities and designs a neighborhood-aware attention representation mechanism on cross-lingual KGs. RDGCN [30] proposes a relation-aware dual-graph convolutional network to leverage relations through attentive interactions between a KG and its dual relation counterpart. MRAEA [31] learns cross-lingual entity embeddings by attending over an entity's neighbors and the meta semantics of its connecting relations. Some literature on cross-lingual entity alignment has highlighted the role of both KG structures and attributes. JAPE [27] embeds the structures of different KGs into a uniform hidden space and uses the attribute correlations of KGs to refine the entity embeddings. However, when attributes are heterogeneous or their associations are noisy, JAPE's attribute component can significantly degrade the performance of its structural component. Graph convolutional networks (GCNs) [32] have also been employed [33] to learn embeddings from both the structure and attribute information of entities for cross-lingual alignment.

Problem Definition
In a KG, facts are mainly stored in two types of triples, <entity, attribute, value> and <entity, relation, entity>, which are called attribute triples and relation triples, respectively.
This paper denotes a KG as G = (E, R, A), where E = {e_1, e_2, ..., e_N} is the set of entities, R = {(e_i, e_j) | e_i, e_j ∈ E} is the set of relations, and A = {a_{e_i} | e_i ∈ E} is the set of attributes in the KG. The attributes of each entity consist of a set of key-value pairs.

Definition 1 (cross-lingual entity alignment). Let G^(1) = (E^(1), R^(1), A^(1)) and G^(2) = (E^(2), R^(2), A^(2)) be two arbitrary KGs in different languages. The entity pairs that refer to the same real-world object are called prealigned entities, denoted as L = {(e_i^(1), e_j^(2)) | e_i^(1) ∈ E^(1), e_j^(2) ∈ E^(2)}. The task of cross-lingual entity alignment is to find the hidden aligned entity pairs M = {(e_i^(1), e_j^(2)) | e_i^(1) ∈ E^(1), e_j^(2) ∈ E^(2), (e_i^(1), e_j^(2)) ∉ L} based on the prealigned pairs L.
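The formalization above can be sketched as a minimal in-memory structure. This is a hypothetical layout for illustration only; the paper does not prescribe any particular implementation, and all names here are the sketch's own.

```python
from dataclasses import dataclass, field

@dataclass
class KG:
    """A KG G = (E, R, A): entity set, relation triples, and attributes."""
    entities: set = field(default_factory=set)      # E
    relations: set = field(default_factory=set)     # R: (head, relation, tail)
    attributes: dict = field(default_factory=dict)  # A: entity -> {key: value}

    def add_relation_triple(self, head, rel, tail):
        self.entities.update({head, tail})
        self.relations.add((head, rel, tail))

    def add_attribute_triple(self, entity, key, value):
        self.entities.add(entity)
        self.attributes.setdefault(entity, {})[key] = value

kg = KG()
kg.add_relation_triple("Michael", "bornIn", "America")
kg.add_attribute_triple("Michael", "birthDate", "1958-08-29")
```

Both triple types from the definition map onto this structure: relation triples populate R, and attribute triples populate the key-value sets in A.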

Proposed Approach
Overview. This section presents the details of the proposed model CAREA, which is based on the cotraining algorithm. Its framework is shown in Figure 1.
We construct two independent models: the attribute-based model f_attr and the relation-based model f_struc. The advantage of the cotraining algorithm is that it reinforces the performance of the two models over the iterations. In each iteration, both models are retrained with the prealigned entity pairs L and predict new pairs of potentially aligned entities. The predictions are then further filtered by a dynamic threshold, and the remaining entity pairs are merged into L for the next iteration, until convergence.

Attribute-Based
Model. In our scenario, the attributes of an entity consist of a number of key-value pairs, for example, < name:Michael > , where "name" is the attribute key, and "Michael" is the attribute value. For simplicity, a key-value pair is also called an attribute.

Attribute Extension.
A critical problem of attribute representation is that some actual attributes may not be observed because they were not explicitly recorded or captured by the crawlers. Therefore, we first extend the attributes of both KGs using the prealigned entity pairs. Typically, if one entity of an aligned pair has an attribute in one KG, the corresponding entity in the other KG should also have this attribute. Based on this observation, we can add a key-value pair to an entity in one KG if its counterpart in the other KG has that key-value pair. Formally, the attributes of an entity e_i are denoted by a_{e_i} = {p_1, p_2, ..., p_j, ...}, where p_j = <key_j: value_j> is a key-value pair. For each entity e_i in KG1, its attribute set a_{e_i}^(1) is extended with the pairs held by its counterpart; similarly, the counterpart's attribute set a_{e_j}^(2) in KG2 is extended with the pairs of e_i.
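The extension step can be sketched as follows; the function and entity names are illustrative, and the sketch simply mirrors each aligned pair's key-value pairs onto its counterpart as described above.

```python
def extend_attributes(attrs1, attrs2, aligned_pairs):
    """attrs1/attrs2: dict entity -> {key: value}; aligned_pairs: [(e1, e2)].

    For every prealigned pair, copy each key-value pair that one side has
    and the other lacks onto the counterpart entity.
    """
    for e1, e2 in aligned_pairs:
        a1 = attrs1.setdefault(e1, {})
        a2 = attrs2.setdefault(e2, {})
        for k, v in list(a1.items()):
            a2.setdefault(k, v)   # add missing pairs to the KG2 entity
        for k, v in list(a2.items()):
            a1.setdefault(k, v)   # add missing pairs to the KG1 entity
    return attrs1, attrs2

a1 = {"Michael": {"name": "Michael"}}
a2 = {"Mai_Ke": {"nationality": "Mei_Guo"}}
extend_attributes(a1, a2, [("Michael", "Mai_Ke")])
```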

Attribute Feature Representation.
In multilingual KGs, the attributes are written in different languages and cannot be compared directly. However, we observe the following: (1) The occurrence frequencies of equivalent attribute keys in multilingual KGs are approximately similar. For example, an entity representing a person in different KGs often has equivalent attributes such as name, date of birth, and nationality. Although the texts describing these attributes are multilingual, their frequencies in different KGs are similar because each is close to the ratio of person entities to all entities in its KG. (2) The values of an equivalent attribute pair in different KGs have the same data type. For example, both the English word "Michael" and the Chinese word "Mai Ke" are strings, and both "3.14" and "3.14159" are floating-point numbers.
Hence, this study represents the attributes of an entity by its attribute key frequencies and attribute value types. The construction of entity attribute features is illustrated by a concrete example in Figure 2.
First, the attribute triples in each KG are merged into a set of key-value pairs, where the keys and values are then used to derive the frequency and the type features, respectively. The frequency F_{p_j} of an attribute p_j is a floating-point number ranging from 0 to 1:

F_{p_j} = Count(p_j) / |E|,

where Count(p_j) is the number of occurrences of attribute p_j in a KG and |E| is the total number of entities in the KG. In this example, the frequencies of entity "Michael"'s nationality and birthdate attributes are 0.2162 and 0.3351, respectively. Second, we divide the frequency range (i.e., the interval [0, 1]) into a sequence of small real intervals, so that an attribute can be represented by the index of the interval its frequency falls into. In this paper, a proportional sequence with proportionality constant q is applied to split the frequency range, anchored at F_min, the least frequency of occurrence of any attribute in the KG; we fix q = 0.001. For example, "Nationality" and "Guo Ji" in Figure 2 both fall in interval 2, although their frequencies differ. This setting makes the interval representation robust to small frequency changes caused by noise, since attributes with slightly different frequencies are merged into one interval.
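A minimal sketch of the frequency feature under the definitions above. The frequency is exactly F(p) = Count(p)/|E|; the paper's precise interval formula is not fully recoverable from the text, so `interval_index` below is one illustrative proportional mapping, not the paper's exact choice.

```python
import math
from collections import Counter

def attribute_frequencies(attrs, num_entities):
    """attrs: dict entity -> {key: value}. Returns F(p) = Count(p)/|E|."""
    counts = Counter(k for kv in attrs.values() for k in kv)
    return {k: c / num_entities for k, c in counts.items()}

def interval_index(freq, f_min, q=0.001):
    """Illustrative proportional split of [0, 1] anchored at f_min.

    The index grows slowly with frequency, so near-equal frequencies
    share an interval - the noise-robustness property described above.
    """
    if freq <= f_min:
        return 0
    return math.ceil(math.log(freq / f_min, 1.0 + q))

freqs = attribute_frequencies(
    {"e1": {"name": "A", "dob": "1990"}, "e2": {"name": "B"}},
    num_entities=2,
)
```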
The value type of an attribute is its data type. Following previous work [27], this study distinguishes four data types: Integer, Double, DateTime, and String. We encode the value type by a one-hot vector whose dimension equals the number of data types. For example, the codes for the attribute values "America" and "1958-08-29" are [0, 0, 0, 1] and [0, 0, 1, 0], respectively.
As the above two steps suggest, the primary idea of attribute feature representation is to integrate the frequency and type representations of each attribute p_j. We combine the two representations into a sparse matrix as shown in Figure 2: each column of the matrix corresponds to a value type, and the row index is the frequency interval number, so an attribute places its one-hot type code in the row indexed by its frequency interval. We then reshape this matrix into a row feature vector p_j→. In this way, the attribute representation a_{e_i} of an entity e_i is formed as the sum of its attribute vectors:

a_{e_i} = Σ_j p_j→.

To reduce noise, we use an indicator function I(·) to transform the attribute vector of an entity into a binary representation, applied elementwise. The binary representation is then averaged over the entity's neighboring entities N_{e_i}. Then, a three-layer MLP transforms the attribute vectors of the two KGs into a uniform vector space, bringing equivalent entities in different KGs close to each other. The MLP output is taken as the attribute embedding of an entity, denoted h_{e_i}^attr. We use ReLU as the activation function, and batch normalization and dropout are added to improve performance. The details of the objective function are introduced in Section 4.4.
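The matrix construction, flattening, indicator binarization, and neighbor averaging above can be sketched as follows. The interval cap `num_intervals` and all helper names are illustrative; the trainable MLP stage is omitted.

```python
import numpy as np

TYPES = ["Integer", "Double", "DateTime", "String"]  # the four value types

def attribute_vector(entity_attrs, num_intervals=8):
    """entity_attrs: list of (interval_index, type_name) per attribute.

    Place each attribute's one-hot type row at its frequency-interval
    index, flatten the matrix, sum, and binarize with the indicator I(.).
    """
    mat = np.zeros((num_intervals, len(TYPES)))
    for interval, t in entity_attrs:
        mat[min(interval, num_intervals - 1), TYPES.index(t)] += 1.0
    vec = mat.reshape(-1)            # reshape matrix into a row vector
    return (vec > 0).astype(float)   # indicator -> binary representation

def neighbor_average(vectors, neighbors):
    """Average the binary vectors over an entity's neighborhood."""
    return np.stack([vectors[n] for n in neighbors]).mean(axis=0)

v = attribute_vector([(2, "String"), (1, "DateTime")])
avg = neighbor_average({"a": v, "b": np.zeros_like(v)}, ["a", "b"])
```

The averaged vectors would then be fed to the three-layer MLP to obtain h_{e_i}^attr.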

Relationship-Based
Model. In KGs, various types of relations describe entity associations and are crucial for aligning entities across KGs. Many previous works represent a relation as a transformation of its connected entities. However, these methods bring the relation representation too close to the entity representation [31], making it difficult to capture the features of multiple relations. Hence, this paper represents entities and relations separately, and their combinations are fed into a graph attention network (GAT) [34]. As a result, the two KGs are embedded into a unified vector space in which equivalent entities from different KGs are close to each other. This study treats relations as undirected; that is, (e_i, e_j) is equivalent to (e_j, e_i). The idea of GAT is to compute each entity's hidden representation by attending over its neighboring entities, following a self-attention strategy. First, the embedding h_{e_i} ∈ R^{d×1} of each entity e_i and the embedding h_{r_j} ∈ R^{d×1} of each connected relation r_j are randomly initialized.
This study sets the same embedding dimension d for both entities and relations. Second, we average the embeddings of the entities and relations connected to e_i. The entity embedding and this averaged connection embedding are then concatenated as the input to the GAT network:

x_i = h_{e_i} ‖ (1 / (|N_{e_i}| + |N_i^r|)) ( Σ_{e_j ∈ N_{e_i}} h_{e_j} + Σ_{r_j ∈ N_i^r} h_{r_j} ),

where N_{e_i} represents e_i's neighboring entities, N_i^r represents the set of relations outgoing from e_i, and ‖ denotes the concatenation operation. The attention coefficients can be calculated by

e_ij = a^T [x_i ‖ x_j],   (8)

where e_ij indicates the weighted importance of neighbor e_j to e_i and a ∈ R^{2d×1} is the shared attention weight vector. Different from the original GAT, there is no weight matrix W for the input features in equation (8). The coefficients over all adjacent entities are normalized using the softmax function and a LeakyReLU nonlinearity with negative-input slope α = 0.2; such normalization makes the coefficients of different nodes easy to compare:

α_ij = softmax_j(LeakyReLU(e_ij)) = exp(LeakyReLU(e_ij)) / Σ_{e_k ∈ N_{e_i}} exp(LeakyReLU(e_ik)).

A nonlinear ReLU is applied to the weighted combination of the participating neighbors, yielding the output features of each entity:

h_i' = ReLU( Σ_{e_j ∈ N_{e_i}} α_ij x_j ).   (10)

The training process is stabilized by a multihead mechanism. Specifically, K independent attention heads execute the transformation of equation (10), and their features are averaged to give the output

h_i' = ReLU( (1/K) Σ_{k=1}^{K} Σ_{e_j ∈ N_{e_i}} α_ij^k x_j ),

where k indexes the heads and α_ij^k is the attention coefficient in the k-th head. This study also extends the attention mechanism to multihop neighborhood information by adding more layers, creating a multilayer GAT.

Objective Function.
As mentioned in the previous section, the two models provide embeddings of the entities of the two KGs from different views. This section uses the same objective function to optimize both of them. Following previous work [33], the Manhattan distance is employed as the similarity measure.
The similarity of e_i ∈ G^(1) and e_j ∈ G^(2) in the joint vector space is calculated as

s(e_i^(1), e_j^(2)) = ‖ h_{e_i} − h_{e_j} ‖_1.

To find the counterpart of e_i, the similarity to every entity in G^(2) is computed in the same way, and the nearest one is chosen as e_i's equivalent. On top of that, we adopt the following margin-based loss function, since it balances positive and negative samples and ensures lower scores for the positive ones:

Loss = Σ_{(e_i, e_j) ∈ L} Σ_{(e_i', e_j')} [ s(e_i, e_j) + c − s(e_i', e_j') ]_+,

where [·]_+ denotes max(·, 0) and c is the margin hyperparameter. e_i' and e_j' are negative counterparts of e_i and e_j, respectively; in this work, they are randomly sampled from the entities of G^(1) and G^(2). Adam [35] is adopted to minimize the loss function.
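A minimal sketch of the Manhattan-distance similarity and the margin-based hinge loss described above; the embeddings and negative-pair indices are illustrative stand-ins for the learned parameters and the random negative-sampling scheme.

```python
import numpy as np

def manhattan(u, v):
    """s(e_i, e_j) = ||h_i - h_j||_1 (lower means more similar)."""
    return np.abs(u - v).sum()

def margin_loss(emb1, emb2, positives, negatives, margin=3.0):
    """positives/negatives: lists of (i, j) index pairs into emb1/emb2.

    Hinge loss [s(pos) + margin - s(neg)]_+ summed over the pairs.
    """
    loss = 0.0
    for (i, j), (i_n, j_n) in zip(positives, negatives):
        pos = manhattan(emb1[i], emb2[j])
        neg = manhattan(emb1[i_n], emb2[j_n])
        loss += max(pos + margin - neg, 0.0)
    return loss

emb1 = np.array([[0.0, 0.0]])
emb2 = np.array([[0.0, 0.0], [5.0, 5.0]])
```

With a well-separated negative (distance 10) and a perfect positive (distance 0), the hinge is inactive and the loss is zero; swapping the pairs activates it.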

Cotraining Algorithm.
In this study, the cotraining of the attribute-based model f_attr and the relation-based model f_struc is conducted iteratively. The two components alternately train and predict new potentially aligned entity pairs in each iteration, until neither obtains new pairs. The prediction is based on the cosine similarity of entities in the unified vector space: a new pair is proposed for an entity in one KG by searching for its nearest neighbor (NN) in the other KG. It is worth noting that, in most cases, NN relations are asymmetric. For example, although e_i in G^(1) may be the entity most similar to e_j in G^(2), there may be another entity in G^(2) that is even closer to e_i. Thus, newly predicted entity pairs are required to be bidirectional nearest neighbors.
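The bidirectional nearest-neighbor rule can be sketched as follows, with cosine similarity computed via normalized dot products; the embedding arrays are illustrative.

```python
import numpy as np

def mutual_nearest_pairs(emb1, emb2):
    """Return (i, j) pairs that are nearest neighbors in BOTH directions."""
    a = emb1 / np.linalg.norm(emb1, axis=1, keepdims=True)
    b = emb2 / np.linalg.norm(emb2, axis=1, keepdims=True)
    sim = a @ b.T                  # cosine similarity matrix
    nn_12 = sim.argmax(axis=1)     # best KG2 match for each KG1 entity
    nn_21 = sim.argmax(axis=0)     # best KG1 match for each KG2 entity
    return [(i, int(j)) for i, j in enumerate(nn_12) if nn_21[j] == i]

pairs = mutual_nearest_pairs(
    np.array([[1.0, 0.0], [0.0, 1.0]]),
    np.array([[0.9, 0.1], [0.1, 0.9]]),
)
```

One-directional matches that are not reciprocated are discarded, which is exactly the asymmetry filter motivated above.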

Dynamic Similarity Threshold.
We further evaluate the predicted potentially aligned entity pairs by dynamically adjusting a similarity threshold in each iteration. That is, only entity pairs whose cosine similarity exceeds a threshold τ are added to the aligned pair set L. Since a higher similarity threshold implies higher precision, we set higher thresholds for earlier iterations. However, a high threshold may also limit the model's capability to propose a sufficient number of aligned entity pairs, so lower thresholds are used in later iterations. The threshold function can be designed in various ways; in this paper, we use a linear threshold function:

τ(n) = α − δ · n,

where α ∈ (0, 1) is the initial threshold, n ∈ {0, 1, 2, ...} is the iteration number, and δ ∈ (0, 1) is the coefficient controlling the rate of change per iteration. To control the precision of each component model, we set different threshold parameters for the two models in our experiments. The detailed cotraining procedure of CAREA is given in Algorithm 1.
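A minimal sketch of the linear schedule and the filtering step, assuming the linear form τ(n) = α − δ·n described above; the default α and δ follow the experiment settings but are illustrative here.

```python
def threshold(n, alpha=0.95, delta=0.05):
    """Linear schedule tau(n) = alpha - delta * n, clipped at 0."""
    return max(alpha - delta * n, 0.0)

def filter_pairs(candidate_pairs, similarities, n, alpha=0.95, delta=0.05):
    """Keep only candidates whose cosine similarity meets tau(n)."""
    tau = threshold(n, alpha, delta)
    return [p for p, s in zip(candidate_pairs, similarities) if s >= tau]
```

Early iterations (n = 0) keep only near-certain pairs for precision, while later iterations admit progressively more pairs as the threshold decays.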

Datasets.
This section uses the popular public entity alignment dataset DBP-15K [27] to evaluate the performance of CAREA. DBP-15K contains three cross-lingual subsets built from DBpedia: DBP_ZH-EN, DBP_JA-EN, and DBP_FR-EN. Each subset contains two KGs in different languages; for example, DBP_FR-EN pairs French and English. Their statistics are displayed in Table 1.

Experiment Settings.
Following previous work [29], we adopt two evaluation metrics: (1) Hits@k, the proportion of correctly aligned entities ranked in the top k, and (2) Mean Reciprocal Rank (MRR), the average of the reciprocal ranks of the correct results. Higher Hits@k and MRR scores indicate better alignment performance. The two metrics are calculated as

Hits@k = (1/|T|) Σ_{t ∈ T} I(pos(t) ≤ k),  MRR = (1/|T|) Σ_{t ∈ T} 1/pos(t),

where pos(·) is the position of a tested entity pair in the returned candidate list and T is the set of tested entity pairs; Hits@1 and Hits@10 are adopted in our experiments. Our approach is compared with the baselines under the same evaluation metrics and dataset split. The experiments randomly take 30% of the prealigned entity pairs as training data and use the remaining 70% for testing. Because nearest-neighbor relations across KGs are asymmetric, the average score of both alignment directions (e.g., ZH → EN and EN → ZH) is reported. Each experiment is run five times independently, and the average performance is taken as the final result. The same settings are applied to all experiment models unless otherwise stated. The hidden dimensions for attributes, entities, and relations are all d = 100. The margin parameter c, the dropout rate, and the learning rate of Adam are 3, 0.3, and 0.005, respectively. For the relation-based model f_struc, we fix the number of attention heads K = 2 and the GAT depth l = 2. For the cotraining process, we take the results of CAREA's third iteration as its final performance. The threshold coefficient for both component models is the same, δ = 0.05, and the initial thresholds α for f_attr and f_struc are empirically set to 0.95 and 0.9, respectively.
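The two metrics can be sketched directly from their definitions; the pos(·) values below are illustrative.

```python
def hits_at_k(positions, k):
    """Fraction of test pairs whose true counterpart ranks in the top k."""
    return sum(1 for p in positions if p <= k) / len(positions)

def mean_reciprocal_rank(positions):
    """Average of 1/pos over the test pairs (pos is 1-based)."""
    return sum(1.0 / p for p in positions) / len(positions)

ranks = [1, 3, 12]   # illustrative 1-based positions of the true matches
```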

Baselines.
To demonstrate the advantage of our method, we compare it with the following baselines: (i) MTransE [24]: a structure-based model for multilingual KG embeddings that provides a simple and automated solution. The model characterizes monolingual relations and deploys three different techniques to represent cross-lingual transitions, namely, axis calibration, translation vectors, and linear transformations.
(ii) JAPE [27]: an attribute-preserving embedding model that incorporates relation and attribute embeddings for entity alignment. (iii) GCN-Align [33]: employs GCNs to learn embeddings from both the structure and attribute information of entities for cross-lingual KG alignment. (iv) BootEA [26]: adopts a bootstrapping strategy that iteratively labels potential entity alignments as training data and leverages them for learning alignment-oriented embeddings. (v) MuGNN [28]: learns alignment-oriented KG embeddings by robustly encoding two KGs via KG completion and entity pruning. (vi) NAEA [29]: incorporates neighborhood subgraph-level information of entities and designs a neighborhood-aware attentional representation mechanism on multilingual KGs. The performances of the above baselines are taken from the results reported in their papers. We also evaluate the effectiveness of the component models of our approach: (i) Attribute-based model (CAREA-a): ignores the structure embedding component to assess the effect of attribute embedding alone; that is, only the attribute features are used to align entities, without the cotraining strategy. (ii) Structure-based model (CAREA-s): analogously, uses only the network structure embedding and ignores the attribute features. The other group of methods leverages both entity attributes and relations for entity alignment, including JAPE, GCN-Align, and CAREA. Table 2 summarizes the overall results of all compared methods on the three datasets.

Experiment Results
In the structure-based group, our model CAREA-s outperforms MTransE by at least 27.3% in terms of Hits@1 on the three datasets and exceeds MuGNN by at least 8.4%, demonstrating the effectiveness of our structure-based approach. In the other group, CAREA outperforms JAPE and GCN-Align by at least 28.6% on Hits@1 across all datasets, demonstrating the superiority of leveraging both entity attributes and KG structures for entity alignment. Overall, CAREA ranks best among all competing approaches across all datasets; for example, it outperforms NAEA and BootEA by at least 3.9% and 6.0% in terms of Hits@1, respectively. The component model CAREA-a is excluded from the above comparisons since it is the only one relying solely on attribute information to align entities, and it achieves lower scores: on DBP_ZH-EN, its Hits@1 and Hits@10 are 22.1% and 51.8%, respectively. This is mainly because of the heterogeneity across multilingual KGs and because some attributes may not be explicitly recorded or captured by the crawlers. Although CAREA-a does not perform as well as the structure-based approaches, it provides another view for entity alignment and improves the performance of our method on the KG alignment task.

Effects of Cotraining Algorithm.
This part examines the contribution of the cotraining process by reporting the performance of CAREA at each iteration, as shown in Figure 3. The trend reveals a gradual, similar increase in all evaluation metrics for both component models, verified on all three datasets.
The iterative cotraining algorithm significantly improves the performance, with an absolute increase of at least 10.5% in Hits@1 across all experiment datasets.
Both the attribute model f_attr and the structure model f_struc are enhanced in each iteration. After 3 to 4 iterations, the performance of the component models becomes stable.

Parameter Sensitivity Analysis.
This part investigates the sensitivity of the proposed CAREA to three primary parameters: (1) the proportion of prealigned entity pairs used for training, (2) the feature dimension d, and (3) the margin parameter c of the proposed objective function.
Sensitivity to data proportions: we run CAREA with training proportions from 10% to 50% in steps of 10%. Figure 4 illustrates how Hits@k changes with the proportion. As expected, the results on all datasets improve as the proportion increases; more training data provide more information linking the cross-lingual KGs. Figure 4 also shows that CAREA performs encouragingly when using only 10% of the aligned entities as training data; for example, Hits@1 and Hits@10 on DBP_ZH-EN are 56.2% and 82.4%, respectively. Therefore, CAREA is expected to adapt well to annotation-constrained scenarios.

Sensitivity to the feature dimension d: Figure 5 depicts the sensitivity of CAREA's performance to different feature dimensions.

Algorithm 1 (cotraining procedure of CAREA). Input: two KGs to be aligned, G^(1) and G^(2), a set L of prealigned entity pairs, and the parameters α_1, α_2, δ_1, δ_2 of the threshold function. Output: the parameters of f_attr and f_struc.
Sensitivity to the margin c: Figure 6 shows the model performance for different margin parameters c (from 1 to 4) in the objective function. The performance becomes steady when c ≥ 2, varying by at most 2.5% across all datasets. Therefore, CAREA remains stable when c varies within a reasonable range.

Conclusion and Future Work
The purpose of the present research was to investigate the cross-lingual entity alignment problem in KGs. This study constructed a cotraining-based approach, CAREA, to learn entity embeddings from two independent views of knowledge (relationships and attributes). CAREA consists of two component models, f_attr and f_struc, which extract the attribute and the relation information, respectively. In each iteration, the two models alternately take turns in a train-and-predict process, which gradually improves each model's performance. Experiments on three popular datasets confirm the effectiveness and superiority of CAREA on the entity alignment task. The insights into model construction gained from this study may assist complex multilingual and cross-domain knowledge organization and analysis. Future work will seek to extend the CAREA method to other applications, such as link prediction, information extraction, and entity classification.

Conflicts of Interest
The authors declare that they have no conflicts of interest.