Knowledge Graph Representation via Similarity-Based Embedding

Knowledge representation learning [17][18][19] is considered an important task for extracting latent features from the associated space. Recently, knowledge embedding [20,21], an effective method of feature extraction [22], was proposed to compress a high-dimensional and sparse space into a low-dimensional and continuous space. Knowledge embedding can be used to derive new, unknown facts from known knowledge bases (e.g., link prediction) and to determine whether a triplet is correct (e.g., triplet classification) [23]. Moreover, embedding representations [24] have been used to support question answering systems [25] and machine reading [26]. However, almost all embedding models use only the features and attributes in the knowledge graph to represent entities and relations, which overlooks the fact that entities and relations are projections of facts in an independent space. Besides, almost all of them have high time and memory-space complexities and cannot be applied to large-scale knowledge graphs.
In this research, we propose a novel similarity-based knowledge embedding model, namely, SimE-ER, which calculates the entity and relation similarities between two spaces (the independent and associated spaces). A sketch of the model framework is provided in Figure 1. The basic idea of this paper is that the independent and associated spaces are used to represent the irrelevant and interconnected features of entities (relations), respectively. In the independent space, the features of entities (relations) are independent and irrelevant to one another. By contrast, the features of entities (relations) in the associated space are interconnected and interacting, and each entity or relation can be denoted by the entities and relations connected with it. In addition, the similarity between the representations of the same entity (relation) in the two spaces is high. In Figure 1, we can see that, in the independent space, the features of entity $e_1$ are constructed only by themselves, whereas in the associated space, $e_1$ is denoted by other entities and relations, shown as blue points (lines). We want the features of $e_1$ in the independent and associated spaces to be similar. Besides, vector embedding is used to represent knowledge graphs.
In the associated space, take the entity Steve Jobs as an example, which appears in multiple triplets, such as (Steve Jobs, Apple Inc., FoundOf), (Steve Jobs, America, Nationality), and (Steve Jobs, Laurene Powell, CoupleOf). If we combine all corrupted triplets with the same missing entity, such as (..., Apple Inc., FoundOf), (..., America, Nationality), and (..., Laurene Powell, CoupleOf), it is easy to infer that the missing entity $e_3$ is Steve Jobs. Similarly, if we combine all the corrupted triplets with the same missing relation, such as (Steve Jobs, Apple Inc., ...), (Jack Ma, Alibaba, ...), and (Sundar Pichai, Google, ...), we can infer that the missing relation $r_1$ is FoundOf. The scenario is shown in Figure 2. Hence, using the correlation between different entities to represent features is an effective method. However, in practice, it is unsuitable to use only the correlation between different entities and omit the inherent features of each entity, such as its attributes, which are hard to represent through correlations between entities. Therefore, we construct the independent space, which preserves the inherent features of each entity. We combine the independent and associated spaces to represent the overall features of entities and relations, which in turn represents the knowledge graph more comprehensively. The motivation for employing both types of spaces is to model correlation while preserving individual specificity.
Compared with other embedding models, vector embedding has evident advantages in time and memory-space complexity. We evaluate SimE-E and SimE-ER on the popular tasks of entity prediction and relation prediction. The experimental results show that the proposed method is competitive with previous models.
Contributions. To summarize, the main contributions of this paper are as follows:
(i) We propose a similarity-based embedding model, namely, SimE-ER. In SimE-ER, we consider the entity and relation similarities of different spaces simultaneously, which allows the features of entities and relations to be extracted comprehensively.
(ii) Compared with other embedding models, our model has lower time and space complexity, which improves the efficiency of processing large-scale knowledge graphs.
(iii) Through thorough experiments on real-life datasets, our approach is shown to outperform existing state-of-the-art models on entity prediction and relation prediction tasks.
Organization. We discuss related work in Section 2 and then introduce our method, along with the theoretical analysis, in Section 3. Afterwards, experimental studies are presented in Section 4, followed by the conclusion in Section 5.

Related Work
In this section, we introduce several related works [19] published in recent years that achieve state-of-the-art results. According to how relation features are represented, we divide embedding models into two categories: matrix-based embedding models [27] and vector-based embedding models [28].

Matrix-Based Embedding Models.
In this part, matrices (tensors) are used to describe relation features.
Structured Embedding. The Structured Embedding model (SE) [29] considers that head and tail entities overlap in a relation-specific space where the triplet $(h, r, t)$ exists. It uses two mapping matrices, $\mathbf{M}_{rh}$ and $\mathbf{M}_{rt}$, to extract features from $h$ and $t$.
Single Layer Model. Compared with SE, the Single Layer Model (SLM) [30] uses a nonlinear activation function to transform the extracted features and considers the features after activation to be orthogonal to the relation features. The extracted features consist of the entities' features after mapping and a bias of their relation.
Neural Tensor Network. The Neural Tensor Network (NTN) [30,31] is a more complex model and considers that a tensor can be regarded as a better feature extractor than matrices.
Semantic Matching Energy. The basic idea of Semantic Matching Energy (SME) [32] is that, if the triplet is correct, the features of the head entity and the tail entity are orthogonal. Similar to SLM, the features of the head (tail) entity consist of the entity's features after mapping and a bias of its relation. There are two methods to extract features, i.e., linear and nonlinear.
Latent Factor Model. The Latent Factor Model (LFM) [33,34] assumes that the features of the head entity are orthogonal to those of the tail entity when the head entity is mapped into the relation-specific space. Its score function can be defined as $f_r(h, t) = \mathbf{h}^{\top} \mathbf{M}_r \mathbf{t}$, where $\mathbf{h}$, $\mathbf{M}_r$, and $\mathbf{t}$ denote the features of the head entity, relation, and tail entity, respectively.
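As a small illustration of the bilinear form $\mathbf{h}^{\top} \mathbf{M}_r \mathbf{t}$ used by LFM, the following sketch evaluates it in plain Python. The function name `bilinear_score` and the toy vectors are assumptions made for illustration; this is not LFM's reference implementation.

```python
def bilinear_score(h, M_r, t):
    """Return h^T M_r t for list-based vectors h, t and a square matrix M_r."""
    # First compute the matrix-vector product M_r t ...
    Mt = [sum(m_ij * t_j for m_ij, t_j in zip(row, t)) for row in M_r]
    # ... then take the dot product with h.
    return sum(h_i * v for h_i, v in zip(h, Mt))


# Toy 2-dimensional example (values are made up).
h = [1.0, 2.0]
t = [3.0, 4.0]
M_identity = [[1.0, 0.0], [0.0, 1.0]]  # reduces to the plain dot product
M_swap = [[0.0, 1.0], [1.0, 0.0]]      # swaps the two coordinates of t

score_identity = bilinear_score(h, M_identity, t)
score_swap = bilinear_score(h, M_swap, t)
```

Each relation matrix has as many entries as the square of the embedding dimension, which is one reason matrix-based models need more memory than models that represent a relation as a single vector.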

Vector-Based Embedding Models.
In this part, relations are described as vectors rather than matrices to improve the efficiency of representation models.
Translation-Based Model. The basic idea of the translation-based model TransE [23,35,36] is that the relation $\mathbf{r}$ is a translation vector between $\mathbf{h}$ and $\mathbf{t}$. The score function is $f_r(h, t) = \|\mathbf{h} + \mathbf{r} - \mathbf{t}\|_{L_1/L_2}$, where $\mathbf{h}$, $\mathbf{r}$, and $\mathbf{t}$ denote the head entity, relation, and tail entity embeddings, respectively. Because TransE only processes simple relations, other translation-based models [37][38][39] [...]
Convolutional Embedding Model. ConvE [45] transforms the features into 2D space and uses a convolutional neural network to extract the entity and relation features.
Compared with matrix-based embedding models, vector-based models have obvious advantages in time and memory-space complexity. Among these vector-based models, TransE is a classical baseline that has been used in many applications, TransR is an improvement on TransE that handles complex relation types, and DistMult and ComplEx use probability-based methods to represent knowledge and achieve state-of-the-art results.
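The translation idea behind TransE can be made concrete with a short sketch. It computes the distance between $\mathbf{h} + \mathbf{r}$ and $\mathbf{t}$ under either the L1 or the L2 norm; the function name `transe_score`, the `norm` argument, and the toy vectors are illustrative assumptions rather than the original TransE code.

```python
import math


def transe_score(h, r, t, norm=1):
    """Distance between h + r and t; a lower value means a more plausible triplet."""
    diff = [h_k + r_k - t_k for h_k, r_k, t_k in zip(h, r, t)]
    if norm == 1:
        return sum(abs(x) for x in diff)
    if norm == 2:
        return math.sqrt(sum(x * x for x in diff))
    raise ValueError("norm must be 1 or 2")


# Toy example: t_good is exactly h + r, t_bad is far away (values are made up).
h = [1.0, 0.0]
r = [0.0, 1.0]
t_good = [1.0, 1.0]
t_bad = [4.0, 5.0]

good_l1 = transe_score(h, r, t_good, norm=1)
bad_l1 = transe_score(h, r, t_bad, norm=1)
bad_l2 = transe_score(h, r, t_bad, norm=2)
```

Because the relation is a single vector, the per-relation storage grows linearly with the embedding dimension, in contrast to the matrix-based models above.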

Similarity-Based Model
Given a training set $S^{+}$ of triplets, each triplet $(h, r, t)$ has two entities $h, t \in E$ (the set of entities) and a relationship $r \in R$ (the set of relationships). Our model learns the entity embeddings ($\mathbf{h}_i$, $\mathbf{t}_i$, $\mathbf{h}_a$, $\mathbf{t}_a$) and relationship embeddings ($\mathbf{r}_i$, $\mathbf{r}_a$) to represent the features of entities and relations, where the subscripts $i$ and $a$ denote the independent and associated spaces, respectively. The entity and relation embeddings take values in $\mathbb{R}^{d}$, where $d$ is the dimension of the entity and relation embedding spaces.

Our Models.
The basic idea of our model is that, for each entity (relation), the features are divided into two parts. The first part describes the inherent features of entities (relations) in the independent space; its feature embedding vectors are denoted as $\mathbf{h}_i$, $\mathbf{r}_i$, $\mathbf{t}_i$. The second part describes triplet features in the associated space; its feature embedding vectors are denoted as $\mathbf{h}_a$, $\mathbf{r}_a$, $\mathbf{t}_a$. In the independent space, the feature vectors describe the inherent features that entities (relations) have. In the associated space, the features of $\mathbf{h}_a$ are composed of the other entities and relations that are connected with the entity.
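The entities (relations) in the associated space are projections of the entities (relations) in the independent space. Hence, the representation features of the same entity in the independent and associated spaces are similar, while the representation features of different entities are not similar. The formula can be described as follows:

$$\mathbf{h}_i \approx \mathbf{r}_a \odot \mathbf{t}_a, \qquad \mathbf{r}_i \approx \mathbf{h}_a \odot \mathbf{t}_a, \qquad \mathbf{t}_i \approx \mathbf{h}_a \odot \mathbf{r}_a, \tag{1}$$

where $\odot$ denotes the element-wise product. In detail, in (1), if we combine the features of $\mathbf{r}_a$ and $\mathbf{t}_a$, we can obtain part of the $\mathbf{h}_i$ features. That is to say, the $\mathbf{h}_i$ features are similar to $\mathbf{r}_a \odot \mathbf{t}_a$. In this paper, we use the cosine to calculate the similarity between different spaces. Taking the head entity as an example, the cosine similarity between different spaces can be denoted as

$$\cos(\mathbf{h}_i, \mathbf{r}_a \odot \mathbf{t}_a) = \frac{\mathrm{Dot}(\mathbf{h}_i, \mathbf{r}_a \odot \mathbf{t}_a)}{\|\mathbf{h}_i\| \, \|\mathbf{r}_a \odot \mathbf{t}_a\|} = \frac{\mathrm{Sum}(\mathbf{h}_i \odot \mathbf{r}_a \odot \mathbf{t}_a)}{\|\mathbf{h}_i\| \, \|\mathbf{r}_a \odot \mathbf{t}_a\|}, \tag{2}$$

where Dot denotes the dot product and Sum denotes the summation over the vector elements. $\mathrm{Sum}(\mathbf{h}_i \odot \mathbf{r}_a \odot \mathbf{t}_a)$ calculates the similarity, and $\|\mathbf{h}_i\| \, \|\mathbf{r}_a \odot \mathbf{t}_a\|$ constrains the length of the features. To reduce the training complexity, we consider only the numerator and use regularization terms to replace the denominator. Hence, the similarity of the head entity features in the independent and associated spaces can be described as

$$\mathrm{Sim}(h) = \mathrm{Sum}(\mathbf{h}_i \odot \mathbf{r}_a \odot \mathbf{t}_a). \tag{3}$$

We expect the value of $\mathrm{Sum}(\mathbf{h}_i \odot \mathbf{r}_a \odot \mathbf{t}_a)$ to be larger when $\mathbf{h}_i$ and $\mathbf{r}_a \odot \mathbf{t}_a$ denote the same head entity, and smaller otherwise.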
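To make the relationship between the full cosine similarity and its numerator concrete, the sketch below computes both for a head entity. The helper names (`hadamard`, `sim_numerator`, `cosine_similarity`) and the toy vectors are assumptions introduced for illustration only.

```python
import math


def hadamard(*vectors):
    """Element-wise product of any number of equal-length vectors."""
    return [math.prod(values) for values in zip(*vectors)]


def sim_numerator(h_i, r_a, t_a):
    """Sum(h_i * r_a * t_a): the numerator that is kept as the similarity."""
    return sum(hadamard(h_i, r_a, t_a))


def cosine_similarity(h_i, r_a, t_a):
    """Full cosine between h_i and the combined vector r_a * t_a."""
    combined = hadamard(r_a, t_a)
    norm_h = math.sqrt(sum(x * x for x in h_i))
    norm_c = math.sqrt(sum(x * x for x in combined))
    return sim_numerator(h_i, r_a, t_a) / (norm_h * norm_c)


# Toy vectors (made up): h_same equals r_a * t_a, h_other is orthogonal to it.
r_a = [1.0, 2.0]
t_a = [3.0, 0.5]
h_same = [3.0, 1.0]
h_other = [-1.0, 3.0]

num_same = sim_numerator(h_same, r_a, t_a)
num_other = sim_numerator(h_other, r_a, t_a)
cos_same = cosine_similarity(h_same, r_a, t_a)
cos_other = cosine_similarity(h_other, r_a, t_a)
```

Dropping the denominator saves two norm computations per triplet, but the numerator alone is unbounded, which is why the vector lengths then need to be controlled by regularization.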
To represent entities more comprehensively, we consider the similarities of the head and tail entities simultaneously in the score function. The embedding model based on the similarity of head and tail entities is named SimE-E.
On the basis of entity similarity, we also consider relation similarity, which enhances the representation of relation features. The comprehensive model, which considers all the similarities of entity (relation) features in the different spaces, is named SimE-ER.
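The following sketch shows one way the head, tail, and relation similarities could be combined into score functions. It assumes that each similarity has the form Sum of an element-wise triple product, with the independent-space vector of the item being scored and the associated-space vectors of the other two, and that the similarities are combined by simple addition. Both the addition and all names (`score_sime_e`, `score_sime_er`, the `"i"`/`"a"` dictionary keys) are assumptions for illustration; the exact score functions of SimE-E and SimE-ER are not reproduced here.

```python
def triple_sum(x, y, z):
    """Sum over the element-wise product of three equal-length vectors."""
    return sum(a * b * c for a, b, c in zip(x, y, z))


def sim_head(h, r, t):
    # Independent-space head vs. associated-space relation and tail.
    return triple_sum(h["i"], r["a"], t["a"])


def sim_tail(h, r, t):
    # Independent-space tail vs. associated-space head and relation.
    return triple_sum(h["a"], r["a"], t["i"])


def sim_relation(h, r, t):
    # Independent-space relation vs. associated-space head and tail.
    return triple_sum(h["a"], r["i"], t["a"])


def score_sime_e(h, r, t):
    # ASSUMPTION: head and tail similarities are added.
    return sim_head(h, r, t) + sim_tail(h, r, t)


def score_sime_er(h, r, t):
    # ASSUMPTION: the relation similarity is added on top of the entity score.
    return score_sime_e(h, r, t) + sim_relation(h, r, t)


# Each item carries an independent ("i") and an associated ("a") embedding.
h = {"i": [1.0, 2.0], "a": [1.0, 1.0]}
r = {"i": [2.0, 0.0], "a": [1.0, 1.0]}
t = {"i": [0.0, 3.0], "a": [1.0, 2.0]}

entity_score = score_sime_e(h, r, t)
full_score = score_sime_er(h, r, t)
```

If the independent and associated embeddings of every item are forced to be identical, each of the three terms collapses to the same Sum over $\mathbf{h} \odot \mathbf{r} \odot \mathbf{t}$, which is the sense in which a single-space trilinear model is a special case.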

Training.
To learn the proposed embeddings and encourage discrimination between golden triplets and incorrect triplets, we minimize the following logistic ranking loss function over the training set:

$$\mathcal{L} = \sum_{(h, r, t) \in S} \log\bigl(1 + \exp(-Y_{hrt} \, f_r(h, t; \Theta))\bigr), \tag{8}$$

where $\Theta$ corresponds to the embeddings $\mathbf{h}_i, \mathbf{h}_a, \mathbf{r}_i, \mathbf{r}_a, \mathbf{t}_i, \mathbf{t}_a \in \mathbb{R}^{d}$ and $Y_{hrt}$ is the label of a triplet: $Y_{hrt} = 1$ denotes that $(h, r, t)$ is positive, and $Y_{hrt} = -1$ denotes that $(h, r, t)$ is negative. $S$ is a triplet set [28] that contains both the positive triplet set $S^{+}$ and the negative triplet set $S^{-}$.
The set of negative triplets is composed of training triplets with either the head (tail) entity or the relation replaced by a random entity or relation:

$$S^{-} = \{(h', r, t) \mid h' \in E\} \cup \{(h, r, t') \mid t' \in E\} \cup \{(h, r', t) \mid r' \in R\}. \tag{9}$$

Only one entity or relation is replaced in each corrupted triplet, and each position is chosen with the same probability. To prevent overfitting, constraints on the length of the entity (relation) features, referred to as (10), are imposed for SimE-E and SimE-ER when minimizing the loss function $\mathcal{L}$. By means of soft constraints, we convert them into a regularization term added to the loss function, where $\lambda$ is a hyperparameter that weighs the importance of the soft constraints. We use an improved stochastic gradient descent method (Adagrad) [46] to train the models. Compared with SGD, Adagrad shrinks the learning rate effectively as the number of iterations increases, which makes it less sensitive to the choice of the learning rate.
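A logistic ranking loss over triplets labeled $+1$ or $-1$ is commonly written as the sum of $\log(1 + \exp(-Y f))$ over all triplets, where $f$ is the score and $Y$ the label. The sketch below assumes that form; the function name `logistic_loss` and the sample scores are illustrative, and the numerically stable rewrite of the softplus is an implementation choice, not something taken from the paper.

```python
import math


def logistic_loss(scores, labels):
    """Sum of log(1 + exp(-y * f)) over scored triplets with labels +1 / -1."""
    total = 0.0
    for f, y in zip(scores, labels):
        z = -y * f
        # Stable softplus: log(1 + e^z) = max(z, 0) + log(1 + e^{-|z|}).
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return total


# A positive triplet with a high score contributes little loss,
# while a positive triplet with a low score contributes a lot.
loss_neutral = logistic_loss([0.0], [1])
loss_pos_high = logistic_loss([5.0], [1])
loss_pos_low = logistic_loss([-5.0], [1])
loss_neg_high = logistic_loss([5.0], [-1])
```

Minimizing this quantity pushes the scores of positive triplets up and the scores of negative triplets down.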
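The corruption procedure for building negative triplets can be sketched as follows. Triplets are written in $(h, r, t)$ order, one of the three positions is chosen uniformly at random, and the item at that position is replaced. The function name `corrupt`, the toy vocabulary, and the choice to exclude the original item from the candidates are assumptions for illustration; the sketch also does not check whether a corrupted triplet happens to be a true fact.

```python
import random


def corrupt(triplet, entities, relations, rng):
    """Replace exactly one of head, relation, or tail with a different random item."""
    h, r, t = triplet
    slot = rng.choice(("head", "relation", "tail"))  # each slot equally likely
    if slot == "head":
        h = rng.choice([e for e in entities if e != h])
    elif slot == "tail":
        t = rng.choice([e for e in entities if e != t])
    else:
        r = rng.choice([x for x in relations if x != r])
    return (h, r, t)


# Toy vocabulary built from the running example in this paper.
entities = ["Steve Jobs", "Apple Inc.", "America", "Laurene Powell"]
relations = ["FoundOf", "Nationality", "CoupleOf"]
positive = ("Steve Jobs", "FoundOf", "Apple Inc.")

rng = random.Random(0)  # fixed seed so the sketch is reproducible
negatives = [corrupt(positive, entities, relations, rng) for _ in range(5)]
```

Drawing a fresh set of negatives in every epoch means the model sees different corrupted triplets over the course of training.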

Experiments and Analysis
In this section, our models SimE-E and SimE-ER are evaluated and compared with several baselines that have been shown to achieve state-of-the-art performance. First, two classical tasks are adopted to evaluate our models: entity prediction and relation prediction. Then, we use case studies to verify the effectiveness of our models. Finally, based on the experimental results, we analyze the time and memory-space costs.
4.1. Datasets. We use two real-life knowledge graphs to evaluate our method:
(i) WordNet (https://wordnet.princeton.edu/download), a classical dictionary, is designed to describe the correlation and semantic information between different words. Entities describe the concepts of different words, and relationships describe the semantic relevance between different entities, such as instance hypernym, similar to, and member of domain topic. The data version we use is the same as in [23], where triplets are denoted as (sway 2, has instance, brachiate 1) or (felis 1, member meronym, catamount 1). We adopt a subset of WordNet named WN18 [23].
(ii) Freebase (code.google.com/p/wiki-links), a huge and continually growing knowledge graph, describes a large number of facts about the world. In Freebase, entities are described by labels, and relations are denoted by a hierarchical structure. We employ two subsets of Freebase, named FB15K and FB40K [23].
We show the dataset statistics in Table 2. From Table 2, we see that, compared with WN18, FB15K and FB40K have more relationships and can be regarded as typical large-scale knowledge graphs.

Experiment Setup
Evaluation Protocol. For each triplet in the test set, each item of the triplet (head entity, tail entity, or relation) is removed and replaced in turn by every item in the dictionary. The score function is used to score these corrupted triplets, the scores are sorted, and the rank of the correct entity or relation is stored. For the relation in each test triplet, the whole procedure is repeated. In fact, we need to consider that some correct triplets are generated in the process of removal and replacement. Hence, we filter out of the corrupted triplets those correct triplets that actually exist in the training and validation sets. The evaluation measure before filtering is named "Raw", and the measure after filtering is named "Filter". We use two evaluation measures to evaluate our approach, similar to [42]:
(i) MRR is an improved version of MeanRank [23]. MeanRank calculates the average rank of all the entities (relations), whereas MRR calculates the average reciprocal rank of all the entities (relations). Compared with MeanRank, MRR is less sensitive to outliers. We report the results using both the Filter and Raw rules.
(ii) Hits@n reports the ratio of correct entities among the top-n ranked entities. Because the number of entities is much larger than that of relations, we take Hits@1, Hits@3, and Hits@10 for the entity prediction task and Hits@1, Hits@2, and Hits@3 for the relation prediction task.
A state-of-the-art embedding model should have higher MRR and Hits@n.
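The evaluation measures can be sketched in a few lines. The code assumes that a higher score means a more plausible candidate (for a distance-based score the comparison would be reversed), and the names `rank_of_correct`, `mrr`, `mean_rank`, and `hits_at` as well as the toy scores are introduced here for illustration.

```python
def rank_of_correct(scores, correct, known=()):
    """Rank (1 = best) of `correct` among all candidates.

    `scores` maps candidate -> score, higher is better (an assumption).
    Candidates listed in `known` are other true answers; skipping them
    gives the "Filter" rank, while leaving `known` empty gives the "Raw" rank.
    """
    target = scores[correct]
    better = sum(
        1
        for cand, s in scores.items()
        if cand != correct and cand not in known and s > target
    )
    return 1 + better


def mrr(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)


def mean_rank(ranks):
    return sum(ranks) / len(ranks)


def hits_at(ranks, n):
    return sum(1 for r in ranks if r <= n) / len(ranks)


# Toy candidate scores for one test triplet (values are made up).
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw_rank = rank_of_correct(scores, "c")
filtered_rank = rank_of_correct(scores, "c", known={"a"})

# One outlier rank (1000) dominates MeanRank but barely moves MRR.
ranks = [1, 2, 4, 1000]
```

With the ranks above, MeanRank is driven almost entirely by the single outlier, while MRR stays close to the value obtained from the three well-ranked cases, which illustrates why MRR is the less sensitive measure.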
Baselines. First, we compare the proposed methods with CP, which uses canonical polyadic decomposition to extract the entity and relation features; then we compare the proposed methods with TransE, which considers that the tail entity features are close to the combined features of the head entity and relation. In addition, TransR [47], ER-MLP [48], DistMult [41], and ComplEx [43] are used for comparison with our methods. We train CP [49], DistMult, ComplEx, TransE, and TransR using the code provided by the authors. We choose the embedding dimension $d$ among {20, 50, 100, 200}, the regularization weight $\lambda$ among {0, 0.003, 0.01, 0.1, 0.5, 1}, the learning rate among {0.001, 0.01, 0.1, 0.2, 0.5}, and the ratio of negative to correct samples among {1, 5, 10, 50, 100}. The negative samples differ between epochs.
Implementation. For experiments using SimE-E and SimE-ER, we select the dimension $d$ of the entity and relation embeddings among {50, 100, 150, 200}, the regularization weight $\lambda$ among {0, 0.01, 0.1, 0.5, 1}, the ratio of negative to correct samples among {1, 5, 10, 50, 100}, and the mini-batch size among {100, 200, 500, 1,000}. We use an improved stochastic gradient descent method (Adagrad) [46] to minimize the loss function. As the number of epochs increases, the learning rate in Adagrad decreases, so Adagrad is less sensitive to the choice of the learning rate. The initial values of both SimE-E and SimE-ER are generated by a random function with range $(-6/\sqrt{d}, 6/\sqrt{d})$, where $d$ is the dimension of the feature vector. Training is stopped using early stopping on the validation set MRR (using the Filter measure), computed every 50 epochs with a maximum of 2000 epochs.
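The initialization range and the shrinking step size of Adagrad can be illustrated with the sketch below. It assumes a uniform draw within the stated range and the standard Adagrad rule (divide each gradient by the square root of the accumulated squared gradients); the function names, the default learning rate, and the `eps` term are assumptions for illustration rather than the settings used in the experiments.

```python
import math
import random


def init_embedding(d, rng):
    """Draw a d-dimensional vector uniformly within +/- 6 / sqrt(d)."""
    bound = 6.0 / math.sqrt(d)
    return [rng.uniform(-bound, bound) for _ in range(d)]


def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """In-place Adagrad update; `accum` holds the running sum of squared gradients."""
    for k, g in enumerate(grads):
        accum[k] += g * g
        params[k] -= lr * g / (math.sqrt(accum[k]) + eps)


rng = random.Random(0)
vec = init_embedding(100, rng)  # every entry lies within +/- 0.6

# With a constant gradient, each successive step is smaller than the last.
params, accum = [0.0], [0.0]
adagrad_step(params, [1.0], accum)
first_step = -params[0]
adagrad_step(params, [1.0], accum)
second_step = -params[0] - first_step
```

Because the effective step size is rescaled by the gradient history of each coordinate, a moderately mis-specified initial learning rate matters less than it would with plain SGD.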
In SimE-E model, the optimal configurations on validation set are

T-test.
In the experiments, we run each model 15 times independently and calculate the mean and standard deviation. Then we use Student's t-test with $p$-value $= 0.95$ to compare the performance of different models [50,51].
Let $\mu_1$ and $\sigma_1$ be the mean and standard deviation of model 1 over $n_1$ runs, and let $\mu_2$ and $\sigma_2$ be the mean and standard deviation of model 2 over $n_2$ runs. We then construct the hypothesis, compute the t statistic, and determine the degrees of freedom (df) of the t-distribution. In the entity and relation prediction tasks, we calculate the mean and standard deviation of MRR and Hits@n and compare the performance of the models with the t-test.
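A two-sample t statistic computed from summary statistics can be sketched as follows. The sketch assumes the unequal-variance (Welch) form of the statistic with the Welch-Satterthwaite degrees of freedom; this choice, the function name `welch_t`, and the example means and standard deviations are assumptions for illustration and are not results from the paper. Only the critical value 1.701 for 28 degrees of freedom is taken from the text.

```python
import math


def welch_t(mean1, std1, n1, mean2, std2, n2):
    """Return (t, df) for two samples described by mean, std, and run count."""
    v1 = std1 ** 2 / n1
    v2 = std2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Hypothetical summary statistics for two models, 15 runs each.
t_value, df = welch_t(0.74, 0.01, 15, 0.70, 0.01, 15)

# One-sided critical value for 28 degrees of freedom, as quoted in the text.
CRITICAL_T_28 = 1.701
model_1_is_better = t_value > CRITICAL_T_28
```

With 15 runs per model and equal standard deviations, the approximation yields 28 degrees of freedom, which matches the critical value used in the comparison of SimE-ER and ComplEx.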

Link Prediction.
For link prediction [52][53][54], we test two subtasks: entity prediction and relation prediction. Entity prediction aims to predict the missing $h$ or $t$ entity of a fact triplet $(h, r, t)$; similarly, relation prediction determines which relation is most suitable for a corrupted triplet $(h, *, t)$.
Entity Prediction. This set of experiments tests the models' ability to predict entities. Experimental results (mean plus/minus standard deviation) on WN18, FB15K, and FB40K are shown in Tables 3, 4, and 5, and we can observe the following:
(i) On WN18, a small-scale knowledge graph, ComplEx achieves state-of-the-art results on MRR and Hits@n. However, on FB15K and FB40K, two large-scale knowledge graphs, SimE-E and SimE-ER achieve excellent results on MRR and Hits@n, and the values of Hits@10 are up to 0.868 and 0.889, respectively. These results indicate that our models can represent different kinds of knowledge graphs effectively, especially large-scale knowledge graphs.
(iv) Compared with DistMult, a special case of our models, SimE-E and SimE-ER achieve better results, especially on FB15K, where the filter MRR is up to 0.740. The results indicate that our models, which use irrelevant and interconnected features to construct the independent and associated spaces, can represent the features of entities and relations more comprehensively.
We use the t-test to evaluate the effectiveness of our models, and the evaluation results can prove that on FB15K [...]
Tables 6, 7, and 8 show the prediction performance on WN18 and FB15K. From the tables, we discover the following:
(i) Similar to the results in entity prediction, on WN18, ComplEx achieves better results on MRR and Hits@1, and SimE-ER obtains better results on Hits@2 and Hits@3. On FB15K, except for Hits@1, the results of SimE-ER are better than those of ComplEx and the other baselines, and the value of Hits@3 is up to 0.842, which is much higher (an improvement of 20.1%) than the state-of-the-art baselines. On FB40K, SimE-ER achieves state-of-the-art results on all the measures; in particular, the filter MRR is up to 0.603.
(ii) In the entity prediction task, the results of SimE-E and SimE-ER are similar. However, in the relation prediction task, SimE-ER achieves significantly better results on Raw MRR, Hits@2, and Hits@3. We use the t-test to verify the results, and the t-values are larger than [...]
(iii) On FB15K, the gap is significant, and SimE-E and SimE-ER outperform the other models, with an MRR (Filter) of 0.593 and a Hits@3 of 0.842. On both datasets, CP and TransE perform the worst, which illustrates the feasibility of learning knowledge embedding in the first case and the power of using two mutually restraining parts to represent entities and relations in the second.
We also use the t-test to evaluate our model; i.e., comparing SimE-ER with ComplEx on filter MRR gives $t = 35.72$, which is larger than $t_{0.95}(28) = 1.701$. The t-test results indicate that SimE-ER performs better than the other baselines on FB15K and FB40K.
To analyze the relation features, [...]
(ii) On FB15K, the time costs of SimE-E and SimE-ER per iteration are 5.37 s and 6.63 s, respectively, which are lower than 7.53 s, the time cost of TransE, which has fewer parameters. The reason is that the mini-batch size of TransE is 2415, which is much larger than the mini-batch sizes of SimE-E and SimE-ER. Besides, SimE-E and SimE-ER are trained for 700 iterations, which takes 3760 s and 4642 s in total, respectively.
(iii) Because SimE-E and SimE-ER have low complexity and high accuracy, they can easily be applied to large-scale knowledge graphs while using fewer computing resources and less running time.

Conclusion
In this paper, we propose a novel similarity-based embedding model, SimE-ER, that extracts features from a knowledge graph. SimE-ER considers that the similarity of the same entities (relations) is high between the independent and associated spaces. Compared with other representation models, SimE-ER is more effective in extracting entity (relation) features and represents entity and relation features more flexibly and comprehensively. Besides, SimE-ER has lower time and memory complexities, which indicates that it is applicable to large-scale knowledge graphs. In the experiments, our approach is evaluated on entity prediction and relation prediction tasks. The results show that SimE-ER achieves state-of-the-art performance. We will explore the following future work:
(i) In addition to the facts in a knowledge graph, there are also many logical and hierarchical correlations between different facts. How to translate this hierarchical and logical information into a low-dimensional vector space is an attractive and valuable problem.
(ii) In the real world, extracting relations and entities from large-scale text is an important yet open problem. Combining the latent features of knowledge graphs and text collections is a feasible way to connect structured and unstructured data. It is expected to enhance the accuracy and efficiency of entity (relation) extraction.
Figure 1: Framework of our model.

Table 1: Complexities of representation models.

Table 2: Dataset statistics.

3.3. Comparison with Existing Models. To compare the time and memory-space complexities of different models, we show the results in Table 1, where $d$ represents the dimension of the entity and relation embeddings, $k$ is the number of tensor slices, and $n_e$ and $n_r$ are the numbers of entities and relations, respectively. The comparison results are as follows:
(i) Except for DistMult and TransE, the baselines use a relation matrix to project entities' features into the relation space, which gives these models high memory-space and time complexities. Compared with these models, SimE-E and SimE-ER have lower time complexity and can be used on large-scale knowledge graphs more effectively.
(ii) In comparison to TransE, SimE-E and SimE-ER can dynamically control the ratio of positive and negative triplets, which enhances the robustness of the representation models.
(iii) DistMult is a special case of SimE-E and SimE-ER in which only a single similarity of entity or relation is considered. That is to say, SimE-E and SimE-ER can extract the features of entities (relations) more comprehensively.

Table 3: Experimental results of entity prediction on WN18.

Table 4: Experimental results of entity prediction on FB15K.

Table 5: Experimental results of entity prediction on FB40K.

SimE-E and SimE-ER are better than ComplEx. The reason is that the number of relations is much larger than in WN18, and the relation structure is more complex and hard to represent, which has an obvious influence on the representation ability of ComplEx.
(iii) The results of SimE-E and SimE-ER are similar to each other. The largest margin is 0.013, on the filtered MRR on FB15K. This demonstrates that both SimE-E and SimE-ER can extract the entity features in a knowledge graph and predict the missing entities effectively.

Table 6: Experimental results of relation prediction on WN18.

Table 7: Experimental results of relation prediction on FB15K.

Table 8: Experimental results of relation prediction on FB40K.

Table 9: MRR for each relation on WN18.

Table 9 shows the MRR with Filter of each relation on WN18, where # denotes the number of triplets for each relation in the test set. From Table 9, we conclude the following:
(ii) Compared with SimE-E, the relation MRRs of SimE-ER are much better on most relations, such as hypernym, hyponym, and derivationally related form.
(iii) On almost all results of relation MRR, SimE-ER is better than DistMult, a special case of SimE-ER. That is to say, compared with a single embedding space, using two different spaces to describe entity and relation features can achieve better performance.

Case Study. Table 10 shows the detailed prediction results on the test set of FB15K. It illustrates the performance of our models. Given head and tail entities, the top-5 predicted relations and relative scores of SimE-ER are depicted in Table 10. From the table, we observe the following:

[...] + GeForce GTX TITAN. We report the average running time over one hundred iterations as the running time of each iteration. From Table 11, we observe the following:
(i) Except for DistMult, SimE-E and SimE-ER have lower time and memory complexities compared with

Table 10: Case study of SimE-ER.