Representation Learning of Knowledge Graphs with Embedding Subspaces

Most existing knowledge graph embedding models are supervised and rely heavily on the quality and quantity of the available labelled training data. High-quality triples are costly to obtain, and the data sources suffer from severe sparsity, which can leave long-tail entities insufficiently trained. In contrast, unstructured text that encodes entity and relational knowledge is available in large quantities. Word vectors of entity names estimated from unlabelled raw text with a language model encode syntactic and semantic properties of entities. However, because these feature vectors are estimated by minimizing the prediction error of an unsupervised task, they may not be the best for knowledge graphs. We propose a two-phase approach that adapts unsupervised entity name embeddings to a knowledge graph subspace and jointly learns the adaptation matrix and the knowledge representation. Experiments on Freebase show that our method relies less on labelled data and outperforms the baselines when labelled data is scarce. In particular, it is applicable to the zero-shot scenario.


Introduction
There are various knowledge graphs constructed with great effort, such as Freebase [1] and WordNet [2], which have become fundamental resources for many intelligent applications such as web search and question answering [3]. However, the validity and completeness of knowledge graphs cannot always be guaranteed. For instance, Freebase [1] currently contains over 80 million entities and thousands of relations, along with billions of facts about these entities. Even so, this is still only a small part of all human knowledge. The ability of a question answering system based on a knowledge graph is limited if it cannot infer and fill in missing facts from the knowledge it already holds. Therefore, knowledge reasoning methods that predict new facts based only on the knowledge already in the graph are desired. Such methods are a key complement to extracting relations from plain text.
Knowledge representation is the basis of knowledge reasoning. For example, with a graph-based knowledge representation, computing or inferring a semantic relationship between entities requires specially designed graph algorithms. Knowledge graphs represent entities as nodes and relations as different types of edges, in the form of triples (head entity, relation, tail entity) [4]. Graph-based knowledge representation faces many challenges, e.g., computational efficiency and data sparsity. In recent years, great progress has been made in knowledge representation learning based on embedding techniques [5]. Embedding learning represents the entities and relations of a knowledge graph as low-dimensional dense real-valued vectors, embedding the knowledge graph into a continuous vector space while preserving its structural characteristics [6]. Usually, in this low-dimensional space, the closer two vectors are, the more similar their semantics. Because semantic relations among entities and relations can be computed efficiently in the low-dimensional space and data sparsity is greatly alleviated, the performance of knowledge extraction, fusion, and reasoning is greatly improved.
Most existing embedding learning methods are supervised and generally use the available structured knowledge to train models [7,8]. The success of a supervised system largely depends on the quantity and quality of the available training data; this is sometimes even more important than the choice of learning algorithm. High-quality structured knowledge is costly to obtain, and the resulting knowledge graph suffers from severe data sparsity, so long-tail entities cannot be trained effectively. On the other hand, unstructured text containing information about the relevant entities and relations is easy to obtain in large quantities. Word vector representations of entity names can be obtained from the co-occurrence patterns of words in a large unlabelled text corpus.
The word vectors estimated from raw text with a language model are low-dimensional dense vectors that capture syntactic and semantic attributes of words [9]. However, since these vectors are obtained by minimizing prediction errors on generic unsupervised tasks, they might not be the best for knowledge graphs.
This paper proposes a knowledge representation learning approach in which unsupervised word vectors are adapted to a knowledge graph subspace using a small amount of labelled data. The intuition behind our approach is the following. For a specific task, only some of the latent aspects captured by the word vectors are useful. Hence, instead of updating the word vectors directly with the available labelled data, we estimate a projection of these vectors into a low-dimensional subspace.
This simple method has two key advantages. First, we obtain low-dimensional vectors, which are suitable for complex knowledge representation learning tasks. Second, we can learn new vectors for all entities, even those that are missing from the labelled data.

Structure-Based Knowledge Embedding.
Structure-based embedding learning models learn entity and relation vector representations from the structural information contained in the triples of the knowledge graph. Most existing embedding methods belong to this category.
Most methods of this kind have been designed within the framework of relational learning from latent features, operating on latent representations of entities and relations, such as models based on collective matrix factorization [10,11] or tensor factorization [7,12,13]. Many approaches have focused on increasing the expressivity and generality of the model in energy-based frameworks for learning embeddings of entities in low-dimensional spaces [14-16]. The greater expressivity of these models comes at the expense of substantial increases in model complexity and computational cost.
Compared to the early embedding models, TransE [8] is simpler and more effective. TransE regards the relations in the knowledge graph as translation vectors. In a triple (h, r, t), h is a head entity and r is a relation that connects h to a tail entity t. TransE uses the relation vector r as the translation between the head entity vector h and the tail entity vector t, which is why it is also referred to as a translation model. If the triple (h, r, t) holds, then h + r ≈ t in the vector space; that is, t should be the closest vector to h + r. Otherwise, h + r should be far away from t.
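The translation idea can be illustrated with a small scoring function. The following is a minimal sketch rather than the TransE implementation: it assumes the L2 distance as the dissimilarity measure, and the function name `transe_score` and the toy vectors are chosen here purely for illustration.

```python
import math

def transe_score(h, r, t):
    """L2 distance between h + r and t; a lower value means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))

# Toy 2-dimensional vectors, chosen by hand for illustration.
h = [1.0, 0.0]
r = [0.0, 1.0]
t_true = [1.0, 1.0]    # exactly h + r, so the distance is 0
t_false = [-1.0, 0.0]  # far from h + r

score_true = transe_score(h, r, t_true)
score_false = transe_score(h, r, t_false)
```

A triple that holds gets a distance near zero, while a corrupted triple gets a larger one; training pushes the two apart.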
Later knowledge embedding learning models are mostly extensions of TransE, such as TransH [17], TransR [18], TransD [19], and TransA [20]. PTransE [21] proposes a representation learning model based on multiple-step relation paths. TranSparse [22] deals with the heterogeneity and imbalance of knowledge graphs using adaptive sparse matrices. The recently proposed ProjE [23] achieves state-of-the-art performance in this branch with relatively few model parameters.
Luo et al. [24] propose a two-stage embedding scheme to improve the performance of structure-based embedding models such as TransE, SME, and SE. It first uses a word embedding model to learn initial embeddings of entities and relations from relation paths, viewing the entities and relations in a path as pseudowords. RDF2Vec [25], metapath2vec [26], and Hussein et al. [27] transform the graph data into sequences of entities and use an unsupervised language model to learn entity representations, treating sequences of entities as sentences. However, these methods still utilize only the structural information. Recently, Dettmers et al. [28] introduced a multilayer convolutional network model, ConvE, for link prediction, which uses 2D convolution over embeddings and multiple layers of nonlinear features to model knowledge graphs.

Knowledge Embedding With Multisources Information.
Structure-based knowledge embedding learning models utilize only the triple structure of the knowledge graph, and a large amount of other related information, such as the descriptions and categories of entities and relations, is not yet used efficiently.
There are some studies on using such information to learn knowledge representations. NTN [7] represents an entity as the average of the word embeddings in its entity name. Wang et al. [29] align the embeddings of entities and words in the same space by utilizing entity names and Wikipedia anchors. Recently, DKRL [30] extended TransE by considering the text of entity descriptions provided by knowledge bases (i.e., knowledge graphs) and building entity vector representations with a CNN model over the entity descriptions, which can model the word sequence information in text. When a new entity that is not in the knowledge base occurs, DKRL can generate an entity vector from its short description. The SSP model recently proposed by Xiao et al. [31] learns entity semantic vectors from entity description text with a topic model and then projects the structural loss onto the semantic space codetermined by the head and tail entities to learn vector representations of entities and relations. The learning process of the SSP model is more closely tied to text information.
Most of the above models mainly use the text of entity names and entity descriptions. This limits the use of the abundant unstructured text available on the Web.

Knowledge Representation Learning Based on Subspace Projection
Some work [7, 29-31] shows that learning knowledge representations by fusing multisource information can effectively improve their performance. Abundant Web text contains a large quantity of unstructured knowledge related to entities. Word embedding is a useful unsupervised technique that provides a compact real-valued vector representation for each word from unlabelled free text. We use word embedding to obtain vector representations of entity names from abundant Web text and then adapt these vectors, through projection, to a subspace that is suitable for representing the entities of the knowledge graph. Therefore, knowledge representation learning can be divided into two stages: unsupervised estimation of entity name vectors, and joint supervised learning of the subspace adaptation matrix and the knowledge representation.

Estimation of Unsupervised Entity Vectors.
We obtain vector representations of the entity names in the knowledge graph from unlabelled Web text through unsupervised word vector learning and regard them as the initial estimates of the entity vectors. Word vectors are usually trained by optimizing an objective function on unlabelled data [9, 32-34]. CBOW [34] and skip-gram [9] learn word vectors that capture many syntactic and semantic relations between words. Thus, in this study we use skip-gram [9] to learn the word vector representations of entity names.
Given a sequence of training words $w_1, w_2, \ldots, w_T$, the objective of skip-gram is to maximize the average log probability

$$\frac{1}{T} \sum_{t=1}^{T} \sum_{-c \le j \le c,\, j \ne 0} \log p(w_{t+j} \mid w_t),$$

where $c$ is the size of the training context and $p(w_{t+j} \mid w_t)$ is defined using the softmax function

$$p(w_{t+j} \mid w_t) = \frac{\exp\left(v_{w_{t+j}}^{\top} v_{w_t}\right)}{\sum_{w=1}^{W} \exp\left(v_w^{\top} v_{w_t}\right)},$$

where $v_w$ is the vector representation of word $w$ and $W$ is the number of words in the vocabulary. Like most neural network models, skip-gram is trained with gradient descent. The trained embedding vectors $v_w \in \mathbb{R}^{e \times 1}$ capture information about each word $w$ and the context around it. Therefore, these vectors can be used as input to other learning algorithms to further improve performance.
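The softmax definition of the context probability can be made concrete with a few lines of code. This is only an illustrative sketch: it assumes a single set of word vectors (practical word2vec implementations keep separate input and output vectors and avoid the full softmax), and the function name `skipgram_softmax` and the toy vocabulary are invented for the example.

```python
import math

def skipgram_softmax(context, center, vectors):
    """p(context | center): softmax over dot products with every vocabulary vector."""
    center_vec = vectors[center]
    scores = {w: sum(a * b for a, b in zip(v, center_vec)) for w, v in vectors.items()}
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    return exps[context] / sum(exps.values())

# Toy 2-dimensional vectors, made up for illustration.
vectors = {
    "chicago": [1.0, 0.5],
    "illinois": [0.9, 0.6],
    "banana": [-1.0, 0.2],
}
p_near = skipgram_softmax("illinois", "chicago", vectors)
p_far = skipgram_softmax("banana", "chicago", vectors)
```

Words whose vectors point in a similar direction to the centre word receive a higher probability, and the probabilities over the whole vocabulary sum to one.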

Embedding Knowledge Graphs into Subspaces.
As previously mentioned, word embedding is a useful unsupervised technique for obtaining initial feature vectors of entity names before supervised training. These initial vector representations can be retrained with the available labelled data. However, knowledge graphs suffer from severe data sparsity. The number of entities in the database is large, whereas the number of high-quality triples that can be obtained for each entity is relatively small and costly to collect. Training with only a small amount of supervised data causes serious overfitting. Furthermore, it is likely that only a subset of the entities appears in the training triples, so the vector representations of the entities missing from the training triples will never be updated. We propose a simple solution to avoid this problem.
We use $W_E \in \mathbb{R}^{e \times v}$ to denote the initial entity vector matrix obtained by skip-gram as described in the previous section. We define the adapted embedding matrix $W_A$ in the subspace by the following factorization:

$$W_A = W_S \cdot W_E,$$

where $W_S \in \mathbb{R}^{s \times e}$ and $s \ll e$. Next, we estimate the parameters of the matrix $W_S$ using the triples (the labelled dataset) in the knowledge graph while keeping $W_E$ fixed. That is to say, we find the best mapping matrix $W_S$ that projects the initial vector matrix $W_E$ into the subspace of dimension $s$. The idea of embedding knowledge graphs into a subspace is based on the following two key principles: (1) By reducing the embedding dimension, the model can better fit the complexity of the knowledge graph task and the amount of available labelled data. (2) Through projection, all the entity vectors are indirectly updated, not only those of entities that appear in training triples.
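The factorization can be illustrated on a toy example. The sketch below assumes that the adapted matrix is the product of the projection matrix and the fixed pretrained matrix, with one column per entity; the helper `matmul`, the sizes, and all numeric values are made up for illustration.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

# Toy sizes: e = 3 (pretrained dimension), v = 4 (entities), s = 2 (subspace).
W_E = [[1.0, 0.0, 2.0, 1.0],
       [0.0, 1.0, 1.0, 3.0],
       [1.0, 1.0, 0.0, 2.0]]   # e x v, one column per entity, kept fixed
W_S = [[0.5, 0.0, 0.5],
       [0.0, 1.0, 0.0]]        # s x e, the only trainable part

W_A = matmul(W_S, W_E)         # s x v adapted embeddings

# Changing W_S (as a gradient step would) moves every column of W_A,
# including the columns of entities that never occur in a training triple.
W_S_updated = [[0.5, 0.1, 0.5],
               [0.0, 1.0, 0.2]]
W_A_updated = matmul(W_S_updated, W_E)
```

Because only the small matrix is trained, a single update changes the representation of every entity at once, which is what makes the approach usable for entities without training triples.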

Jointly Learning Model.
After obtaining the initial vectors of entities, we use a supervised model to jointly learn the projection matrix and the knowledge representation in the subspace, based on the idea of subspace projection. The joint learning model utilizes the structural information of the triples in the knowledge graph. The concept of an embedding subspace can be applied to any structure-based knowledge representation learning model. Because ProjE currently achieves the best performance on the relation reasoning task with relatively few parameters, we combine it with the embedding subspace idea as our supervised training model. This method is hereinafter referred to as sub-ProjE. Let $n_e$, $n_r$, $e$, and $s$ be, respectively, the number of entities, the number of relations, the unsupervised entity vector dimension, and the subspace dimension.
The number of parameters of ProjE is $n_e \times s + n_r \times s + 5s$. The number of parameters of sub-ProjE is $s \times e + n_r \times s + 5s$. Since $e \ll n_e$, the parameter size of sub-ProjE is much smaller than that of ProjE.
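To give a feeling for the size difference, the two parameter counts can be evaluated directly. The entity count, relation count, and pretrained dimension below are the filtered FB15K figures reported in the experiments; the subspace dimension of 100 is an assumption made only for this example, since the text does not state the value used.

```python
def proje_param_count(n_e, n_r, s):
    """Parameters of ProjE: entity vectors, relation vectors, and 5s shared weights."""
    return n_e * s + n_r * s + 5 * s

def sub_proje_param_count(n_r, e, s):
    """Parameters of sub-ProjE: the s x e projection replaces the entity vectors."""
    return s * e + n_r * s + 5 * s

n_e, n_r, e = 13528, 1345, 1000  # filtered FB15K: entities, relations, pretrained dimension
s = 100                          # assumed subspace dimension (not given in the text)

proje_total = proje_param_count(n_e, n_r, s)          # 1,487,800
sub_proje_total = sub_proje_param_count(n_r, e, s)    # 235,000
```

Under these assumptions sub-ProjE trains roughly one sixth as many parameters as ProjE, and the gap widens as the number of entities grows.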
We consider the relation reasoning as an entity ranking problem, which takes a partial triple as input and produces a ranked list of candidate entities as output.
Similarly, we can easily substitute (h, r, ?) for (?, r, t). The key idea of ProjE is as follows: given two input vectors, treat the prediction task as a ranking problem whose optimization target is the overall order of the candidate entities, with the correct entities ranked at the front. To generate this ordered list, we project each candidate vector onto a target vector defined by a combination operator and the two input vectors.
The combination operator is defined as follows:

$$e \oplus r = D_e e + D_r r + b_c,$$

where $e$ and $r$ are the representations of entities and relations in the embedding space, $s$ is the dimension of the embedding space, $D_e$ and $D_r$ are $s \times s$ diagonal matrices that serve as global entity and relation weights, respectively, and $b_c \in \mathbb{R}^{s}$ is the combination bias.
With this combination operator, we can define the vector projection function as follows:

$$h(e, r) = g\left(W^c f(e \oplus r) + b_p\right),$$

where $f$ and $g$ are activation functions, $W^c \in \mathbb{R}^{c \times s}$ is the candidate entity matrix, $b_p$ is the projection bias, and $c$ is the number of candidate entities. $h(e, r)$ is the ranking score vector, in which each element represents the similarity between a candidate entity in $W^c$ and the combined input vector. $W^c$ contains $c$ rows taken from the entity vector matrix $W_E$ (i.e., $W_S \cdot W_E$ in the knowledge graph subspace), so $W^c$ does not introduce any new variables into the model. The model can be regarded as a neural network that contains an entity vector projection layer, a combination layer, and an output projection layer. Figure 1 explains the architecture of this model through an example with the input (?, City Of, Illinois). Given the tail entity Illinois and the relation City Of, our task is to calculate the score of each head entity. For clarity, we show only two candidate entities in Figure 1; in fact, $W^c$ may contain any number of candidate entities.
Compared to a conventional knowledge embedding model, this model has two main differences. First, the input layer is factorized into two components: the initial vector representations obtained in the unsupervised stage, $W_E$, and the projection matrix $W_S$. Second, the size of the subspace into which the initial vectors are projected is much smaller than that of the initial embedding space, with typical reductions of more than one order of magnitude. As in a usual neural network model, all the parameters can be trained with gradient descent methods through backpropagation.
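The forward pass of the model can be sketched in a few lines. The code below is a simplified illustration, not the ProjE implementation: it assumes the combination operator has the form D_e e + D_r r + b_c with the diagonal matrices stored as vectors, uses tanh and softmax as the two activation functions, and takes already-projected (subspace) entity vectors as input. All function names and toy values are invented for the example.

```python
import math

def combine(e, r, d_e, d_r, b_c):
    """Combination operator: D_e e + D_r r + b_c, with diagonal matrices stored as vectors."""
    return [de * ei + dr * ri + bc for ei, ri, de, dr, bc in zip(e, r, d_e, d_r, b_c)]

def softmax(xs):
    top = max(xs)  # subtract the maximum for numerical stability
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [x / total for x in exps]

def rank_scores(e, r, candidates, d_e, d_r, b_c, b_p):
    """Ranking scores: softmax(W_c tanh(combine(e, r)) + b_p) over the candidate rows."""
    hidden = [math.tanh(x) for x in combine(e, r, d_e, d_r, b_c)]
    logits = [sum(w * x for w, x in zip(row, hidden)) + b_p for row in candidates]
    return softmax(logits)

# Toy example with subspace dimension s = 2 and two candidate entities.
e = [1.0, 0.0]                           # input entity vector
r = [0.0, 1.0]                           # relation vector
d_e, d_r = [1.0, 1.0], [1.0, 1.0]        # diagonal weights
b_c, b_p = [0.0, 0.0], 0.0               # combination and projection biases
candidates = [[1.0, 1.0], [-1.0, -1.0]]  # two rows of the candidate matrix
scores = rank_scores(e, r, candidates, d_e, d_r, b_c, b_p)
```

The candidate whose vector is aligned with the combined input receives the higher score, and the scores form a probability distribution over the candidates.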

Ranking Method and Loss Function.
Following ProjE, we construct a binary label vector in which all entities in the negative candidate set $E^-$ have a value of 0 and all entities in the positive candidate set $E^+$ have a value of 1, and then maximize the likelihood between the ranking score vector $h(e, r)$ and the binary label vector. The loss function is defined as follows:

$$\mathcal{L}(e, r, y) = -\sum_{i=1}^{|y|} \frac{\mathbb{1}(y_i = 1)}{\sum_{j} \mathbb{1}(y_j = 1)} \log\left(h(e, r)_i\right),$$

where $e$ and $r$ are the vector representations of the input training sample and $y \in \mathbb{R}^{c}$ is the binary label vector, in which $y_i = 1$ means that candidate entity $i$ is positive. The target probability of a positive candidate is 1 divided by the total number of positive candidates. We use softmax and tanh as $g(\cdot)$ and $f(\cdot)$, respectively, so the ranking score of the $i$th candidate entity is

$$h(e, r)_i = \frac{\exp\left(W^c_{[i,:]} \tanh(e \oplus r) + b_p\right)}{\sum_{j} \exp\left(W^c_{[j,:]} \tanh(e \oplus r) + b_p\right)},$$

where $W^c_{[i,:]}$ represents the $i$th candidate in the candidate entity matrix.
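The loss described above can be sketched as a cross-entropy against a target that spreads probability uniformly over the positive candidates. This is a minimal illustration under that reading of the text; the function name `listwise_loss` and the example scores are assumptions, and the scores are taken to be softmax outputs that already sum to one.

```python
import math

def listwise_loss(scores, labels):
    """Cross-entropy between softmax scores and a target of 1/(number of positives) per positive."""
    n_pos = sum(labels)
    return -sum(math.log(p) / n_pos for p, y in zip(scores, labels) if y == 1)

# One positive candidate: the loss reduces to -log of its score.
loss_single = listwise_loss([0.7, 0.2, 0.1], [1, 0, 0])

# Two positive candidates: each contributes with weight 1/2.
loss_double = listwise_loss([0.5, 0.25, 0.25], [1, 1, 0])
```

Negative candidates do not appear in the sum directly; they influence the loss only through the softmax normalization of the scores.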

Algorithms.
Since the number of candidate entities (i.e., rows of $W^c$) is large, we use candidate sampling to reduce the number of candidate entities in the training phase. Given an entity $e$, a relation $r$, and a binary label vector $y$, we calculate the projections of all positive candidates. For negative candidates, we calculate the projections of only a sampled subset. Negative candidates are sampled according to the binomial distribution $B(1, p_y)$, where $p_y$ is the probability that a negative candidate is sampled and $1 - p_y$ is the probability that it is not. For each negative candidate in $y$, we draw a value from $B(1, p_y)$ to determine whether this candidate is included in the candidate entity matrix $W^c$.
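Candidate sampling can be sketched as one Bernoulli draw per negative candidate. The snippet assumes that p_y is the probability of keeping a negative candidate, as in ProjE; the function name `sample_candidates` and the toy label vector are illustrative only.

```python
import random

def sample_candidates(labels, p_y, rng):
    """Keep every positive candidate; keep each negative candidate with probability p_y."""
    return [i for i, y in enumerate(labels) if y == 1 or rng.random() < p_y]

rng = random.Random(0)                 # fixed seed so the example is repeatable
labels = [1, 0, 0, 1, 0, 0, 0, 0]      # candidates 0 and 3 are positive
kept = sample_candidates(labels, 0.5, rng)
```

With p_y = 0.5, about half of the negative candidates enter the candidate matrix for a given training example, while the positive candidates are always present.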
The complete training process is shown in Algorithm 1. Given the training triples T, we first randomly choose whether to replace the head or the tail entity of each triple to construct the actual training dataset, and then generate positive and negative samples from T according to the sampling strategy. Next, we calculate the loss and update the parameters for each minibatch of the newly generated training dataset. Here, ∘ denotes the Hadamard product and × the matrix product.
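The data construction step of the training process can be sketched as follows. This is a hedged illustration of the description above, not a transcription of Algorithm 1: the function name `build_training_data`, the fair coin used to choose between head and tail, and the toy triples are assumptions made for the example.

```python
import random

def build_training_data(triples, entities, p_y, rng):
    """Turn (h, r, t) triples into (input pair, candidate set) examples."""
    tails, heads = {}, {}
    for h, r, t in triples:
        tails.setdefault((h, r), set()).add(t)   # all correct tails of (h, r, ?)
        heads.setdefault((r, t), set()).add(h)   # all correct heads of (?, r, t)

    tail_examples, head_examples = [], []
    for h, r, t in triples:
        # Each entity is drawn with probability p_y as an extra (mostly negative) candidate.
        sampled = {x for x in entities if rng.random() < p_y}
        if rng.random() < 0.5:   # keep the head, predict the tail
            tail_examples.append(((h, r), tails[(h, r)] | sampled))
        else:                    # keep the tail, predict the head
            head_examples.append(((t, r), heads[(r, t)] | sampled))
    return tail_examples, head_examples

triples = [("a", "r1", "b"), ("a", "r1", "c"), ("b", "r2", "a")]
entities = ["a", "b", "c", "d"]
tail_examples, head_examples = build_training_data(triples, entities, 0.5, random.Random(1))
```

Every candidate set contains all known correct answers for its input pair, so the binary label vector can be derived afterwards by checking membership in the set of correct answers.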

Experiments
We evaluate our model on entity prediction tasks and compare its performance against the original ProjE, using the experimental procedures, datasets, and metrics established in related work. We also give the results of TransE [8] and TransH [17] as implemented by [18]. FB15K [8] is a subset of Freebase [1] that contains 592,213 triples involving 14,951 entities and 1,345 relations. The initial vector of each entity is its entity name vector, pretrained without supervision by word2vec [9] on a large-scale web corpus. The dimension of the initial entity vectors is 1000. To ensure that each entity has a pretrained initial vector representation, we deleted the 1,423 entities without entity name vectors from FB15K and removed the triples involving these entities accordingly. The retained training set includes 364,424 triples, the validation set includes 37,905 triples, and the test set includes 44,565 triples, involving 13,528 entities and 1,345 relations in total.
FB15K-237, introduced by Toutanova and Chen [35], is a subset of FB15K from which inverse relations are removed. As with FB15K, we also remove the entities without entity name vectors.
For the zero-shot scenario, we divide the entities into two groups, training entities (10,000) and test entities (3,528), while ensuring that the training set and validation set contain only triples whose head and tail entities are both in the training entity group and that every triple in the test set has at least one entity in the test entity group. The resulting FB15K dataset for the zero-shot scenario has 201,272 training triples, 20,968 validation triples, and 3,012 test triples. The FB15K-237 dataset for the zero-shot scenario has 117,785 training triples, 12,196 validation triples, and 1,762 test triples. Table 1 shows the statistical properties of the datasets.
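The zero-shot partition rule can be expressed compactly. The sketch below only illustrates the rule stated above (a triple is usable for training or validation only if both of its entities are training entities); the function name `zero_shot_split` and the toy triples are hypothetical and do not reproduce the actual preprocessing.

```python
def zero_shot_split(triples, train_entities):
    """Separate triples whose entities are all seen from triples with an unseen entity."""
    train_entities = set(train_entities)
    seen, unseen = [], []
    for h, r, t in triples:
        if h in train_entities and t in train_entities:
            seen.append((h, r, t))      # usable for training or validation
        else:
            unseen.append((h, r, t))    # zero-shot test triple
    return seen, unseen

triples = [("a", "r", "b"), ("a", "r", "c"), ("c", "r", "d")]
seen, unseen = zero_shot_split(triples, ["a", "b"])
```

Every triple in the second list involves at least one entity that the supervised phase never observes, which is the situation evaluated in the zero-shot experiments.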

Parameter Setting.
In the supervised training phase, we apply the same default settings as ProjE: Adam [36] as the stochastic optimizer with hyperparameters $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$; $L_1$ regularization applied to all parameters during training; and a dropout layer on top of the combination operator to avoid overfitting. The other parameters are set as follows: learning rate $lr = 0.01$, batch size $b = 200$, regularization parameter $\alpha = 10^{-5}$, dropout probability $p_d = 0.5$, and negative sampling probability $p_y = 0.5$.

Evaluation Protocol.
In the entity prediction tasks, we predict the missing head or tail entity of a triple by ranking all the entities in the knowledge graph. Given a test triple (h, r, t), we remove the head or tail entity, replace it in turn with each entity in the knowledge graph, and calculate the ranking score; we then rank the replacement entities in descending order of score and record the rank of the correct entity. Following [8], we use mean rank, HITS@k, filtered mean rank, and filtered HITS@k as our evaluation metrics. Table 2 shows the results of different models on entity prediction tasks trained on FB15K with different training set sizes. We can see from Table 2 that the mean rank of sub-ProjE becomes dramatically better than that of ProjE, TransH, and TransE as the training data decreases. Both sub-ProjE and ProjE outperform TransE and TransH.
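The ranking metrics of the evaluation protocol can be sketched as follows. The code is an illustration, not the evaluation script used in the experiments: the function names are invented, ties are resolved in favour of the target, and the filtered setting is assumed to skip other entities that are known to be correct answers. The entity names and scores are toy values.

```python
def rank_of(scores, target, known_true=()):
    """1-based rank of target when entities are sorted by descending score.

    Entities listed in known_true (other correct answers) are skipped,
    which yields the filtered rank.
    """
    target_score = scores[target]
    return 1 + sum(1 for ent, s in scores.items()
                   if ent != target and ent not in known_true and s > target_score)

def mean_rank(ranks):
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test cases whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Toy scores for the query (?, City Of, Illinois); the values are made up.
scores = {"Chicago": 0.9, "Springfield": 0.8, "Boston": 0.5, "Peoria": 0.3}
raw = rank_of(scores, "Peoria")                                              # 4
filtered = rank_of(scores, "Peoria", known_true={"Chicago", "Springfield"})  # 2
```

The filtered rank is never worse than the raw rank, because correct answers other than the target are not counted as errors.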

Scientific Programming
Because the training data is scarce, the filtered mean rank is similar to the original mean rank. When the training set is very large, ProjE, TransE, and TransH perform better than sub-ProjE, which confirms that sub-ProjE is suited to scenarios with little training data. The HITS@10 of ProjE is higher than that of sub-ProjE. This is because ProjE updates only part of the entity vectors during training. Therefore, ProjE is more accurate on part of the entities, but on the whole the performance of sub-ProjE is superior to that of ProjE. Table 3 shows the results of different models on entity prediction tasks trained on FB15K-237 with different training set sizes. The results are similar to those on FB15K.
In the zero-shot scenario, at least one entity of a test triple is not in the knowledge graph (i.e., the training set). The original ProjE method cannot handle this situation because it cannot generate representations for entities that are not in the knowledge graph. TransE and TransH cannot be applied to the zero-shot scenario either, for the same reason. Our sub-ProjE method can handle this situation because it indirectly updates the representations of the missing entities during training. Table 4 shows the results of the zero-shot experiments on FB15K and FB15K-237. We can see from Table 4 that sub-ProjE can perform entity prediction in the zero-shot scenario even when the training data is very limited. With more training data, the performance of the method improves further.

Algorithm 1
(4) for (h, r, t) ∈ T do /* construct training data using all train triples */
(5)     e ← random(h, t)
(6)     if e == h then /* tail is missing */
(7)         T_h.add([e, r])
(8)         C_h.add({t′ | (h, r, t′) ∈ T} ∪ sample(E, p_y)) /* all positive tails from T and some sampled negative candidates */
(9)     else /* head is missing */
(10)        T_t.add([e, r])
(11)        C_t.add({h′ | (h′, r, t) ∈ T} ∪ sample(E, p_y)) /* all positive heads from T and some sampled negative candidates */
(12)    end if
(13) end for
(14) for each (

To further analyze the stability of the sub-ProjE model, we report the mean rank and HITS@10 over the first 30 iterations on FB15K in the common scenario and the zero-shot scenario, as shown in Figures 2 and 3. We can see from Figure 2 that the larger the training dataset, the faster the sub-ProjE model converges. The performance of the sub-ProjE model becomes stable after the first several iterations when the whole training dataset is used. With less data, sub-ProjE converges more slowly and becomes stable after more than ten iterations. The convergence speed in the zero-shot scenario is the same as in the general scenario. We get similar results on FB15K-237.
This shows that the sub-ProjE model is applicable to the zero-shot scenario.

Conclusions
This paper proposes a new knowledge representation learning method that utilizes unsupervised entity name vectors. The basic idea is to seek a subspace projection of the unsupervised entity vectors for knowledge representation tasks.
This method allows indirect updates of the vectors of entities that do not appear during training and is applicable when only a small amount of labelled data can be obtained. Experiments on Freebase verify the effectiveness of this method. The results show that this simple method surpasses the best existing knowledge representation learning model when training data is scarce and, furthermore, that it can be applied to zero-shot scenarios.
Data Availability
The datasets used in this paper to produce the experimental results are publicly available. FB15k and FB15k-237 can be downloaded from http://openke.thunlp.org. The unsupervised pretrained entity vectors can be downloaded from http://code.google.com/p/word2vec.