A Renovated CNN-Based Model Enhances KGC Task Performance

A knowledge graph (KG) contains a large amount of real-world knowledge and has become an invaluable aid to artificial intelligence applications. Knowledge graph completion (KGC) is the task of completing missing triples in a KG. Our goal in this study is to enhance the performance of KGC tasks with CNN-based models. To do so, we first investigated the effect of adding multiple filters of different shapes to the pioneer model. The marginal improvement led us to seek other approaches. Our second proposed model, termed DP-ConvKB, a deep convolutional-neural-network-based model, outperforms state-of-the-art models on several metrics. Our study provides supporting evidence that incorporating a deep pyramid network structure into KGC models can significantly improve their performance.


Introduction
Since the development of the Semantic Web, the knowledge graph (KG), as a form of knowledge base, has served as a surrogate for real-world information, focusing on describing the relationships between interlinked entities. Based on a graph-structured data model for integrating data, a KG uses triples (head entity, relation, tail entity) to store interlinked descriptions of entities. Thanks to its structured and machine-readable nature, the KG has proliferated across various domains, such as search, analytics, and recommendation. In 2012, the search engine giant Google launched its KG [1], and by May 2020, the Google KG had grown to 500 billion facts about 5 billion entities [2]. However, incompleteness is a problem that exists in all KGs, even massive ones. For example, in Freebase and DBpedia, more than 66% of the person entities are missing a birthplace [3,4]. These missing facts not only affect the KG structure itself but can also lead to misleading inferences. Knowledge graph completion (KGC) research therefore aims to identify missing links under the supervision of the existing KG. The specific scope of KGC includes finding and mining missing entities and relationships [5], link prediction [6], and inferring new facts [7].
The key problem of knowledge graph completion is how to represent and model the combination of entities and relations. Data representation and representation learning are known to play a pivotal role in the development of KGC [8]. The embedding method is a popular data representation technique that converts entities and relations into low-dimensional vectors or matrices. For example, in TransE [9], a candidate triple (h, r, t) represents a valid fact, and the relation r corresponds to a translation between the embeddings of the head entity h and the tail entity t, that is, h + r ≈ t. Several succeeding transition-based models have been proposed on the basic idea of TransE, such as TransH [10], TransD [11], TransR [12], and TransG [13]. Among them, TransH models a relation as a hyperplane, and TransR divides the workspace into an entity space and a relation space when representing relations and entities. In addition, the trilinear dot product has also been applied to compute the score of each triple; DistMult [14] and ComplEx [15] are two typical such linear models.
KGC models can generally be divided into embedding-based models, linear models, and neural network models. Neural networks (NNs) have been widely used in machine learning tasks such as pattern recognition and perception science [16] and have recently gained attention in the field of KGs. By incorporating the convolution operation, convolutional neural networks (CNNs) form a specialized type of NN that has been used heavily in computer vision since its inception, because the structure of CNNs natively matches the composition of images. Since Kim proposed TextCNN [17], CNNs have also begun to receive considerable attention in the field of natural language processing (NLP), for tasks such as sentence classification [18], sentence modeling [17], and search query retrieval [19]. Most of these models adopt a convolution layer similar to TextCNN's to extract features from the embeddings. Inspired by computer vision, Dettmers et al. proposed the first CNN-based model for the KGC task, ConvE [20]. Following ConvE, Nguyen et al. proposed ConvKB [21], in which the triple (h, r, t) is represented by k-dimensional vectors v_h, v_r, v_t, and the input matrix combines these vectors into a k × 3 matrix. In the convolution layer, different filters with the same shape of 1 × 3 are designed to explore the global features among the same dimensional units of the embedding triple (v_h, v_r, v_t).
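The ConvKB scheme just described can be sketched in plain NumPy. This is a simplified illustration under stated assumptions (bias omitted, random toy weights), not the original implementation:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def convkb_score(v_h, v_r, v_t, filters, w):
    """Simplified ConvKB-style scoring.

    The triple embeddings are stacked into a k x 3 matrix; each 1 x 3
    filter slides down the k rows, producing one k-dimensional feature
    map per filter; the concatenated maps are dotted with weights w.
    """
    A = np.stack([v_h, v_r, v_t], axis=1)   # shape (k, 3)
    maps = [relu(A @ f) for f in filters]   # each row A_i dotted with a filter
    return np.concatenate(maps) @ w         # scalar score

k, n_filters = 4, 2
rng = np.random.default_rng(0)
v_h, v_r, v_t = rng.normal(size=(3, k))
filters = rng.normal(size=(n_filters, 3))   # 1 x 3 filters
w = rng.normal(size=k * n_filters)
score = convkb_score(v_h, v_r, v_t, filters, w)
```

The key point is that a 1 × 3 filter sees the h-, r-, and t-embedding values of one dimension at a time, which is what lets it capture the "same dimensional unit" interactions mentioned above.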
Although ConvKB overcomes a limitation of ConvE and obtains better link prediction results than existing models on several benchmark datasets, several problems remain unsolved. For example, limited by the fixed size of the convolution kernel, long-distance interactions between different positions in the embedding vectors cannot be extracted; only shallow information is captured by the single convolutional layer. Multiple methods have been proposed to address this problem, such as CapsE [22] and ConvR [23]. In this paper, we combine a deep pyramid convolutional network structure with the original ConvKB model and design a new model named deep pyramid ConvKB (DP-ConvKB).
We summarize the main contributions of our work as follows.
To improve KGC task performance, we first attempted a tentative model that introduces multiple complex filters into ConvKB, hereafter referred to as multifilter ConvKB (MF-ConvKB). Compared to other existing CNN-based models, including ConvKB, experimental results on two benchmark datasets, FB15k-237 and WN18RR, showed that MF-ConvKB brings only mild improvements on some specific metrics. However, it motivated us to seek other model structures that might overcome this limit. We found that by incorporating a deep pyramid network structure into ConvKB, the newly designed model DP-ConvKB significantly improves KGC performance on several metrics.

Related Work
2.1. Embedding-Based Model. TransE was the first model to use trans-series methods for modeling multirelational data [9]. Inspired by the Word2Vec Skip-gram model, TransE maps the relation in a triple to a translation in the latent feature space, denoted as h + r ≈ t. However, owing to its low parameter complexity, the TransE model only works well on instances with one-to-one relations and fails to deal with complex relations, such as one-to-many, many-to-one, and many-to-many. To enhance the scoring ability and efficiency of previous models, TransH [10] models complex relations as a hyperplane together with a translation operation on it; however, this model still lacks adaptation to multirelational scenarios. TransD [11] and TransR/CTransR [12] represent entities and relations in separate spaces and project each entity with a relation-specific matrix. Unlike TransR, STransE [24] and TranSparse [25] tie the head and tail entities together with their own projection matrices. In addition, PTransE [26] and PTransR [27] take relation path information into consideration when representing the triple (h, r, t). Recently, aiming to expand the embedding space, embedding-based models like RotatE [28] continue to play a big role in the KGC task field. RESCAL [29] is a typical bilinear model that captures the underlying semantics of entity representations, but it is prone to overfitting because of its large number of parameters. DistMult [14] can be considered a special case of RESCAL, in which each relation is represented as a diagonal matrix rather than a full matrix to avoid overfitting. ComplEx [15] and SimplE [30] can be viewed as direct extensions of DistMult: ComplEx applies the complex domain to handle a variety of binary relations, while SimplE learns the subject and object embeddings of the same entity dependently.
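The diagonal trilinear scoring used by DistMult can be written out in a few lines. This is an illustrative sketch with made-up toy vectors, not any library's implementation:

```python
import numpy as np

def distmult_score(v_h, d_r, v_t):
    """DistMult score: a trilinear dot product <h, r, t> in which the
    relation acts as a diagonal matrix, i.e. sum_i h_i * r_i * t_i."""
    return float(np.sum(v_h * d_r * v_t))

h = np.array([1.0, 2.0, 3.0])
r = np.array([0.5, 1.0, -1.0])   # diagonal of the relation matrix
t = np.array([2.0, 1.0, 1.0])
assert distmult_score(h, r, t) == 1.0 + 2.0 - 3.0  # = 0.0
```

Because the product is symmetric in h and t, DistMult cannot distinguish a relation from its inverse, which is one motivation for the complex-valued extension in ComplEx.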
ConvE [20] and ConvKB [21] are both CNN-based models, and ConvE is the first model to apply CNNs to the KGC task. In ConvE, various filters of a fixed shape are operated over an input matrix formed by reshaping and concatenating the head entity and relation embeddings. This design of the input is straightforward but one sided: it ignores the global information among the same dimensional positions of the representation of the whole triple. ConvKB was proposed to deal with this problem, replacing the reshaping operation with a convolution layer over the embedding triple (h, r, t). CapsE [22] applies the capsule network to the KGC task, adding a capsule network layer in place of the convolution layer on the basis of ConvKB. SACN [31] and R-GCN [32] are graph convolutional network- (GCN-) based models, and DOLORES [33] introduces long short-term memory (LSTM) to KGC tasks. In addition, KBGAT [34], GAATs [35], ConvR [23], and CoPER-ConvE [36] are effective methods proposed in recent years. Collectively, incorporating NNs has become the mainstream way to address the incompleteness of knowledge graphs. Table 1 lists the score function and optimization method of each related work.

Wireless Communications and Mobile Computing
Deep CNNs are commonly used to extract deep features from images, and a number of powerful models have emerged, such as LeNet [37], AlexNet [38], ResNet [39], and Google Inception Net [40]. Inspired by ResNet, Johnson and Zhang proposed a low-complexity word-level deep convolutional neural network architecture for text categorization, DPCNN [41]. In this paper, we propose a novel neural network model, DP-ConvKB, which takes advantage of a deep convolution structure to improve the performance of KGC.

The Construction of Proposed Models
As mentioned above, we attempted two different approaches to improve on ConvKB's KGC task performance. In the first model, MF-ConvKB, we incorporated multiple filters of different shapes into ConvKB. In the second model, DP-ConvKB, we utilized a deep pyramid convolutional network. In this section, we present the procedures for constructing these two models.
3.1. MF-ConvKB. An incomplete KG ζ collects facts in the form (h, r, t), with h, t ∈ E and r ∈ R, where E and R denote the sets of entities and relations, respectively. Each triple is represented by a unique embedding group (v_h, v_r, v_t) of k-dimensional vectors, and these vectors are concatenated into a matrix A ∈ ℝ^(k×3), with A_i ∈ ℝ^(1×3) denoting the i-th row of A. A filter ω ∈ ℝ^(h×3) is applied to the window from A_i to A_(i+h−1) to generate a feature c_i, defined as follows:

c_i = g(ω · A_(i:i+h−1) + b),   (1)

where b ∈ ℝ is a bias term and g is a nonlinear function such as the rectified linear unit (ReLU); c = [c_1, c_2, ⋯, c_(k−h+1)] forms the feature map of this filter. Unlike ConvKB, which only uses filters of a single shape, our model involves filters of multiple shapes to generate the transitional features from the pretrained embeddings. As shown in Figure 1, these filters are designed to capture features over long or short distances. We use N = [N_1, N_2, N_3] to denote the numbers of filters of the three shapes 1 × 3, 2 × 3, and 3 × 3, so the original ConvKB can be viewed as a special case of MF-ConvKB with N_2 = N_3 = 0. After convolution, all ∑_(i=1)^3 N_i feature maps are concatenated into a vector V. The score for the triple (h, r, t) is then calculated by a dot product between V and a weight vector w ∈ ℝ^(∑_(i=1)^3 N_i × 1), and the score function f is defined as in

f(h, r, t) = concat(g(A ∗ Ω_1), g(A ∗ Ω_2), g(A ∗ Ω_3)) · w,   (2)

where Ω_i denotes the set of filters of the i-th shape, V_i = g(A ∗ Ω_i) is the feature map generated from the set of filters of the i-th shape, ∗ denotes the convolution operator, and · denotes the dot product.

Table 1: The score function and optimization method of each related work. In these models, the entities and relations are represented by vectors h, r, t; in the CNN-based models, g denotes a nonlinear function, ∗ the convolution operator, · the dot product, and Ω a set of filters.
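The MF-ConvKB scoring described in this subsection can be sketched as follows. This is a simplified, hypothetical NumPy illustration (zero bias, random toy weights), not the trained model:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv_feature_map(A, omega, b=0.0):
    """Slide an h x 3 filter over the k x 3 matrix A, producing
    c = [c_1, ..., c_{k-h+1}] with c_i = g(omega . A_{i:i+h-1} + b)."""
    h, k = omega.shape[0], A.shape[0]
    return relu(np.array([np.sum(A[i:i + h] * omega) + b
                          for i in range(k - h + 1)]))

def mf_convkb_score(v_h, v_r, v_t, filter_sets, w):
    """MF-ConvKB-style score: concatenate the feature maps of filters
    of several shapes (1x3, 2x3, 3x3) and dot the result with w."""
    A = np.stack([v_h, v_r, v_t], axis=1)   # k x 3
    maps = [conv_feature_map(A, omega)
            for omegas in filter_sets for omega in omegas]
    return np.concatenate(maps) @ w

k = 5
rng = np.random.default_rng(1)
v_h, v_r, v_t = rng.normal(size=(3, k))
# N = [2, 1, 1] filters of shapes 1x3, 2x3, 3x3.
filter_sets = [rng.normal(size=(2, 1, 3)),
               rng.normal(size=(1, 2, 3)),
               rng.normal(size=(1, 3, 3))]
# Concatenated feature-map length: 2*k + (k-1) + (k-2).
w = rng.normal(size=2 * k + (k - 1) + (k - 2))
score = mf_convkb_score(v_h, v_r, v_t, filter_sets, w)
```

Note how a taller filter (h > 1) produces a shorter feature map of length k − h + 1, which is why the weight vector must match the total concatenated length.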

3.2. DP-ConvKB.
Compared to the tentative MF-ConvKB model, we implemented a completely different approach to feature extraction in DP-ConvKB. Instead of adding a variety of filters, we took advantage of the deep residual technique [39] and built a module of deep residual learning on top of the pioneer ConvKB model. In this way, features over longer distances can be taken into consideration. DP-ConvKB takes its name from the deep pyramid (DP) CNN [41], whose structure it follows. As shown in Figure 2, the design of DP-ConvKB consists of two modules.

ConvKB Module.
The left module is from the original ConvKB, where the TransE model is used to initialize the input entities and relations. In order to preserve the full potential of the transitional features in the transition-based model, we only apply filters ω ∈ ℝ^(1×3) to exploit the global relationships among the same dimensional entries of the embedding triple. The feature map V_ConvKB is defined as in

V_ConvKB = g([v_h, v_r, v_t] ∗ Ω),   (3)

in which Ω denotes the set of filters and ∗ denotes the convolution operator. The final feature map output by this module is fed into the right module as the initial layer of region embedding.

Deep Residual Learning Module.
In the deep residual learning module, the initial region embedding is followed by two convolutional layers, an activation layer, and a shortcut connection. This module with the shortcut connection can be represented as f(x) + x, where f denotes the skipped layers of feedforward neural networks and x is the shortcut connection, as proposed in ResNet [39]. A similar structure has also been applied in DPCNN for text categorization [41]. This design of the shortcut connection effectively avoids the vanishing gradient problem when training deep NNs. ReLU was chosen for the activation layer, which corresponds to the design of the feature map.
In this paper, we focus on the global information within the same dimension over long distances. Unlike the semantic features spread across a long sentence, in the KGC task the input matrix is composed only of the embeddings of entities and relations, without semantic characteristics between different dimensions. In order to capture features over a relatively long distance, we used a kernel size of 2 for convolution, as shown in Figure 2.
After the first shortcut connection, a circulation block is added, consisting of a pooling layer, two convolution layers, and a shortcut connection, in which the convolution operation is defined as g(x) ∗ Ω + b, where g(x) is the ReLU activation and Ω is a filter initialized from a normal distribution. At the beginning of this submodule, we perform max-pooling with size 2 and stride 2: the pooling layer takes the maximum over 2 contiguous internal vectors, and the 2-stride pooling window halves the size of the input feature map. The outputs of the successive pooling layers, collected and arranged, form a "pyramid," which is where DP takes its name.
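The circulation block just described (max-pooling with size 2 and stride 2, two convolutions, then a shortcut) can be sketched in simplified 1-D form. The padding choice and the fixed toy kernels are illustrative assumptions; the real model operates on stacks of feature maps:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def conv1d_same(x, kernel):
    """Size-2 convolution with one leading zero pad so the output
    length matches the input length (a simplifying assumption)."""
    padded = np.concatenate([[0.0], x])
    return np.array([padded[i] * kernel[0] + padded[i + 1] * kernel[1]
                     for i in range(len(x))])

def maxpool2(x):
    """Max-pooling with size 2 and stride 2: halves the feature map."""
    n = len(x) // 2 * 2
    return x[:n].reshape(-1, 2).max(axis=1)

def pyramid_block(x, k1, k2):
    """One circulation block: pool, two convolutions on the
    pre-activated input, then the shortcut connection f(x) + x."""
    x = maxpool2(x)
    f = conv1d_same(relu(x), k1)
    f = conv1d_same(relu(f), k2)
    return f + x  # shortcut: f(x) + x

x = np.arange(16, dtype=float)
out1 = pyramid_block(x, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
out2 = pyramid_block(out1, np.array([0.5, 0.5]), np.array([0.5, 0.5]))
assert len(out1) == 8 and len(out2) == 4  # sizes shrink like a pyramid
```

Because pooling keeps the filter count fixed while halving the length, stacking blocks lets a size-2 kernel cover exponentially longer spans of the original feature map at roughly constant cost per level.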
For the shortcut connection f(x) + x, both f(x) and x require the same dimensionality so that they can be summed. To avoid extra dimension-matching operations, the DP structure uses the same number of filters, N*, in every convolutional layer, yielding N* feature maps per layer. In DP-ConvKB, we set N* = N, where N is the number of filters used in the ConvKB module (the left module); the region embedding thus has the same value as V_ConvKB. After the circulation blocks, the score function of the model can be written as follows:

f(h, r, t) = V_C · W,   (5)

where V_C is the output of the final convolutional layer and W is a weight vector.

Experiment Setup
4.1. Dataset. We evaluated the MF-ConvKB and DP-ConvKB models on two benchmark datasets, WN18RR and FB15k-237, which are subsets of WN18 and FB15k, respectively. We used the subsets because WN18 and FB15k contain highly redundant relations, as noted in studies [42,43], and these redundant relations can lead to inaccurate test results. One example of a redundant relation is when a triple and its inverse appear in the test set and training set, respectively; this makes the original triple easily retrievable, yielding an overly good result and jeopardizing the model's generalization. WN18RR and FB15k-237 were intentionally designed to remove triples with such inverse relations. Statistics of the two benchmark datasets are given in Table 2.
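The inverse-relation redundancy described above is straightforward to detect programmatically. The following is a hypothetical helper on toy data, counting test triples whose inverse (under any relation) already appears in training:

```python
def inverse_leakage(train, test):
    """Count test triples (h, r, t) whose inverse (t, r', h) appears in
    the training set under some relation r' -- the redundancy that
    WN18RR and FB15k-237 were built to remove."""
    train_pairs = {}
    for h, r, t in train:
        train_pairs.setdefault((t, h), set()).add(r)
    return sum(1 for h, r, t in test if (h, t) in train_pairs)

train = [("paris", "capital_of", "france"),
         ("france", "has_capital", "paris")]
test = [("paris", "capital_of", "france")]  # its inverse is in training
assert inverse_leakage(train, test) == 1
```

A trivial rule-based baseline can exploit such leakage, which is why scores on WN18 and FB15k overstate real model quality.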
The relations in the two datasets can be broadly classified into 4 categories: 1-to-1, 1-to-M, M-to-1, and M-to-M, with M for MANY. Following [9], 1-to-1 denotes that one head can appear with at most one tail entity, and 1-to-M denotes that one head can appear with more than one tail. Similarly, M-to-1 denotes that more than one head can appear with the same tail, and M-to-M is the case in which multiple heads map to multiple tails. We found that 0.9% of the relations in FB15k-237 are of the 1-to-1 type, and the portions of the 1-to-M, M-to-1, and M-to-M types are 6.3%, 20.5%, and 72.3%, respectively. The WN18RR dataset has 11 relations, including also_see, similar_to, verb_group, and derivationally_related_form. Figure 3 shows the percentage of each relation in both the test and training sets.
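The four categories can be derived from the data itself by thresholding the average number of heads per tail and tails per head. The sketch below follows the usual convention from the TransE line of work; the 1.5 threshold is an assumption, not stated in this paper:

```python
from collections import defaultdict

def relation_categories(triples, threshold=1.5):
    """Classify each relation as 1-to-1 / 1-to-M / M-to-1 / M-to-M from
    the average number of heads per tail and tails per head."""
    heads = defaultdict(set)   # (r, t) -> set of heads
    tails = defaultdict(set)   # (r, h) -> set of tails
    for h, r, t in triples:
        heads[(r, t)].add(h)
        tails[(r, h)].add(t)
    cats = {}
    for r in {r for _, r, _ in triples}:
        hpt = [len(v) for (rr, _), v in heads.items() if rr == r]
        tph = [len(v) for (rr, _), v in tails.items() if rr == r]
        many_head = sum(hpt) / len(hpt) > threshold
        many_tail = sum(tph) / len(tph) > threshold
        cats[r] = {(False, False): "1-to-1", (False, True): "1-to-M",
                   (True, False): "M-to-1", (True, True): "M-to-M"}[
                       (many_head, many_tail)]
    return cats

triples = [("a", "born_in", "x"), ("b", "born_in", "x"),
           ("c", "born_in", "x"), ("d", "born_in", "y")]
assert relation_categories(triples)["born_in"] == "M-to-1"
```

Run over a full training split, this reproduces the category percentages reported above.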

4.2. Design of Loss Function.
The loss function to be minimized is as follows:

L = ∑_((h,r,t) ∈ ς ∪ ς′) log(1 + exp(−l_(h,r,t) · f_DP−ConvKB(h, r, t))) + (λ/2)‖w‖₂²,   (6)

where l_(h,r,t) = 1 for (h, r, t) ∈ ς and l_(h,r,t) = −1 for (h, r, t) ∈ ς′. Here, ς is the training set of valid triples, while ς′ is a collection of invalid triples generated by corrupting a valid triple (h, r, t) ∈ ς, replacing its head entity or tail entity with other entities in E. Following the Bernoulli trick [4], the invalid triples (h′, r, t) and (h, r, t′) are generated with probability η_h/(η_h + η_t) and η_t/(η_h + η_t), respectively, where, for a given relation r, η_h denotes the average number of head entities per tail entity and η_t denotes the average number of tail entities per head entity. f_DP−ConvKB is the score function as in (5), and (λ/2)‖w‖₂² is the L2 regularization on w.
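The Bernoulli corruption procedure can be sketched as follows. This is a hypothetical minimal implementation on toy data, illustrating how η_h and η_t bias the choice of which side to corrupt:

```python
import random
from collections import defaultdict

def bernoulli_params(triples):
    """For each relation, compute eta_h (average heads per tail) and
    eta_t (average tails per head), as used by the Bernoulli trick."""
    heads = defaultdict(set)
    tails = defaultdict(set)
    for h, r, t in triples:
        heads[(r, t)].add(h)
        tails[(r, h)].add(t)
    params = {}
    for r in {r for _, r, _ in triples}:
        hpt = [len(v) for (rr, _), v in heads.items() if rr == r]
        tph = [len(v) for (rr, _), v in tails.items() if rr == r]
        params[r] = (sum(hpt) / len(hpt), sum(tph) / len(tph))
    return params

def corrupt(triple, entities, params, rng=random):
    """Replace the head with probability eta_h / (eta_h + eta_t),
    otherwise the tail, biasing corruption toward the side that is
    less likely to accidentally produce a valid triple."""
    h, r, t = triple
    eta_h, eta_t = params[r]
    if rng.random() < eta_h / (eta_h + eta_t):
        return (rng.choice([e for e in entities if e != h]), r, t)
    return (h, r, rng.choice([e for e in entities if e != t]))

triples = [("a", "likes", "x"), ("b", "likes", "x")]
params = bernoulli_params(triples)
neg = corrupt(("a", "likes", "x"), ["a", "b", "x", "y"], params)
assert neg != ("a", "likes", "x")
```

For an M-to-1 relation, η_h is large relative to η_t, so heads are corrupted more often; corrupting the tail of such a relation would too frequently yield a triple that is actually true.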

4.3. Evaluation Protocol. The objective of this study is to predict the missing entity in a triple, i.e., to predict the head entity given (r, t) or the tail entity given (h, r). To evaluate our proposed models' performance, we rank the scores of the candidate entities from the dataset. We employ three evaluation metrics commonly used in previous studies: mean rank (MR), mean reciprocal rank (MRR), and Hits@10. MR denotes the mean rank over all test triples, as calculated in

MR = (1/|T|) ∑_(i=1)^(|T|) rank_i,   (7)

where T is the set of test triples; it evaluates whether all of the ground-truth relevant items selected by the model are ranked high.
MRR is calculated by taking the mean of the reciprocal rank over each query of the test triples, as in

MRR = (1/|T|) ∑_(i=1)^(|T|) 1/rank_i,   (8)

and it is dominated by the single highest-ranked relevant item for each query.
Hits@k is the proportion of ranks lower than or equal to k, with k usually set to 10 in the link prediction task. A lower MR, a higher MRR, or a higher Hits@k indicates a better result. Note that, according to the work of

TransE [9], we removed the corrupted triples that already exist in the datasets when ranking the test triples.
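Given the (filtered) rank of the ground-truth entity for each test query, all three metrics come down to a few lines. A minimal sketch with toy ranks:

```python
def ranking_metrics(ranks, k=10):
    """Compute MR, MRR, and Hits@k from the ranks of the ground-truth
    entities over all test queries:
        MR     = mean(rank_i)
        MRR    = mean(1 / rank_i)
        Hits@k = fraction of queries with rank_i <= k
    Lower MR, and higher MRR / Hits@k, indicate better performance."""
    n = len(ranks)
    mr = sum(ranks) / n
    mrr = sum(1.0 / r for r in ranks) / n
    hits = sum(1 for r in ranks if r <= k) / n
    return mr, mrr, hits

mr, mrr, hits10 = ranking_metrics([1, 2, 10, 50])
assert mr == 15.75
assert abs(mrr - (1 + 0.5 + 0.1 + 0.02) / 4) < 1e-12
assert hits10 == 0.75
```

The example shows why MR and MRR can disagree: the single rank of 50 dominates MR, while MRR is driven almost entirely by the rank-1 query.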

4.4. Implementation Details.
Prior to model training, the entity and relation embeddings were initialized with the embeddings produced by TransE. We applied the stochastic gradient descent (SGD) algorithm with 3000 epochs on WN18RR and FB15k-237 to train the TransE model parameters, and we adopted a grid search on the validation set to optimize the hyperparameters, including the dimensionality of the embedding, the margin hyperparameter, and the SGD learning rate. We fixed the batch size at 256. Our experiments showed that, for MF-ConvKB, the highest Hits@10 was obtained with k = 50, r = 5, and η = 5e-4 on WN18RR and with k = 100, r = 1, and η = 5e-4 on FB15k-237; for DP-ConvKB, these optimized parameters were k = 100, r = 1, and η = 5e-4 on both datasets. We used the Adam optimizer [44] to train both MF-ConvKB and DP-ConvKB. For training MF-ConvKB, we selected the initial learning rate of Adam from η ∈ {1e-3, 1e-4, 5e-5, 5e-6} and set the L2-regularizer λ to 0.001 to avoid overfitting. After convolution, we chose ReLU as the activation function. We designed filters with the shapes 1 × 3, 2 × 3, and 3 × 3, each initialized from a truncated normal distribution. The highest Hits@10 scores were obtained with N = [500, 500, 500], k = 50, and η = 1e-4 on WN18RR and N = [100, 40, 20], k = 100, and η = 5e-6 on FB15k-237. For training DP-ConvKB, we selected the initial learning rate of Adam from η ∈ {1e-5, 1e-6, 1e-7, 5e-8} and set the L2-regularizer λ to 0.001. To obtain the feature map for the deep convolution, we initialized the original filter from a truncated normal distribution and fixed its shape at 1 × 3. In the deep NN part, the convolutional filters were initialized from a normal distribution. The highest Hits@10 scores were obtained with N = 200, k = 100, and η = 5e-8 on WN18RR and N = 100, k = 100, and η = 1e-7 on FB15k-237.
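The tuning procedure above amounts to an exhaustive grid search keeping the configuration with the best validation score. A hypothetical sketch with toy stand-ins for training and evaluation (the `train_fn`/`eval_fn` names and the preference for k = 100 are purely illustrative, not the paper's actual search):

```python
import itertools

def grid_search(train_fn, eval_fn, grid):
    """Train a model for each hyperparameter combination and keep the
    one with the highest validation score (e.g., Hits@10)."""
    best, best_score, best_cfg = None, float("-inf"), None
    for values in itertools.product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        model = train_fn(**cfg)
        s = eval_fn(model)
        if s > best_score:
            best, best_score, best_cfg = model, s, cfg
    return best, best_cfg, best_score

# Toy stand-ins: "training" just records the config, "evaluation"
# prefers k=100 and lr=5e-4 (purely illustrative scoring).
train_fn = lambda k, lr: {"k": k, "lr": lr}
eval_fn = lambda m: (m["k"] == 100) + (m["lr"] == 5e-4)
grid = {"k": [50, 100], "lr": [1e-3, 5e-4]}
_, cfg, score = grid_search(train_fn, eval_fn, grid)
assert cfg == {"k": 100, "lr": 5e-4} and score == 2
```

The cost grows multiplicatively with each hyperparameter axis, which is one reason the later discussion notes that finding good hyperparameters for the deep model is expensive.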
Table 3 shows the link prediction comparison results between our two models and the state-of-the-art models.
As shown in Table 3, the tentative model MF-ConvKB performed only slightly better than the baseline model on all metrics on WN18RR and even obtained lower MRR and Hits@10 scores on FB15k-237, although it achieved a lower MR than the baseline model on both datasets. The shape of the convolutional filter is known to play a pivotal role in feature extraction in fields like image processing [45], speech recognition [46], and sentence classification [47]; however, its effect on the link prediction task is not well understood. To investigate, we embedded convolutional filters of different shapes into MF-ConvKB and tested model performance on link prediction tasks. Results are shown in Table 4: the first row, with a single 1 × 3 filter, represents the baseline model; from the second row on, we replaced the filter with different shapes, and the final row is MF-ConvKB. These results suggest that, in general, replacing the single filter in the baseline model with filters of different shapes does not bring substantial improvements, and that combining multiple different filters can ameliorate performance on MR but not necessarily on MRR and Hits@10. Specifically, MF-ConvKB with a variety of filters performs better on WN18RR but not necessarily on FB15k-237. Taken together, these results indicate that combining filters of different shapes may not be the most efficient strategy for improving link prediction performance. We therefore sought other possible structures, i.e., the deep NN in our proposed model DP-ConvKB. Performance comparisons for all relations on the dataset WN18RR between DP-ConvKB and the baseline model are shown in Figure 5. In general, our proposed model DP-ConvKB outperforms the baseline model on all considered relations on both the Hits@10 and MRR metrics. Strikingly, on the MRR score, DP-ConvKB boosts model performance by more than a factor of two.
In particular, performance on the M-to-M relations verb_group and similar_to reaches the highest MRR scores.
On WN18RR, we also tested the models' performance in predicting heads and tails separately. Figure 6 depicts the performance of DP-ConvKB on the head and tail prediction tasks with the MRR and Hits@10 metrics on WN18RR. The scores for predicting heads and tails are very close on almost every relation category. In comparison, the baseline model shows a relatively large gap between the head and tail prediction tasks, as can be seen in Figure 4. The results in Figure 6 demonstrate DP-ConvKB's ability to minimize the discrepancy between head and tail prediction. Taken together, the overall experimental results show that the deep CNN-based model DP-ConvKB can effectively aggregate global features across long distances and obtains the best performance on both the head prediction and tail prediction tasks.

Conclusion
In this paper, we proposed a CNN-based model, DP-ConvKB, to improve the performance of the knowledge graph completion task. We first showed that simply adding a variety of filters to the pioneer model ConvKB may not be the right direction for enhancing its performance. We then designed a new model, DP-ConvKB, which incorporates a deep pyramid neural network into ConvKB and is therefore capable of exploring deep features. Our results on the datasets WN18RR and FB15k-237 show that DP-ConvKB outperforms the baseline model (ConvKB), obtaining the best mean rank and the highest mean reciprocal rank and Hits@10. In summary, our study demonstrates that implementing such a deep convolutional network structure in models for KGC tasks can significantly improve their performance.
Although DP-ConvKB achieves a great improvement on the link prediction task, it involves more computational complexity, and finding the optimal hyperparameters during training is difficult. In future work, we plan to prune the deep neural networks to minimize training time and to generalize our structure to other NLP tasks.

Data Availability
The data that support the findings of this study are available from the authors. Requests for access to these data should be made to Xueting Wang, xuetingcuc@163.com.