Recommendation Model Based on Semantic Features and a Knowledge Graph

In the field of information science, helping users quickly and accurately find the information they need from a tremendous amount of short texts has become an urgent problem, and recommendation models are an important way to find such information. However, existing recommendation models have limitations when recommending items with short texts. To address these issues, this paper proposes a recommendation model based on semantic features and a knowledge graph. More specifically, we first select DBpedia as the knowledge graph to extend the short text features of items and obtain the semantic features of the items from the extended text. We then calculate the item vectors and further obtain the semantic similarity degrees of the users. Finally, based on the semantic features of the items and the semantic similarity of the users, we apply collaborative filtering technology to calculate prediction ratings. A series of experiments demonstrates the effectiveness of our model on the evaluation metrics of mean absolute error (MAE) and root mean square error (RMSE) compared with several existing recommendation algorithms. The optimal MAE of the proposed model is 0.6723, and the RMSE is 0.8442. These promising results show that the recommendation effect of the model in the movie domain is significantly better than those of the existing algorithms.


Introduction
With the rapid development of the Internet, smart terminals, and digital resources, recommendation systems are becoming more and more important as a tool for users to obtain the information they need from the network. Moreover, since more and more information takes the form of short texts with few features and incomplete information (e.g., microblogs, short messages, movie introductions, and product reviews), accurate recommendation of items based on short texts is a hot topic in recommendation systems.
Unfortunately, the existing recommendation systems have some drawbacks for short text recommendation. For example, the traditional collaborative filtering technology mainly considers user ratings and ignores important semantic factors; LDA-based recommendation is effective in processing texts with complete information, but it has drawbacks in processing short texts; neural network-based recommendation requires a lot of computing resources, and its performance is highly related to the quality of the corpus.
To solve the above problems, we propose a recommendation model based on semantic features and extension using a knowledge graph for the current tremendous amount of items with short text characteristics. We first extend the short texts of items based on a knowledge graph and combine the semantic features of the items according to user ratings, making the recommendation result more semantically accurate.
The main contributions of this paper are summarized as follows: (i) We propose an improved short text feature extension method based on a knowledge graph, which can extend the semantic information of an item. We first select DBpedia as the knowledge graph, use DBpedia Spotlight to identify the named entities in the short texts of items, and then obtain the extension words from the DBpedia resource pages of the identified entities. (ii) We propose a recommendation model based on semantic features and a knowledge graph. We first extend the short texts of items based on the knowledge graph, then combine the user semantic similarity and the item semantic similarity according to the semantic features, and further calculate the prediction ratings of items.
The remainder of this paper is organized as follows. Section 2 summarizes the related work on recommendation systems. Section 3 presents the proposed method of short text feature extension based on DBpedia. Section 4 presents the proposed recommendation model. In Section 5, experiments are conducted to verify the effectiveness of the proposed model. Finally, Section 6 presents the conclusions of this paper and future work.

Related Research
The research of the recommendation system mainly includes the following categories: (1) the traditional collaborative filtering recommendation algorithms, (2) the recommendation algorithms based on deep learning, and (3) the recommendation algorithms based on the content of the item.
First, traditional collaborative filtering recommendation algorithms include user-based collaborative filtering algorithms [1][2][3][4][5] and item-based collaborative filtering algorithms [6][7][8][9]. The idea of the user-based collaborative filtering algorithm is that, when a user needs recommendations, we first find other users who are similar to him and then recommend those users' preferred items to him. The item-based collaborative filtering algorithm instead finds items that are similar to the user's preferred items and recommends them. Because these two collaborative filtering algorithms have deficiencies, such as bias toward highly popular items and the cold start problem, scholars have proposed several recommendation algorithms based on hybrid strategies [10][11][12]. However, traditional recommendation algorithms mainly consider aspects such as ratings and user features and often ignore important semantic features.
Secondly, with the successful application of deep learning in the fields of computer vision and natural language processing, many scholars have introduced deep learning into the recommendation system, using deep feature extraction to improve recommendation performance.
Georgiev proposed using RBM to extract potential features of user preferences or item ratings in the recommendation field [13]. In addition, DBN [14] has been used to extract hidden, useful features from audio content for content-based and hybrid music recommendation. Deep learning can effectively extract the features of users, items, and other content through deep structures and thus enhance recommendation capability. However, implementing a deep learning model requires a lot of computing resources, and its recommendation effect is closely related to the corpus, which brings the challenge of constructing an ideal corpus. Moreover, due to the lack of information in items with short text characteristics, the effect of directly applying deep learning methods is not ideal.
Thirdly, some scholars consider recommendation based on the content of the item, which is a continuation and development of collaborative filtering technology. For example, An et al. proposed content-based personalized recommendation of popular microtopics [15]. Some researchers have introduced the LDA model: Zhao et al. proposed a Twitter-LDA model suitable for short texts and gave the topic construction idea of "single microblog, single topic" [16]. Ben-Lhachemi et al. proposed using semantic embedding representations of tweets to help users select relevant tweet topics for their posts in real time, capturing the semantic similarity or relevance between tweets so as to realize the recommendation of tweet topics [17]. However, because the short texts of items contain few words, the extracted topics are of poor quality and the problem of topic incompleteness occurs during topic modeling; therefore, the effect of using topics to extend the features of short texts is not ideal either.
Therefore, for items with short text characteristics, since the short text contains little information or even lacks it, directly using the above methods leads to problems such as an unsatisfactory recommendation effect or an excessively complicated implementation. Obviously, if we want to obtain more accurate recommendation results, we must first effectively extend the items with short text characteristics.

The Short Text Feature Extension Method Based on DBpedia
There are two main existing text feature extension approaches applied to short texts: one is based on external documents and search engines [18,19], and the other is based on an external knowledge base [20][21][22]. The first approach yields a higher correlation between the original feature words and the extended features, but it is difficult to implement and time-consuming, and it depends to a certain extent on the quality of the results returned by the search engine. The second approach can alleviate the data sparsity problem of short texts, but the extension effect depends on the quality of the external knowledge base, and the amount of calculation is large and time-consuming. Therefore, considering that the short texts of items often contain entities and that entities have rich meanings, extending from entities is a simple and efficient choice. The DBpedia knowledge graph is chosen as the external knowledge source. On the basis of the extension method of Li [23], we propose an improved short text extension method to extend the semantic information of the short texts of items and then integrate the semantic features of the items when recommending.
The extension process of the short texts of items based on DBpedia is shown in Figure 1, and it mainly includes the following tasks. First, we use DBpedia Spotlight [24] to identify entities in the short texts of items and express them as DBpedia entities (these contain noise and need to be filtered), so as to obtain the source feature entity set. Secondly, we further extend the previously obtained entities based on DBpedia: according to the entity resource pages of DBpedia, we take the values of the Type attribute of these feature entities as the extended entities of the short texts and thus obtain the extended word set of the short texts.
One task is described in Algorithm 1: first, we use DBpedia Spotlight, an open-source named entity recognition system, to identify the named entities in the short texts, generate a set of candidate entities, and further disambiguate the candidates. Specifically, by calculating the labeling probability of named entities in Wikipedia [25], candidates whose probability is lower than a certain threshold are deleted. In the disambiguation processing, the method proposed by Han and Sun [26] is used to disambiguate the entities.
Algorithm 1: Obtain the feature set of items.
Input: short texts of items
Output: S, the source feature entity set of items
1: S = ∅
2: Use DBpedia Spotlight to annotate the named entity words in the short texts, calculate their tagging probability values, and obtain the set W of named entity words according to a given threshold, W = {w1, w2, ⋯, wn}.
3: for i = 1 to n
4: According to the Wikipedia page of wi, obtain the candidate entity set E(wi) = {ei1, ei2, ⋯, eik} for wi.
5: Calculate the probability of each candidate entity in E(wi) being marked as an entity under the current context conditions, and add the entity with the largest probability to S.

In Algorithm 1, we use DBpedia Spotlight to annotate the named entities in the short texts. The threshold of the tagging probability can be set according to one's own needs; following the literature [25], we set it to 0.45.
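As a minimal sketch of Algorithm 1 (the function name is ours, and precomputed tagging probabilities and per-candidate context scores stand in for DBpedia Spotlight and the disambiguation model of Han and Sun):

```python
def source_feature_entities(tag_prob, candidates, threshold=0.45):
    """Sketch of Algorithm 1.

    tag_prob:   {word: Wikipedia tagging probability} (step 2)
    candidates: {word: {candidate entity: context score}} (steps 4-5)
    Returns the source feature entity set S.
    """
    S = set()
    for word, prob in tag_prob.items():
        if prob < threshold:
            continue  # step 2: drop low-probability mentions
        cands = candidates.get(word, {})
        if cands:
            # step 5: keep the candidate with the largest context probability
            S.add(max(cands, key=cands.get))
    return S
```

With a toy input such as `{"matrix": 0.9, "the": 0.1}` and one candidate dictionary per retained word, only the highest-scoring candidate of "matrix" survives.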
The other task is described in Algorithm 2: using the entity resource pages in DBpedia, we take the Type attribute values of each entity in its resource page as candidate extension words for the short text; then, the union of the source feature set and the extended word set is taken as the final feature set.
Flisar and Podgorelec [27] and others studied the Type, Topic, and Category attributes of the DBpedia knowledge graph entity resource page and believed that the information of the Type attribute is the most effective for the text extension of the entity. In order to avoid introducing ambiguous information, we only consider the Type attribute. This information is obtained from the Infobox of the Wikipedia page while constructing DBpedia. It is closely related to the entity and can achieve high-quality semantic extension.
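The Type-attribute lookup can be sketched as a SPARQL query against the public DBpedia endpoint (a sketch only: the helper name is ours, and we assume the Type attribute corresponds to `rdf:type` on the resource page; executing the query and filtering its results is left out):

```python
def build_type_query(entity_name: str) -> str:
    """Build a SPARQL query for the rdf:type values of a DBpedia resource."""
    resource = "http://dbpedia.org/resource/" + entity_name.replace(" ", "_")
    return (
        "SELECT DISTINCT ?type WHERE { "
        f"<{resource}> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> ?type }}"
    )
```

The returned query string can then be sent to https://dbpedia.org/sparql to retrieve candidate extension entities.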
When choosing an extended entity, we generally make the choice by comparing the semantic similarity between entities. There are many ways to measure the semantic similarity of entities; for example, the cosine similarity of the embedding vectors of the entity words can be used, but the accuracy of the embedding vectors depends on the quality of the corpus and the training. Considering that the extended entities are derived from the Type attribute of the source feature entities and that these entities sit at different levels of the taxonomy structure of the knowledge graph, it is very suitable to measure the similarity between them through the distance of their semantic path and their depth. We adopt a simple and efficient method [28] and calculate the semantic similarity according to equation (1), where Dist(A, B) represents the number of edges on the shortest path between entity A and entity B and max_depth represents the maximum depth of the taxonomy structure.
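Since equation (1) is not reproduced here, the sketch below uses an assumed path-and-depth form consistent with the description: similarity is 1 for identical entities and decreases linearly with the shortest-path length, normalized by twice the maximum depth of the taxonomy.

```python
def path_similarity(dist: int, max_depth: int) -> float:
    """Assumed path/depth similarity measure (not necessarily the paper's equation (1)).

    dist:      number of edges on the shortest path between two entities
    max_depth: maximum depth of the taxonomy structure
    """
    return max(0.0, 1.0 - dist / (2.0 * max_depth))
```

Any measure with these properties (maximal at distance 0, monotonically decreasing in path length) would fit the selection procedure described above.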


Recommendation Model Based on Semantic Features and DBpedia
From Sections 1 and 2, we know that most current recommendation algorithms ignore important semantic factors. Therefore, we consider semantic features in the recommendation model proposed in this paper. As more and more items (such as microblogs, short messages, movie introductions, and product reviews) have short text characteristics, their short texts cannot simply be handled by traditional text processing methods and usually need to be extended first. But how to extend them is also a difficult problem.
In this paper, we use the distance of the semantic path and the depth to measure the similarity between entities, give an improved DBpedia-based method for extending the short texts of items, and then propose a recommendation model based on semantic features (see Figure 2).

Construct User-Item Rating Matrix.
According to the user's rating of the item, a user-item rating matrix can be constructed.
Definition 1. Assume that in the dataset the user set is U = {u1, u2, ⋯, um} (m users) and the item set is I = {I1, I2, ⋯, In} (n items); then, the user-item rating matrix R is defined as a matrix with m rows and n columns, in which each row represents a user, each column corresponds to an item, and each element represents the rating given by the corresponding user to the corresponding item.

User Semantic Similarity Calculation.
The short texts of items are extended based on DBpedia after data cleaning, word splitting, stop word removal, stemming, and other preprocessing; then, we use the word2vec [29] toolkit to process the extended word set, so as to obtain an embedding of a given dimension for each word. Further, we calculate the vector of the short text of each item; for simplicity, the average of the word embeddings is used to represent the vector of the item [30].
Definition 2. Given an item q, let the word set of its short text after the above extension be T = {w1, w2, ⋯, wk}, and for each wi ∈ T let wi* (i = 1, ⋯, k) be the word embedding of a given dimension obtained by word2vec. Then the item vector of q is defined as the average of the word embeddings: q* = (1/k) Σ wi* (i = 1, ⋯, k).

Definition 3. Assume that the first s items whose ratings by user u are greater than a given value are q1, q2, ⋯, qs and that the vector of qi is qi* (i = 1, ⋯, s); then the representation vector of u is defined as the average of these item vectors: u* = (1/s) Σ qi* (i = 1, ⋯, s).

Definition 4. Assume that the representation vectors of users u and v are u* and v*, respectively; then the semantic similarity between u and v is defined as Sim_Sem(u, v) = cos(u*, v*).
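Definitions 2, 3, and 4 can be sketched in plain Python (no word2vec here: the embeddings are assumed to be precomputed lists of floats, and the helper names are ours):

```python
import math

def average_vector(vectors):
    """Average a list of equal-length vectors (Definitions 2 and 3)."""
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(len(vectors[0]))]

def cosine(u, v):
    """Cosine similarity between two vectors (Definition 4)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

An item vector is `average_vector` over its word embeddings; a user vector is `average_vector` over the item vectors of his highly rated items; Sim_Sem(u, v) is then `cosine` of the two user vectors.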

Predict the Ratings of Items.
In the model proposed in this paper, because some users may tend to give high or low ratings to all items, we incorporate the relative difference of ratings into the prediction to make the results more reasonable. Therefore, we use a strategy based on the average of user ratings to calculate the prediction ratings of items.

Algorithm 3: Recommendation algorithm based on semantic features and DBpedia.
Input: the user u to be recommended, short texts of items
Output: the prediction rating of item i by user u
1: For each item i (i ∈ I), extend the short texts of the item based on DBpedia and get the extension feature set W(i).
2: Calculate the item vector of i according to W(i).
3: Construct the user-item rating matrix for the dataset.
4: Construct the user semantic similarity matrix.
5: Select the set KNN(u) composed of the first K users most similar to u.
6: According to formula (5), calculate the prediction rating p_ui of i by u.

Specifically, according to the user-item rating matrix and the semantic features of the short texts of items, we integrate the user's semantic similarity to calculate user u's prediction rating p_ui for item i:

p_ui = α p′_ui + β p″_ui, (5)

where α and β are weight coefficients and α + β = 1. In equation (6), which defines p′_ui, Sim_Sem(u, v) is the semantic similarity, KNN(u) represents the set of the K nearest neighbors of u (calculated by semantic similarity), r̄_u represents the average of the ratings given by u, r_vi represents the rating of i by v (v is a neighbor of u), and r̄_v represents the average of the ratings given by v. In equation (7), which defines p″_ui, KNN(i) represents the set of the K nearest neighbors of i.
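The prediction step can be sketched as follows. Since equations (6) and (7) are not reproduced above, the mean-centered user-based term below is the standard collaborative filtering form assumed from the symbol descriptions, and the function names are ours:

```python
def user_based_prediction(u_mean, neighbors):
    """Assumed mean-centered user-based CF term (standard form of equation (6)).

    u_mean:    average rating of user u
    neighbors: list of (similarity, neighbor's rating of i, neighbor's mean rating)
    """
    num = sum(sim * (r - mean) for sim, r, mean in neighbors)
    den = sum(abs(sim) for sim, _, _ in neighbors)
    return u_mean + (num / den if den else 0.0)

def predict_rating(p_user, p_item, alpha=0.4, beta=0.6):
    """Blend the user-based and item-based terms per formula (5); alpha + beta = 1."""
    assert abs(alpha + beta - 1.0) < 1e-9
    return alpha * p_user + beta * p_item
```

For example, a user with mean rating 3.0 and a single perfectly similar neighbor who rated item i one point above his own mean gets a user-based term of 4.0.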

Experiment and Result Analysis
The experiment in this paper uses the MovieLens1M dataset published by UCI. MovieLens was developed by the GroupLens project team of the University of Minnesota. It contains 6040 users' ratings of 3952 movies, with a total of 1,000,209 ratings. Each user has rated at least 20 movies (rating values are integers between 0 and 5, where 0 means the user did not rate the movie), and the dataset also provides auxiliary information such as the user's occupation, movie category, and movie duration. The sparsity of this dataset is 95.80%. The movie data (movies.dat) includes fields such as MovieID, Title, and Genres, which represent the movie ID, movie name, and movie style, respectively. In this experiment, the data in the Title and Genres fields of the movie data is taken as the short text of the corresponding movie item. In addition, the user data (user.dat) is randomly divided by user ID into a 90% training set and a 10% test set.
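Assembling the short text of each movie from the Title and Genres fields can be sketched as follows (the MovieLens 1M distribution stores "::"-separated fields in movies.dat; the function name is ours):

```python
def parse_movies(lines):
    """Parse MovieLens 1M movies.dat lines (MovieID::Title::Genres) into short texts."""
    items = {}
    for line in lines:
        movie_id, title, genres = line.strip().split("::")
        # Concatenate title and the '|'-separated genre list into one short text.
        items[int(movie_id)] = title + " " + genres.replace("|", " ")
    return items
```

The resulting short texts are then fed into the DBpedia-based extension of Section 3.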
We use Gensim as the word embedding toolkit in the experiment; it is an open-source third-party Python toolkit. It supports a variety of model algorithms, including word2vec and streaming training, and provides APIs for common tasks such as similarity calculation and information retrieval.
The hardware environment used in the experiments is as follows: the CPU is an Intel® Core™ i7-8700, with 16 GB of DDR4 SDRAM, a 4 TB hard drive, and a 128 GB solid-state disk. The software environment is the Windows 10 operating system with Python 3.8 and the Gensim development platform.

Evaluation Metrics.
Since the MovieLens1M dataset contains the user's rating for each item, we can train and evaluate the prediction ratings in the experiment.
We use mean absolute error (MAE) and root mean square error (RMSE), which are widely used in recommendation systems, as the evaluation metrics of the experiment. These metrics measure the error between the user's actual rating and the prediction rating: the smaller the MAE and RMSE, the smaller the error between the prediction rating and the actual rating, and the higher the prediction accuracy of the algorithm. For a user u to be recommended, let T be the set of items in the dataset for which users have rating behavior and |T| be the size of T. For each i ∈ T, let r_ui denote u's actual rating of i and p_ui the prediction rating obtained by the proposed model. MAE and RMSE are defined as follows:

MAE = (1/|T|) Σ_{i∈T} |r_ui − p_ui|,

RMSE = √((1/|T|) Σ_{i∈T} (r_ui − p_ui)²).
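These two metrics translate directly into Python (function names are ours):

```python
import math

def mae(actual, predicted):
    """Mean absolute error over paired actual and predicted ratings."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    """Root mean square error over paired actual and predicted ratings."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))
```

Because RMSE squares each error before averaging, it penalizes large individual prediction errors more heavily than MAE does.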

Coefficient Determination.
The recommendation model based on semantic features and extension using DBpedia includes the weight coefficients α and β, which appear in equation (5); the values of these parameters affect the quality of the recommendation result.
The prediction ratings of items combine the users' semantic features. Therefore, we select several typical weight combinations (as listed in Table 1), set the number of nearest neighbors to 20, and calculate the MAE values for the different coefficient weights. The results are shown in Figure 3. The recommendation effect of the proposed model is best when the weights are set to the fifth case (that is, α = 0.4, β = 0.6).
We also compare the recommendation effect of the proposed model with that of some other models. Considering that some of these models are not tested against the number of nearest neighbors, the best experimental data of the various models are selected. As shown in Figure 4, the MAE values of the method proposed in this paper (SF_EU_DBpedia) are always much better than those of the traditional collaborative filtering recommendation technology (CF) at different numbers of nearest neighbors, and so are the RMSE values in Figure 5.

Experiment 2.
In this experiment, we compare the method proposed in this paper (SF_EU_DBpedia) with the traditional collaborative filtering recommendation technology (CF), collaborative filtering recommendation integrating the user-centric natural nearest neighbor (CF3N) [5], enhanced multistage user-based collaborative filtering through nonlinear similarity (EMUCF) [12], and a deep neural network-based recommendation algorithm (DNN) [31]. Since EMUCF and DNN do not run experiments against the number of nearest neighbors, we select the best experimental results of the various algorithms for comparison. The experimental results are shown in Figures 6 and 7. In Figure 6, the optimal MAE obtained by EMUCF is about 0.7211, while that of the method proposed in this paper is 0.6723. In Figure 7, the optimal RMSE obtained by the DNN is about 0.9631, while that of the method proposed in this paper is 0.8442. The experimental results show that the method proposed in this paper is superior to the abovementioned algorithms on both MAE and RMSE.

Conclusion
We propose a recommendation model based on semantic features and a knowledge graph: we first select DBpedia as the knowledge graph to extend the short text features of items and then integrate the semantic features to calculate the prediction rating for the user to be recommended. Experimental results show that the proposed model works well.
As future work, we plan to calculate the semantic vectors that characterize users for items of different categories, so as to further improve the general applicability of the model. In addition, we can consider more factors when selecting nearest neighbors, such as user semantic similarity and user characteristics; thus, we can choose suitable factors to achieve more accurate personalized recommendations according to the categories of items.

Data Availability
The data included in this paper are available without any restriction.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.