Paper Recommendation via Correlation Pattern Mining and Attention Mechanism



Introduction
Information overload has become a serious problem for scientific researchers. By using the user's profile and metadata from papers, a recommendation system can discover and recommend scientific articles that researchers have not yet noticed. It thus provides a new way for researchers to find interesting articles beyond simple reference tracking or keyword searching. Various types of recommendation systems have been deployed in digital libraries and academic search engines [1] (e.g., Springer Nature and Elsevier both operate their own recommendation systems). Personalized scholarly resource recommendation systems, in particular, are expected to alleviate information overload [1][2][3] and have emerged as a new research hotspot.
As shown in Figure 1, an academic paper recommender system is a context-sensitive recommender system [4]. It is conventional to sort articles by popularity based on the user's search keywords or on keywords from the papers in the user's profile. Many different methods have been proposed to improve the effectiveness of current research paper recommenders. Nascimento et al. [5] offered a method that takes a scholarly publication as input and ranks a pool of candidate papers. Because it can only return research papers that closely match the keywords, this strategy is prone to the same issues that keyword-based search encounters.
Global popularity [6] can reflect the influence of a paper and is widely used in recommendation systems as important prior information, but we believe that combining global popularity with personalized research interests is still an open problem [7]. For example, a researcher who is interested in recommender systems and is also looking for novel mathematical tools may want the recommender system to support tasks such as "recommending popular papers with mathematical tools that could potentially be applied to recommender systems." Obviously, neither keyword-search-based algorithms nor algorithms based purely on global popularity can meet such personalized popularity requirements.
The most popular recommendation systems are based on collaborative filtering (CF), which evaluates items through the opinions of other users. Among existing CF algorithms, matrix factorization (MF) models play an important role. However, since real-world datasets typically obey a power-law distribution, MF models usually suffer from overfitting. According to studies by Ganguly and Pudi [8] and Mnih and Salakhutdinov [9], a feasible solution to this problem is to regularize the latent feature vectors in MF models with appropriate priors. Some efforts improve rating prediction by introducing citation information through graph algorithms: Sugiyama and Kan [10] proposed using latent citation-augmented data, and Liang et al. [11] defined local and global citation relevance to enhance rating prediction. Other works introduce article topic information; for example, Wang and Blei [12] and Pan and Li [13] regularize the item factor toward the item topic distribution to improve rating prediction. Jelodar et al. [14] used topic trends and the relationship between topics and scholars' context documents. Wang et al. [15] proposed a socially regularized regression model that introduces prior themes. Pan and Li [13] created a modified item-based recommendation approach that analyzes the contents of research publications through theme similarity assessment. This technique addresses the cold-start issue, but because it does not consider the reputation of the recommended articles, the suggestions may be less credible even when their themes are relevant.
Collaborative filtering applied to paper recommendation also faces severe data sparsity and cold-start problems. Statistics from DBLP (a computer science bibliography, https://dblp.org/) show that the average number of downloads per paper is 1.6. It is difficult for MF to make accurate predictions for users with only a few interactions, which is a well-known problem for most existing recommender systems. In MF models, the latent feature vectors of users with only a few interactions stay close to the prior mean [16], and therefore the predicted ratings for users with few ratings can be heavily influenced by other users. Even if a citation network is introduced, the same undertraining of vectors occurs because new papers are cited only rarely [17]. As a result, such a system will provide inappropriate recommendations, which is typical for scientific article recommendation systems.
We propose a context-aware probabilistic MF model for paper recommendation that alleviates the data sparsity problem while exploiting popularity information. The key idea of our method is to fuse multivariate information by the multiplicative rule, which effectively combines the similarity of citation networks and texts and can alleviate the data sparsity problem. For cold-start papers, we add second-order neighbor nodes to make up for the fact that newly added papers do not receive enough training. To figure out how popular a paper is for a specific person, we propose a keyword attention mechanism that takes into account both user preferences and global popularity. We conduct experiments on the CiteULike dataset to evaluate the performance of the proposed model.
Our contributions in this paper are summarized as follows: (1) We propose a multiplicative rule that combines topic-related scores, citation-related scores, and keyword popularity scores in a probabilistic matrix factorization model to compensate for the lack of data. (2) We measure the citation relevance of papers by the distance between their embedding vectors in the paper citation network, and we introduce second-order neighbors to address the undertraining of new papers. We also propose a hard attention mechanism that builds personalized popularity by combining users' keyword preferences with the global popularity of papers. (3) We conduct extensive experiments on the CiteULike dataset, and the experimental results show that our method outperforms other state-of-the-art academic resource recommendation methods.
The remainder of this paper is organized as follows: Section 2 introduces related work on academic resource recommendation, Section 3 presents our recommendation algorithm, Section 4 describes the experimental design and performance comparison that verify the effectiveness of the proposed method, and Section 5 concludes the paper.

Related Work
In the current era of academic big data, information overload has become a serious problem. Academic search engines rely on keyword search, which is not enough to solve this problem, so personalized recommendation systems for academic resources are expected [2,3]. A system for recommending academic papers is context sensitive. Depending on the means of implementation, current approaches to paper recommendation can be classified into three categories: content-based recommendation [18], collaborative filtering-based recommendation [19], and hybrid recommendation [20]; we also provide an overview of how popularity is used in recommender systems. Content-based recommender systems require representation learning and similarity computation for text, commonly based on bag-of-words models, topic models, and word vector techniques. For example, term frequency-inverse document frequency (TF-IDF) [21] maps articles into vectors by word frequency weighted by inverse document frequency. Topic models such as the widely used latent Dirichlet allocation (LDA) [22] compute the distribution vectors of article topics via generative models. Deep learning methods such as doc2vec [23] and paper2vec [8] use neural network techniques to generate article representation vectors. Based on these representations, the system ranks papers from highest to lowest by computing the similarity of the vectors.
The content-based approach is easy to implement because the representation of the text can be computed offline. Such methods, however, lack variety and novelty in their results [3] and ignore the popularity of papers.

Collaborative Filtering-Based Paper Recommendation.
Collaborative filtering [24] is one of the most successful general recommendation methods. It makes recommendations by calculating the similarity of interaction records and does not depend on content information. Compared with content-based methods, collaborative filtering-based recommendation can produce more novel and enlightening results. Collaborative filtering has achieved great success in traditional recommendation but encounters difficulties in paper recommendation, where data are extremely sparse. A temporal analysis of indexed data from CiteULike by Bogers and Avd [25] shows that it takes about 2 years for the cold-start problem to disappear and for recommendation performance to improve. Yet a large number of papers are constantly being added to the system, and these papers are again in cold start.
Due to the aforementioned problems, most work on collaborative filtering targets sparse data. Wu et al. [24] compared the performance of collaborative filtering and information-rich recommendation. Sugiyama and Kan [10] proposed an adaptive neighbor selection method to alleviate data sparsity. Sakib et al. [26] used citation relationships of secondary papers to find hidden associations between papers. Liu et al. [27] clustered users based on their preferences, and recommendations for a user were based on the ratings of other users in nearby clusters. However, due to the continuous arrival of large numbers of new papers, it is currently not feasible to use collaborative filtering alone in a paper recommendation system.

Hybrid Method-Based Paper Recommendation.
Hybrid approaches can combine user behavior records and contextual information to alleviate the cold-start problem [28][29][30], and they are the main research direction for paper recommendation systems.
To alleviate the problem of sparse data, introducing multiple sources of prior information is an important idea. Sugiyama and Kan [31] investigated which parts of an article can be used to represent a paper. Sun et al. [32] used implicit and explicit social relations to find similar users, including social relations, behavioral relations, and profile similarity. Winoto et al. [33] investigated introducing multiple kinds of contextual information and the personalized preferences of different types of readers. Wang et al. [34] integrated social relationships and preferences into a standard collaborative filtering model.
VOPRec [35] has some similarity to our work in that it uses a graph representation learning approach that combines information about the textual content and the structural similarity of papers in the citation network. However, VOPRec does not take the popularity of papers into account, so it cannot clearly show the migration of research hotspots.

Popularity in Recommendation Systems.
Recommender systems try to forecast user preferences so they can show each user items she would like. Different models have been put forth, ranging from nonpersonalized approaches such as "Most Popular" (MostPop) [6] to solutions tailored to the individual. MostPop, possibly the simplest recommendation algorithm, suggests highly popular items. Because it is easy to implement and nonpersonalized, MostPop is frequently used as a benchmark to provide a reference performance for a recommender system.
Popularity is a crucial piece of contextual data in text recommendation algorithms. Examples of popularity-aware recommendation include news recommendation [36], book recommendation [37], and tag recommendation [38][39][40]. Some efforts have also started to include popularity as crucial contextual data in paper recommendation. To customize the ranking of papers, Liu et al. [41] combine keyword search with undirected citation graphs. Ng [42] takes into consideration title- and abstract-based content similarity metrics, peer researchers' reviews, author reputations, and the popularity of each candidate article C. The popularity of paper C is calculated using the PageRank [43] approach, and the ranking of C is determined using a Borda count voting scheme.
We point out that the widely used MostPop baseline merely ranks items according to the volume of interactions in the training data. Due to the large number of papers without interactions, we argue that current popularity measures (i) are not directly applicable to paper recommender systems and (ii) concentrate on ranking articles by popularity under a single keyword, rarely considering ranking based on multiple keywords.

Context-Aware Probabilistic Matrix Factorization

Definition 2. (Popularity of unread article). We define a formula that smoothly fits the popularity of uncited papers as well as of heavily cited papers.

Definition 3. (Citation networks).
A citation network is a citation relationship between papers, which can be represented by the adjacency matrix of a graph with papers as nodes.
Definition 4. (Neighbors). In a citation network, for a given paper, a paper reachable in one hop is its first-order neighbor, a paper reachable in two hops is its second-order neighbor, and so on.
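The hop-based neighbor definition above can be sketched as a small breadth-first expansion. The toy graph and the function name are hypothetical, for illustration only.

```python
# Hypothetical toy citation graph: adjacency list, edges point from citing to cited paper.
toy_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": [],
}

def neighbors_at_hop(graph, start, hops):
    """Return the set of nodes first reached in exactly `hops` steps from `start`."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        # Expand one hop, excluding nodes already reached earlier.
        frontier = {nxt for node in frontier for nxt in graph.get(node, [])} - seen
        seen |= frontier
    return frontier

first = neighbors_at_hop(toy_graph, "A", 1)   # first-order neighbors of A
second = neighbors_at_hop(toy_graph, "A", 2)  # second-order neighbors of A
```

For paper "A" this yields its direct references as first-order neighbors and their references as second-order neighbors.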
As shown in Figure 2, our method aims to generate paper recommendations by combining multiple contextual elements.
First, we calculate the relevance between user-visited and unvisited papers in the citation network using user profiles. The popularity scores of papers are then calculated by combining the keyword hard attention mechanism with the degree of papers in the citation network. Next, using user profiles, we compute topic similarity scores between users and target papers. Finally, we compute the user's interest score for unread papers by combining the above prior information.

Correlation Model of Citation Network Based on Self-Supervised Learning.
In this paper, we consider the citation network as a directed graph and calculate paper embeddings using first-order similarity.
Given the citation network, the recommender system should use as much information as possible to calculate an article's similarity score in the citation network. The model uses an article node's neighbors as contextual information for the current article, assuming that if two articles share more neighbors, their contexts are closer and, thus, they are more relevant. We build a citation correlation score on this assumption. To begin, each article, as a node, has its own low-dimensional vector representation p and a contextual low-dimensional vector representation p', where p' is used when the article serves as context.
For any edge <i, j> in the citation network, the conditional probability of generating v_j from node v_i is given in Equation (1):

p(v_j | v_i) = exp(p'_j^T p_i) / Σ_{k=1}^{|V|} exp(p'_k^T p_i),  (1)

where |V| is the number of nodes. Two nodes are comparable if their contextual distributions are similar, and the contextual distribution should closely match the empirical distribution. The empirical distribution is defined in Equation (2):

p̂(v_j | v_i) = w_ij / d_i,  (2)

where w_ij is the weight of edge <i, j> and d_i is the out-degree of node v_i. KL divergence measures the difference between two probability distributions; we use it as the objective function measuring the difference between the contextual and empirical distributions. With negative sampling to reduce computation, the objective for one edge <i, j> simplifies to Equation (3):

log σ(p'_j^T p_i) + Σ_{n=1}^{K} E_{v_n ∼ P_n(v)} [log σ(−p'_n^T p_i)],  (3)

where σ(·) is the sigmoid function, E is the mathematical expectation, and, following the setting of [44], P_n(v) ∝ d_v^0.75. At each step, the model samples one edge <i, j> from the citation network as a positive sample and samples K nodes n from the noise distribution P_n(v) to form K negative samples <i, n>; the objective is optimized over both positive and negative samples.

If a new paper has few citations, as shown in Figure 3, the learning of its embedding vector is not good enough. We intuitively adopt sampling of neighbors' neighbors to alleviate this problem: for papers with few citations in the citation network, second-order neighbor sampling is used. For example, for the path i → k → j, the empirical distribution is given in Equation (4):

p̂₂(v_j | v_i) = Σ_k (w_ik / d_i)(w_kj / d_k).  (4)

Our self-supervised citation graph embedding method is summarized in Algorithm 1. Finally, the model calculates the citation network correlation score as N(u_i, v_j) = σ(Σ_{v_k ∈ N_{u_i}} p'_j^T p_k), where N_{u_i} is the set of papers in user u_i's reading record. The degree d_{v_j} is read from the degree matrix D of the citation network. Following Mikolov et al. [44], we set the power of d_v to 0.75.
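The edge-sampling optimization described above (one positive edge plus K negatives drawn from P_n(v) ∝ d_v^0.75) can be sketched as follows. This is a minimal illustrative implementation on a hypothetical toy edge list, not the authors' code; hyperparameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy citation edges (citing -> cited), node ids 0..4.
edges = [(0, 1), (0, 2), (3, 1), (3, 2), (4, 2), (1, 4), (2, 4)]
n_nodes, dim, k_neg, lr = 5, 8, 2, 0.05

# Each paper gets a vertex vector p and a context vector p', as in Section 3.1.
p = rng.normal(scale=0.1, size=(n_nodes, dim))
p_ctx = rng.normal(scale=0.1, size=(n_nodes, dim))

# Noise distribution P_n(v) proportional to d_v^0.75 over in-degrees [44].
deg = np.bincount([j for _, j in edges], minlength=n_nodes).astype(float)
noise = deg ** 0.75
noise /= noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(2000):
    i, j = edges[rng.integers(len(edges))]            # positive edge <i, j>
    negs = rng.choice(n_nodes, size=k_neg, p=noise)   # k negative samples <i, n>
    for node, label in [(j, 1.0)] + [(n, 0.0) for n in negs]:
        # Stochastic gradient step on the negative-sampling objective.
        g = lr * (label - sigmoid(p[i] @ p_ctx[node]))
        p_i_old = p[i].copy()
        p[i] += g * p_ctx[node]
        p_ctx[node] += g * p_i_old

# Correlation between a read paper (vertex vector) and a candidate (context vector).
sim_12 = sigmoid(p[1] @ p_ctx[2])
```

The inner update is the standard stochastic gradient of Equation (3) for one positive or negative pair.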

Methods for Popularity.
A paper's keywords, which clearly identify its major concepts, are a significant component. Therefore, an article's keywords can be used to categorize it, and the relevance of this categorization can be used to estimate a user's relevance score for an unselected article. Moreover, an article's popularity reflects the research community's interest in the information it offers. Hence, it is advantageous to exploit popularity in article recommendation. We propose a new attention technique that blends user keyword preferences and article popularity, computed in two steps: popularity analysis and keyword-weighted popularity computation.

Popularity Analysis.
We use the citation network to calculate the objective popularity of an article, which should not shift in the short term. We adopt a weight-setting method similar to that in weighted matrix factorization [45] to calculate objective popularity, as shown in Equation (5):

pop(v) = 1 + log(1 + d/ε),  (5)

where d is the degree of the article in the citation network and ε controls the rate at which popularity grows with the citation count; the logarithmic function keeps popularity from growing too quickly. This approach ensures that every paper has a nonzero popularity weight, and as the citation count increases, the popularity of the paper increases accordingly. The popularity of an uncited article is 1.
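Assuming Equation (5) takes the weighted-MF-style form pop(v) = 1 + log(1 + d/ε) (our reconstruction; the names below are ours), the popularity weight can be computed directly:

```python
import math

EPSILON = 0.5  # rate parameter; the paper reports eps = 0.5 works best

def popularity(degree, eps=EPSILON):
    """Objective popularity of a paper from its citation degree.
    Uncited papers get exactly 1; growth is logarithmic in the degree."""
    return 1.0 + math.log(1.0 + degree / eps)

popularity(0)    # uncited paper -> 1.0
popularity(100)  # heavily cited paper: popularity grows, but slowly
```

The logarithm is what makes the weight "smooth" across uncited and heavily cited papers: each additional citation contributes less than the previous one.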

Personalized Popularity Scoring Based on a Hard Attention Mechanism over Keywords.
We use a hard attention mechanism to personalize the popularity of papers. First, we define c(u_i, v_j), the keyword preference of user u_i for article v_j, as the ratio of the intersection of the keyword sets of user u_i and article v_j to their union:

c(u_i, v_j) = |C_{u_i} ∩ C_{v_j}| / |C_{u_i} ∪ C_{v_j}|,

where C_{u_i} and C_{v_j} denote the keyword sets of the user and the article, respectively.
Then, we use the hard attention mechanism to weight the popularity of the unread article v_j under each keyword, p(c_g, v_j). The keyword popularity p(u_i, v_j) of user u_i on article v_j is defined in Equation (6):

p(u_i, v_j) = c(u_i, v_j) · Σ_{c_g ∈ C_{u_i} ∩ C_{v_j}} p(c_g, v_j),  (6)

where C is the set of keywords in the article collection; only keywords shared by the user and the article (hard attention) contribute to the sum. A larger keyword popularity p(u_i, v_j) means that the classification of the article better matches the preferences of user u_i.
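A sketch of the keyword preference and the hard-attention weighting. The per-keyword popularity table and all names are hypothetical, and the exact combination in Equation (6) is our reading of the text, not a verified reproduction.

```python
def jaccard(user_keywords, paper_keywords):
    """Keyword preference c(u_i, v_j): Jaccard overlap of the two keyword sets."""
    u, v = set(user_keywords), set(paper_keywords)
    return len(u & v) / len(u | v) if u | v else 0.0

def personalized_popularity(user_keywords, paper_keywords, keyword_pop):
    """Hard attention: only keywords shared by user and paper contribute; their
    per-keyword popularity scores (playing the role of p(c_g, v_j)) are summed
    and weighted by the keyword preference c(u_i, v_j)."""
    c = jaccard(user_keywords, paper_keywords)
    selected = set(user_keywords) & set(paper_keywords)  # hard 0/1 attention
    return c * sum(keyword_pop.get(k, 0.0) for k in selected)

# Hypothetical example: the user follows recommender systems and matrix factorization.
pop = {"recommender systems": 3.2, "matrix factorization": 2.1, "optics": 4.0}
score = personalized_popularity(
    {"recommender systems", "matrix factorization"},
    {"recommender systems", "optics"},
    pop,
)
```

Note that a globally popular paper with no keyword overlap scores 0: the hard attention gate suppresses popularity that is irrelevant to the user.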

Topic Relevance Model
3.4.1.Topic Model.To compute topic similarity, we construct an LDA model, as shown in Figure 4.This model has two latent variables: (1) paper-topic distribution Θ and (2) topic-word distribution B. We believe the topic distribution of papers and their readers is similar.We aggregate the bagof-words d lj of any paper, and, similarly, we aggregate all the papers read by any user into his user documents and form the bag-of-words d ui , so that we obtain multiple document sets, where each document corresponds to a paper or a user.Input: G V j j× V j j : Adjacency matrix for citation networks; D: The matrix of degrees in a citation network; K: latent vector dimension Output: p k f g: Latent vector matrix for papers 1: Randomly initialize v k f g; 2: For v from 1:-V-Do 3: Random select negative node samples by ALGORITHM 1: Self-supervised citation embedding. 6

Journal of Sensors
The generative process of LDA is as follows: (1) for each paper and user document, draw topic proportions θ_j ∼ Dirichlet(α); (2) for each word n, draw a topic assignment z_{j,n} ∼ Mult(θ_j) and a word w_{j,n} ∼ Mult(β_{z_{j,n}}). Directly calculating θ and β is difficult, so we estimate their values using variational inference. We define the topic score between user u_i and article v_j as the similarity between the user topic distribution θ_i and the article topic distribution θ_j, and we choose the Jensen-Shannon divergence, as shown in Equation (7):

JS(θ_i, θ_j) = ½ KL(θ_i ‖ m) + ½ KL(θ_j ‖ m),  m = (θ_i + θ_j)/2,  (7)

where KL(·‖·) is the Kullback-Leibler distance. The topic interest score is then defined in Equation (8):

S(u_i, v_j) = 1 − JS(θ_i, θ_j).  (8)

The range of S(u_i, v_j) is between 0 and 1, and the closer the article is to the reader's topic interests, the closer the topic score S(u_i, v_j) is to 1.
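The topic score can be computed directly from two topic distributions. Base-2 logarithms are assumed here so that the Jensen-Shannon divergence, and hence the score, stays in [0, 1]; the function names are ours.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence (base 2) between two discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def topic_score(theta_u, theta_v):
    """S(u_i, v_j) = 1 - JS(theta_u, theta_v); 1 means identical topic interests."""
    m = [(a + b) / 2 for a, b in zip(theta_u, theta_v)]
    js = 0.5 * kl(theta_u, m) + 0.5 * kl(theta_v, m)
    return 1.0 - js

topic_score([0.7, 0.2, 0.1], [0.7, 0.2, 0.1])  # identical interests
topic_score([1.0, 0.0, 0.0], [0.0, 0.0, 1.0])  # completely disjoint interests
```

With base-2 logs, JS divergence is bounded by 1, so the score of two papers with fully disjoint topic mass is 0 and the score of identical distributions is 1.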

Context-Aware Probability Matrix Factorization.
We propose a context-aware probabilistic matrix factorization article recommendation algorithm based on the preceding sections.

Context Fusion.
R = [r_ui] is the user-article reading record matrix. For article recommendation, we use topic information, citation information, popularity information, and keyword information to fit R. We use the multiplicative rule [46][47][48] to fuse topic, citation, and keyword popularity scores, which effectively improves system robustness and cold-start performance. Based on the multiplicative rule, a unified preference score is defined in Equation (9):

C(u_i, v_j) = S(u_i, v_j) · N(u_i, v_j) · p(u_i, v_j),  (9)

where S(u_i, v_j) is the topic score, N(u_i, v_j) is the citation score, and p(u_i, v_j) is the personalized popularity score.
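The multiplicative fusion of Equation (9) is a one-liner; the sample scores below are made up to show its soft-AND behavior.

```python
def fused_preference(topic_score, citation_score, pop_score):
    """Multiplicative fusion of the three context scores. Multiplication acts
    as a soft AND: a paper must do reasonably well on every signal, which
    damps items supported by only a single (possibly noisy) source."""
    return topic_score * citation_score * pop_score

# A paper strong on all signals beats one that is popular but off-topic.
on_topic = fused_preference(0.8, 0.7, 0.9)
off_topic = fused_preference(0.1, 0.2, 0.95)
```

This is the design rationale for the multiplicative rule: an additive fusion would let one large score (e.g., raw popularity) dominate, while a product requires agreement across signals.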

Computational Process.
Each 1 in R_{|U|×|V|} means that article v_j was read by user u_i. There are M users and N articles in total. U_i and V_j are the latent feature vectors of users and articles. We define the conditional distribution of the observed check-ins in Equation (10):

p(R | U, V, σ²) = Π_{i=1}^{M} Π_{j=1}^{N} [N(r_ij | C(u_i, v_j) U_i^T V_j, σ²)]^{I_ij},  (10)

where N(·|μ, σ²) is a spherical Gaussian distribution with mean μ and variance σ², and I_ij is an indicator function (I_ij = 1 if user i has read article j and I_ij = 0 otherwise). We use the function in Equation (11) to fit user u_i's check-in on article v_j:

r̂_ij = C(u_i, v_j) U_i^T V_j.  (11)
The context scores are computed in Section 3, and we use their weighted inner product to improve the probabilistic matrix factorization (PMF) model.
We place zero-mean spherical Gaussian priors on the latent feature vectors of users and articles, as shown in Equations (12) and (13):

p(U | σ_U²) = Π_{i=1}^{M} N(U_i | 0, σ_U² I),  (12)

p(V | σ_V²) = Π_{j=1}^{N} N(V_j | 0, σ_V² I).  (13)
The posterior distribution is then obtained by straightforward Bayesian inference, as shown in Equation (14):

p(U, V | R, σ², σ_U², σ_V²) ∝ p(R | U, V, σ²) p(U | σ_U²) p(V | σ_V²).  (14)

Taking the logarithm of the posterior over the latent features of users and articles gives Equation (15):

ln p(U, V | R, σ², σ_U², σ_V²) = −(1/2σ²) Σ_i Σ_j I_ij (r_ij − C(u_i, v_j) U_i^T V_j)² − (1/2σ_U²) Σ_i ‖U_i‖² − (1/2σ_V²) Σ_j ‖V_j‖² + P,  (15)

where D is the dimension of the latent factors and P is a constant independent of the parameters. Maximizing the log posterior while keeping the hyperparameters of the latent vectors fixed is equivalent to minimizing the squared error with quadratic regularization terms, as shown in Equation (16):

E = ½ Σ_{i=1}^{M} Σ_{j=1}^{N} I_ij (r_ij − C(u_i, v_j) U_i^T V_j)² + (λ_U/2) ‖U‖_F² + (λ_V/2) ‖V‖_F²,  (16)

where λ_U = σ²/σ_U², λ_V = σ²/σ_V², and ‖·‖_F is the Frobenius norm. Equation (16) is optimized by stochastic gradient descent, with updates given in Equations (17) and (18):

U_i ← U_i + η (I_ij (r_ij − C(u_i, v_j) U_i^T V_j) C(u_i, v_j) V_j − λ_U U_i),  (17)

V_j ← V_j + η (I_ij (r_ij − C(u_i, v_j) U_i^T V_j) C(u_i, v_j) U_i − λ_V V_j).  (18)

For a given article, our model uses U_i^T V_j to predict the user's interest score. For cold-start target articles, the calculation is performed based on the context parameters and the global average vectors Ū and V̄. Our proposed method is summarized in Algorithm 2.
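The SGD updates of Equations (17) and (18) can be sketched as below. The reading matrix and context weights are random placeholders, every entry is treated as observed for simplicity, and the learning rate is enlarged so the toy run converges quickly; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
n_users, n_items, dim = 4, 6, 3
lam_u = lam_v = 0.001  # regularization weights, as in Section 4
lr = 0.05              # enlarged learning rate for this toy run

# Hypothetical 0-1 reading matrix R and precomputed context weights
# C[i, j] = S(u_i, v_j) * N(u_i, v_j) * p(u_i, v_j) from Equation (9).
R = rng.integers(0, 2, size=(n_users, n_items)).astype(float)
C = rng.uniform(0.1, 1.0, size=(n_users, n_items))
I = np.ones_like(R)  # indicator matrix; here all entries are treated as observed

U = rng.normal(scale=0.1, size=(n_users, dim))
V = rng.normal(scale=0.1, size=(n_items, dim))

def objective():
    """Squared error plus quadratic regularization, as in Equation (16)."""
    err = I * (R - C * (U @ V.T))
    return 0.5 * (err ** 2).sum() + 0.5 * lam_u * (U ** 2).sum() + 0.5 * lam_v * (V ** 2).sum()

before = objective()
for _ in range(200):  # epochs of per-entry SGD, Equations (17) and (18)
    for i in range(n_users):
        for j in range(n_items):
            if I[i, j]:
                e = R[i, j] - C[i, j] * (U[i] @ V[j])
                grad_u = e * C[i, j] * V[j] - lam_u * U[i]
                V[j] += lr * (e * C[i, j] * U[i] - lam_v * V[j])
                U[i] += lr * grad_u
after = objective()
```

The context weight C_ij scales both the prediction and the gradient, so papers with low fused context scores contribute weaker updates, which is how the context steers the factorization.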

Experiments
4.1. Experimental Setup. We evaluate the performance of our method on academic recommendation datasets.

Dataset Description.
CiteULike is a large social networking site based on sharing academic literature; users can upload, recommend, and manage literature on the site. We use the datasets provided by Wang et al. [15], extracted from the open-source data on the CiteULike website. The two datasets (CiteULike-a and CiteULike-t) include user ratings of papers, titles and abstracts of papers, tags of papers, and citation relations of papers. The citation context of a paper is the most typical contextual relation. As with most datasets in the recommender system domain, the rating data in these two datasets are very sparse. For fairness of comparison, we did not make any adjustments to the datasets. The statistics of the CiteULike datasets are shown in Table 2. As shown in Figure 5, both datasets conform to a long-tail distribution.
4.1.2. Evaluation Indicators. We use implicit feedback to recommend K articles to users and adopt two top-K metrics to evaluate the quality of the recommendation list, Precision@K and Recall@K, defined as follows:

Precision@K = |R(u) ∩ T(u)| / K,

Recall@K = |R(u) ∩ T(u)| / |T(u)|,

where K is the number of articles recommended to the user, R(u) is the top-K list of articles recommended to the user, and T(u) is the set of articles actually accessed by the user. In the LDA model, we set α = 50/T and β = 0.1. The popularity score is sensitive to the parameter ε and works best when ε = 0.5.

ALGORITHM 2: Context-aware probabilistic matrix factorization.
Output: U: user latent vector matrix; V: paper latent vector matrix.
1: Randomly initialize U, V;
2: C(u_i, v_j) ← S(u_i, v_j) · N(u_i, v_j) · p(u_i, v_j);
3: for each r(u_i, v_j) in R_{|U|×|V|} do
4:   update U_i and V_j by Equations (17) and (18);
5: end for
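The two metrics can be implemented directly from their definitions; the recommendation list and ground truth below are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@K = |R(u) ∩ T(u)| / K and Recall@K = |R(u) ∩ T(u)| / |T(u)|,
    where R(u) is the top-K recommendation list and T(u) the set of articles
    the user actually accessed."""
    top_k = list(recommended)[:k]
    hits = len(set(top_k) & set(relevant))
    return hits / k, hits / len(relevant)

# Hypothetical user: 5 recommended papers, 4 truly accessed papers, 2 hits.
prec, rec = precision_recall_at_k(
    ["p1", "p2", "p3", "p4", "p5"],
    ["p2", "p5", "p9", "p10"],
    k=5,
)
# prec = 2/5, rec = 2/4
```

In practice, these per-user values are averaged over all test users to produce the table entries.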

MostPop [49]: MostPop counts the number of occurrences of each item in the training set and ranks the items based on that number.
UserKNN [50]: UserKNN is a recommendation algorithm based on user-based collaborative filtering. The cosine similarity between users is calculated from their check-in data on papers.
ItemKNN [51]: ItemKNN is a recommendation algorithm based on item-based collaborative filtering. Cosine similarity is used in ItemKNN to calculate the similarity between papers.
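Item-item cosine similarity on a 0-1 check-in matrix, as used by ItemKNN, can be sketched as follows (toy matrix, illustration only):

```python
import numpy as np

# Hypothetical 0-1 user-paper check-in matrix (rows: users, columns: papers).
R = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
], dtype=float)

def cosine_item_similarity(R):
    """Pairwise cosine similarity between paper columns of the check-in matrix."""
    norms = np.linalg.norm(R, axis=0)
    norms[norms == 0] = 1.0  # guard against papers with no check-ins
    X = R / norms
    return X.T @ X

S = cosine_item_similarity(R)
# Papers 0 and 1 are read by exactly the same users, so S[0, 1] = 1.
```

The sparsity problem discussed above shows up here directly: columns with very few check-ins yield unreliable similarities, which is why ItemKNN tends to underperform UserKNN on this data.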
PMF [9,52]: PMF is a traditional MF algorithm that represents the hidden feature vectors with a probabilistic graphical model, using a normal distribution prior and Gaussian noise.
CDR [53]: CDR is a hierarchical Bayesian framework for sparsity reduction that combines deep feature representations of item content with user implicit preferences.
CTR [12]: CTR combines MF with topic distributions, which helps to address data sparsity and article cold start.
We randomly select 80% of each CiteULike dataset as training data and the remaining 20% as test data. For a fair comparison, we set the parameters of the different algorithms according to the corresponding literature or their published experimental results; under these settings, each comparison algorithm achieves its best performance. In UserKNN and ItemKNN, the number of similar neighbors of users or papers is 30; in PMF, λ_U = λ_V = 0.001; we set the learning rate of PMF to η = 0.001, and the learning rate of the MF-based paper recommendation algorithms to η = 0.001.
UserKNN and ItemKNN use the 0-1 user-paper matrix to calculate the similarity between users or papers; PMF, CTR, and our method learn the hidden feature vectors of users and papers on the 0-1 user-paper interaction matrix; LDA, CTR, and our method generate topic distributions by learning the bags of user words and paper words. Table 3 shows the experimental results of all the comparison algorithms on the two datasets after proper parameter tuning (bold values signify the best result among the baselines). From Table 3, we can find: (1) Surprisingly, on both datasets, MostPop, which is based on a simple popularity calculation, shows higher recall than the other collaborative filtering algorithms, suggesting that the popularity of papers is very important in academic paper recommendation. (2) Among the memory-based algorithms, UserKNN gives more accurate recommendations than ItemKNN. This is because a large number of academic papers share few user check-ins, so the similarity between academic papers is not as accurate as the similarity between users. This observation is consistent with the study by Bogers and Avd [25].
… context information to help with the task of recommending academic papers. (5) On both datasets, the performance of the context-fusion MF academic paper recommendation algorithm is better than that of the other comparison algorithms, which verifies the effectiveness of our method. Compared with the best results among the comparison algorithms, with Precision@5 as the metric, the improvements of our algorithm on the CiteULike-a and CiteULike-t datasets are 5.1% and 2.2%, respectively; with Recall@5 as the metric, the improvements are 4.2% and 4.7%, respectively. This indicates that using contextual information to modify the Gaussian distribution modeling users' check-in behavior can effectively improve the performance of PMF. (6) On all evaluation metrics, all the comparison algorithms perform better on CiteULike-a than on CiteULike-t. This is because the average number of papers read per user is higher in CiteULike-a, and the CiteULike-t dataset is sparser.

Ablation Experiment.
We evaluate the effect of the context factors on precision and recall. In this section, we fix the topic score, the popularity score, and the citation score to 1, respectively, and observe the changes in precision and recall. Figure 6 illustrates the sensitivity of Precision@5, Precision@10, Recall@5, and Recall@10 to the topic score, popularity score, and citation score on the CiteULike-a dataset. We can see that introducing each source of prior information has a positive effect on the metrics, with the citation correlation having the greatest impact.
4.2.3. Influence of Parameter K_d. The dimensionality K_d of the latent feature vector is another important parameter that affects performance. We vary K_d from 10 to 150 in increments of 10 and observe the changes in recall and precision on the CiteULike-a and CiteULike-t datasets.
The other parameters are set as λ_U = λ_V = 0.001 and η = 0.001. The experimental results are shown in Figure 7. As can be seen from the figures, recall and precision first increase as K_d increases and, after reaching their optimal values, decrease as K_d increases further. On the CiteULike-t dataset, the effect of K_d on recall and precision shows a similar trend, but with smaller optimal values. This observation suggests that increasing K_d excessively introduces noise that reduces the accuracy of the recommendation algorithm.

Conclusion and Future Work
This paper proposes a new context-aware algorithm to address the data sparsity and cold-start issues encountered by paper recommendation algorithms, as well as the relationship between user interest and paper popularity. The method analyzes the influence of various kinds of contextual information on the user's reading behavior and integrates the citation network context, paper text, keywords, and paper popularity into the preference model, so as to extract more accurate user preferences. We also propose a hard attention mechanism to personalize the popularity of academic resources, and experimental results on real datasets show that it can significantly improve the recommendation effect.
In recent years, deep learning [54] has shown great advantages in computer vision and natural language processing, and some researchers have combined deep learning techniques with traditional collaborative filtering, as in the work of Gündogan and Kaya [20], Wang et al. [55], and Hassan [56]. Fusing deep learning techniques with our context-aware method would be an interesting research direction.

FIGURE 3: Toy example of citation network embedding.

FIGURE 4: Topic model for user profile and papers.

4.1.3. Baselines. The following article recommendation techniques are compared with our method:

FIGURE 5: Logarithmic frequency distribution of the CiteULike datasets. (a) Logarithm of paper reading times. (b) Logarithm of paper reading times.

FIGURE 7: Influence of parameter K_d.

TABLE 1: Key symbols in this paper.

TABLE 2: Statistics of datasets.