Information Retrieval and Graph Analysis Approaches for Book Recommendation

A combination of multiple information retrieval approaches is proposed for the purpose of book recommendation. In this paper, book recommendation is based on complex user's query. We used different theoretical retrieval models: probabilistic as InL2 (Divergence from Randomness model) and language model and tested their interpolated combination. Graph analysis algorithms such as PageRank have been successful in Web environments. We consider the application of this algorithm in a new retrieval approach to related document network comprised of social links. We called Directed Graph of Documents (DGD) a network constructed with documents and social information provided from each one of them. Specifically, this work tackles the problem of book recommendation in the context of INEX (Initiative for the Evaluation of XML retrieval) Social Book Search track. A series of reranking experiments demonstrate that combining retrieval models yields significant improvements in terms of standard ranked retrieval metrics. These results extend the applicability of link analysis algorithms to different environments.


Introduction
In recent years, document retrieval and recommendation have become more and more popular in many Web 2.0 applications where user can request documents. There has been much work done both in the industry and academia on developing new approaches to improve the performance of retrieval and recommendation systems over the last decade. The interest in this area still remains high to help users to deal with information overload and provide recommendation or retrieval content (books, restaurants, movies, academic publications, etc.). Moreover, some of the vendors have incorporated recommendation capabilities into their commerce services, for example, Amazon in book recommendation.
Existing document retrieval approaches need to be improved to satisfy user's information needs. Most systems use classic information retrieval models, such as language models or probabilistic models. Language models have been applied with a high degree of success in information retrieval applications [1][2][3]. This was first introduced by Ponte and Croft in [4]. They proposed a method to score documents, called query likelihood. It consists of two steps: estimate a language model for each document and then rank documents according to the likelihood scores resulting from the estimated language model. Markov Random Field model was proposed by Metzler and Croft in [5]; it considers query term proximity in documents by estimating term dependencies in the context of language modeling approach. From the existing probabilistic models, InL2 a Divergence from Randomnessbased model was proposed by Amati and Van Rijsbergen in [6]. It measures the global informativeness of the term in the document collection. It is based on the idea that "the more the term occurrences diverge from random throughout the collection, the more informative the term is." The limit of such models is that the distance between query terms in documents is not considered.
In this paper, we present an approach that combines probabilistic and language models to improve the retrieval performances and show that the two models combined act much better in the context of book recommendation.
In recent years, an important innovation in information retrieval appeared which consists of algorithms developed 2 The Scientific World Journal to exploit the relationships between documents. One of the important algorithm is Google's PageRank [7]. It has been successful in Web environments, where the relationships are provided with the existing hyperlinks into documents. We present a new approach for document retrieval based on graph analysis and exploit the PageRank algorithm for ranking documents with respect to a user's query. In the absence of manually created hyperlinks, we use social information to create the Directed Graph of Documents (DGD) and argue that it can be treated in the same manner as hyperlink graphs. Experiments show that incorporating graph analysis algorithms in document retrieval improves the performances in term of the standard ranked retrieval metrics.
Our work focuses on search in the book recommendation domain, in the context of INEX Social Book Search track. The document collection contains Amazon/LibraryThing book descriptions and the queries, called topics, are extracted from the LibraryThing discussion forums.
In the rest of the paper, we presented a summary of related work in document retrieval and recommender systems. Then, we describe briefly the used retrieval models and show the combination method. In Section 5, we illustrate the graph modeling method followed by the different experiments and results.

Related Work
This work is first related to the area of document retrieval models, more specially language models and probabilistic models. The unigram language models are the most used for ad hoc information retrieval work; several researchers explored the use of language modeling that captures higher order dependencies between terms. Bouchard and Nie in [9] have showed significant improvements of retrieval effectiveness with a new statistical language model for the query based on three different ways: completing the query by terms in the user's domain of interest, reordering the retrieval results, or expanding the query using lexical relations extracted from the user's domain of interest.
Divergence from Randomness (DFR) is one of several probabilistic models that we have used in our work. Abolhassani and Fuhr have investigated several possibilities for applying Amati's DFR model [6] for content-only search in XML documents [10].
This work also relates to the category of graph based document retrieval. There has been increasing use of techniques based on graphs constructed by implicit relationships between documents. Kurland and Lee performed structural reranking based on centrality measures in graph of documents which has been generated using relationships between documents based on language models [11]. In [12], Lin demonstrates the possibility to exploit document networks defined by automatically generated content-similarity links for document retrieval in the absence of explicit hyperlinks. He integrates the PageRank scores with standard retrieval score and shows a significant improvement in ranked retrieval performance. His work was focused on search in the biomedical domain, in the context of PubMed search engine.

INEX Social Book Search Track and Test Collection
SBS task (http://social-book-search.humanities.uva.nl/) aims to evaluate the value of professional and user's metadata for book search on the Web. The main goal is to exploit search techniques to deal with complex information needs and complex information sources that include user profiles, personal catalogs, and book descriptions. The SBS task builds on a collection of 2.8 million book description crawled by the University of Duisburg-Essen from Amazon (http://www.amazon.com/) [13] and enriched with content from LibraryThing (http://www.librarything .com/). Each book is identified by an ISBN and is an XML file. It contains content information like title information, Dewey Decimal Classification (DDC) code (for 61% of the books), category, Amazon product description, and so forth. Amazon records contain also social information generated by users like tags, reviews, and ratings. For each book, Amazon suggests a set of "Similar Products" which represents a result of computed similarity based on content information and user behavior (purchases, likes, reviews, etc.) [14].
SBS task provides a set of queries called topics each year where users describe what they are looking for (books for a particular genre, books of particular authors, similar books to those that have been already read, etc.). These requests for recommendations are natural expressions of information needs for a large collection of online book records. The topics are crawled from LibraryThing discussion forums.
The topic set consists of 680 topics and 208 topics in 2014 and 2015, respectively. Each topic has a narrative description of the information needs. The topic set of 2015 is a subset of that of 2014. Each topic consists of a set of fields. In this contribution we use title, mediated query (query description), and narrative fields. An example of topic is illustrated in Figure 1.

Retrieval Models
This section presents brief description and combination method of retrieval models used for book recommendation.

InL2 of Divergence from Randomness.
We used InL2, Inverse Document Frequency model with Laplace aftereffect and Normalization 2. This model has been used with success in different works [15][16][17][18]. InL2 is a DFR-based model (Divergence from Randomness) based on the Geometric distribution and Laplace law of succession. The DFR models are based on this idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word in the document " [19]. For this model, the relevance score of a document for a query is given by where qtw is the query term weight given by qtf/qtf max ; qtf is the query term frequency; qtf max is the maximum query The Scientific World Journal 3 <title>Which LISP?</title> <group>Purely Programmers</group> <narrative>It'll be time for me to shake things up and learn a new language soon. I had started on Erlang a while back and getting back to it might be fun. But I'm starting to lean toward Lisp-probably Common Lisp rather than Scheme. Anyone care to a good first Lisp book? Would I be crazy to hope that there's one out there with an emphasis on using Lisp in a web and/or system administration context? Not that I'm unhappy with PHP and Perl, but the best way for me to find the time to learn new language is to use it for my work. . . </narrative> </topic> <topic id="1116"> <mediated_query>introduction book to Lisp</mediated_query> term frequency among the query terms. is the number of documents in the whole collection and is the number of documents containing . tfn corresponds to the weighted sum of normalized term frequencies tf for each field , known as Normalization 2 and given by where tf is the frequency of term in the document ; is the length of the document in tokens and avg is the average length of all documents; is a hyperparameter that controls the normalization applied to the term frequency with respect to the document length.

Sequential Dependence Model of Markov Random Field.
Language models are largely used in document retrieval search for book recommendation [16,20]. Metzler and Croft's Markov Random Field (MRF) model [21,22] integrates multiword phrases in the query. Specifically, we used the Sequential Dependence Model (SDM), which is a special case of MRF. In this model cooccurrence of query terms is taken into consideration. SDM builds upon this idea by considering combinations of query terms with proximity constraints which are single term features (standard unigram language model features, ), exact phrase features (words appearing in sequence, ), and unordered window features (require words to be close together, but not necessarily in an exact sequence order, ). Finally, documents are ranked according to the following scoring function: where feature weights are set based on the author's recommendation ( = 0.85, = 0.1, = 0.05) in [20].
, , and are the log maximum likelihood estimates of query terms in document as shown in Table 1, computed over the target collection using a Dirichlet smoothing. We applied this model to the queries using Indri (http://www.lemurproject.org/indri/) Query Language (http://www.lemurproject.org/lemur/IndriQueryLanguage .php).

4.3.
Combining Search Systems. The use of different retrieval systems retrieves different sets of documents. Combining the output of many search systems, in contrast to using just a single retrieval technique, can improve the retrieval effectiveness as shown by Belkin et al. in [23] where the authors have combined the results of probabilistic and vector space models. In our work, we combined the results of InL2 model and SDM model. The retrieval models use different weighting schemes; therefore we should normalize the scores. We used the maximum and minimum scores according to Lee's formula [24] as follows: It has been shown in [16] that InL2 and SDM models have different levels of retrieval effectiveness; thus it is necessary to weight individual model scores depending on their overall performance. We used an interpolation parameter ( ) that we have varied to get the best interpolation that provides better retrieval effectiveness.

Graph Modeling
We have studied the INEX SBS collection to link documents. In [25], the authors have used PubMed collection and exploited networks defined by automatically generated content-similarity links for document retrieval. In our case, we exploited a special type of similarity based on several factors. This similarity is provided by Amazon and corresponds to "Similar Products" given generally for each visited book.

4
The Scientific World Journal Table 1: Language modeling-based unigram and term weighting functions. Here, , is the number of times term matches in document , , is the number of times term matches in the entire collection, | | is the length of document , and | | is the size of the collection. Finaly, is a weighting function hyperparameter that is set to 2500.
Weighting Description Weight of exact phrase " +1 " in document .
(1) init ← Retrieving Documents for each ∈ (2) for each for each StartingNode in StartingNodes do (5) graph ← graph + neighbors(StartingNode, DGD) (6) SPnodes ← all ∈ ShortestPath(StartingNode, StartingNodes , DGD) (7) graph ← graph + SPnodes (8) Delete all duplications from graph (9) final ← final + ( + graph ) (10) Delete all duplications from final (11) Rerank final Algorithm 1: Retrieving based on DGD feedback. Amazon suggests books in "Similar Products" according to there similarity to the consulted book. The degree of similarity is relative to user's social information like clicks or purchases and content based information like book attributes (book description, book title, etc.). The exact formula that combines social and content based information to compute similarity is not delivered by Amazon.
To perform data modeling into DGD, we extracted the "Similar Products" links between documents. The constructed DGD will be used to enrich returned results by the retrieval models; that is, we use the graph structure in the same spirit of pseudorelevance-feedback algorithms. Our method can potentially serve to greatly enhance the retrieval performances.
In this section we use some fixed notations. The collection of documents is denoted by . In , each document has a unique ID. The set of queries called topics is denoted by , the set init ⊂ refers to the documents returned by the initial retrieval model. StartingNode indicates document from init which is used as input to graph processing algorithms in the DGD. The set of documents present in the graph is denoted by .
indicates the documents retrieved for topic ∈ . Each node in the DGD represents document (Amazon description of book) and has set of properties: (i) ID: book's ISBN; (ii) content: book description that includes many other properties (title, product description, author(s), users' tags, content of reviews, etc.); (iii) MeanRating: average of ratings attributed to the book; (iv) PR: value of book PageRank.  Nodes are connected with directed links; given nodes { , } ∈ , if points to , is suggested as Similar Product to . In Figure 2, we show an example of DGD, network of documents. The DGD network contains 1.645.355 nodes (89.86% of nodes are in the collection and the rest do not belong to it) and 6.582.258 relationships.

Our Approach.
The DGD network contains useful information about documents that can be exploited for document retrieval. Our approach is based first on results of traditional retrieval approach and then on the DGD network to find other documents. The idea is to suppose that the suggestions given by Amazon can be relevant to the user queries.
We introduce the algorithm called "retrieving based on DGD feedback." Algorithm 1 takes the following as inputs: retrieval techniques described in Section 3, DGD network, and parameter which is the number of the top selected StartingNode from init denoted by StartingNodes . We fixed to 100 (10% of the returned list for each topic). The algorithm returns list of recommendations for each topic denoted by " final ". It processes topic by topic and extracts the list of all neighbors for each StartingNode. It performs mutual Shortest Paths computation between all selected StartingNode in DGD. The two lists (neighbors and nodes in computed Shortest Paths) are concatenated; after that all duplicated nodes are deleted. The set of documents in returned list is denoted by " graph ". A second concatenation is performed between initial list of documents and graph (all duplications are deleted) in new final list of retrieved documents; " final " reranked using different reranking schemes. Figure 3 shows the architecture of the document retrieval approach based on DGD feedback. The numbers on arrows represent instructions in Algorithm 1.

Experiments and Results
In this section, we describe the experimental setup we used for our experiments. Furthermore, we present the different reranking schemes used in previously defined approaches.

Experiments Setup.
For our experiments, we used different tools that implement retrieval models and handle the graph processing. First, we used Terrier (Terabyte Retriever) (http://terrier.org/) Information Retrieval framework developed at the University of Glasgow [26][27][28]. Terrier is a modular platform for rapid development of large-scale IR applications. It provides indexing and retrieval functionalities. It is based on DFR framework and we used it to deploy InL2 model described in Section 4.1. Further information about Terrier can be found at http://ir.dcs.gla.ac.uk/terrier. A preprocessing step was performed to convert INEX SBS corpus into Trec Collection Format (http://lab .hypotheses.org/1129), by considering that the content of all tags in each XML file is important for indexing; therefore the whole XML file was transformed on one document identified by its ISBN. Thus, we just need two tags instead of all tags in XML, the ISBN and the whole content (named text).
Secondly, Indri (http://www.lemurproject.org/indri/), Lemur Toolkit for Language Modeling and Information Retrieval was used to carry out a language model (SDM) described in Section 4.2. Indri is a framework that provides state-of-the-art text search methods and a rich structured query language for big collections (up to 50 million documents). It is a part of the Lemur project and developed by researchers from UMass and Carnegie Mellon University. We used Porter stemmer and performed Bayesian smoothing with Dirichlet priors (Dirichlet prior = 1500).
In Section 5.1, we have described our approach based on DGD which includes graph processing. We used Net-workX (https://networkx.github.io/) tool of Python to perform shortest path computing, neighborhood extraction, and PageRank calculation.
To evaluate the results of retrieval systems, several measurements have been used for SBS task: Discounted Cumulative Gain (nDCG), the most popular measure in IR [29], Mean Average Precision (MAP) which calculates the mean of average precisions over a set of queries, and other measures: Recip Rank and Precision at the rank 10 (P@10).

Reranking Schemes.
In this paper, we have proposed two approaches. The first (see Section 4.3) consists of merging the results of two different information retrieval models which are the language model (SDM) and DFR model (InL2). For topic , each of the models gives 1000 documents and each retrieved document has an associated score. The linear combination method uses the following formula to calculate final score for each retrieved document by SDM and InL2 models: where score InL2 ( , ) and score SDM ( , ) are normalized scores. is the interpolation parameter set up at 0.8 after several tests on topics 2014, according to the MAP measure shown in Figure 4. The second approach (described in Section 5.1) uses the DGD constructed from the "Similar Products" information. The document set returned by the retrieval model are fused to the documents in neighbors set and Shortest Path results. We tested different reranking methods that combine the retrieval model scores and other scores based on social information. For each document in the resulting list, we calculated the following scores: (i) PageRank, computed using NetworkX tool: it is a wellknown algorithm that exploits link structure to score the importance of nodes in a graph. Usually, it has been used for hyperlink graphs such as the Web [30].
(ii) MeanRatings, information generated by users: it represents the mean of all ratings attributed by users for a book.
The computed scores were normalized using this formula: normalizedScore = oldScore/maxScore. After that, to combine the results of retrieval system and each of normalized scores, an intuitive solution is to weight the retrieval model scores with the previously described scores (normalized PageRank and MeanRatings). However, this would favor documents with high PageRank and MeanRatings scores even though their content is much less related to the topics.

Results.
In this section, we discuss the results we achieved by using the InL2 retrieval model, its combination to the SDM model, and retrieval system proposed in our approach that uses the graph structure DGD.
We used two topic sets provided by INEX SBS task in 2014 (680 topics) and 2015 (208 topics). The systems retrieve 1000 documents per topic. The experimental results, which describe the performance of the different retrieval systems on INEX document collection, are shown in Table 2.
As illustrated in Table 2, the system that combines probabilistic model InL2 and the language model SDM (InL2 SDM) achieves a significant improvement comparing to InL2 model (baseline). The two systems do not provide similar level of retrieval effectiveness (as proved in [16]). The results of run InL2 DDG PR using the 2015 topic set confirm that incorporating PageRank scores using our approach based on DGD network improves ranked retrieval performance but decreases the baseline performances when using the 2014 topic set. The run that uses the MeanRatings property (InL2 DGD MnRtg) to rerank retrieved documents is the lowest performer in terms of all measurements for 2014 topic set. That means that ratings given by users do not help to improve the reranking performances for 2014 topic set.
Notice that 2015 topic set contains a subset of 2014 topic set. We can clearly observe the difference between results of systems using the two topic sets. We think that the main reason is the evaluation processes that are not the same. Where analysing the qrels of common topics between 2014 and 2015 sets, we found that relevancy values are not the same in most cases. More details on the evaluation processes used in 2014 and 2015 can be found in [31] and http://social-book-search.humanities.uva.nl/#/suggestion. Nevertheless, the depicted results confirm that we are starting with competitive baseline, the improvements contributed by combining the retrieval systems' outputs and social link analysis are indeed meaningful.

Conclusion and Future Work
This paper proposed and evaluated two approaches of document retrieval in the context of book recommendation. We used the test collection of INEX Social Book Search track and the proposed topics in 2014 and 2015. We presented the first approach that combines the outputs of probabilistic model (InL2) and language model (SDM) using a linear interpolation after normalizing scores of each retrieval system. We have shown a significant improvement of baseline results using this combination.
This paper also proposed a novel approach based on Directed Graph of Documents (DGD). It exploits social link structure to enrich the returned document list by traditional retrieval model (InL2, in our case). We performed a reranking method using PageRank and ratings of each retrieved document.
For future work, we would like to test the proposed approaches in this paper on another test collection which consists of OpenEdition Portal. It is dedicated to electronic resources in the humanities and social sciences. We would like to explore citation links between scientific documents The Scientific World Journal 7 extracted using Kim et al. [32]. Using the traversal algorithms we will develop a new way to retrieve documents. Another interesting extension of this work would be using the learning to rank techniques to automatically adjust the settings of reranking parameters.