Hierarchical Aggregation for Reputation Feedback of Services Networks

Product ratings are popular tools for supporting consumers' buying decisions, and they are also valuable for online retailers. In online marketplaces, vendors can use rating systems to build trust and reputation. To build trust, it is important to evaluate the aggregate score for an item or a service. An accurate aggregation of ratings can embody the true quality of offerings, which not only helps providers adjust their operation and sales tactics but also helps consumers with discovery and purchase decisions. In this paper, we propose a hierarchical aggregation model for reputation feedback, in which state-of-the-art feature-based matrix factorization models are used. We first present our motivation. Then, we propose feature-based matrix factorization models. Finally, we address how to utilize these models to formulate the hierarchical aggregation model. Through a set of experiments, we find that the aggregate score calculated by our model is greater than the corresponding value obtained by the state-of-the-art IRURe, and that the outputs of our models better match the true rank orders.


Introduction
With the advances and rapid proliferation of Web 2.0 innovations, many sites on the World Wide Web offer consumers the possibility of sharing their experiences with products and services through reviews and ratings. Consumer feedback can not only rank a wide variety of online offerings, but also enable ease of discovery of more useful products and build trust in marketplaces. Moreover, positive consumer feedback contributes to increase in visibility and sale of offerings [1,2].
Therefore, an accurate model of consumer feedback aggregation is critical for the decision-making and marketing strategies of marketplaces; it can help users avoid bad choices and drive them toward more useful items.
Our goal in this paper is to study the problem of modeling consumer feedback from large-scale sale data in order to support personalized and scalable recommendation and demand-forecasting systems. We focus on modeling a hierarchical aggregation method for reputation feedback of services networks.

Motivation.
As shown in Figure 1, shopping is an individual's or household's day-to-day activity, which can be simply divided into three stages, i.e., category purchase, product choice, and purchase quantity. For example, Amy would like to buy a carton of milk. When she wanders between fat-free milk and whole milk, she must make a choice. If she picks fat-free milk, she then selects a brand and finally decides the quantity. This purchase process reflects Amy's preferences.
Product preferences are generally reflected by purchase incidence or purchase quantity in a consumer's shopping history. In the field of recommender systems, consumer preference matching is handled well by item-based collaborative filtering [3] and matrix factorization techniques [4]. Moreover, user preferences are also taken into account in service selection [5,6] and service composition [7][8][9][10][11]. To satisfy increasingly complex user requirements, PaaS (API-driven platform as a service cloud) allows quick composition of existing services to deliver packaged solutions. It is very important for solution developers to quickly assess those composite services and to obtain regular feedback on the performance of component services. Only in this way can they dynamically update their compositions to ensure quality. During this assessment, consumer feedback, which is dynamic and ephemeral, plays a decisive role. It is therefore crucial to aggregate consumer feedback efficiently.

Hierarchical Aggregation.
To address the challenging problem of aggregating consumer feedback, in this paper we present a hierarchical aggregation model for reputation feedback.
As shown in Figure 1, we model user shopping as a three-stage decision-making process (and likewise service composition, i.e., service provider selection, service category choice, and quantity decision for each category of atomic service). In a real-world supermarket, products are usually displayed either according to an existing commodity hierarchy or by clustering their associated characteristics (e.g., text descriptions). Each category may consist of several kinds of products for which consumers' purchase decisions share similar patterns. In the sports section of a womenswear department, for example, you may find Adidas or Nike jackets. However, because of different user preferences [12][13][14][15], the stages of a concrete purchase decision-making process are heterogeneous.
In our model, we regard user category purchase as a binary prediction problem. Then, the user chooses one product within the category, which is a categorical prediction problem in which a multinomial distribution is explored to model the choice. Finally, the user determines the quantity of the product, which is a numeric prediction problem. Our reputation feedback procedure, in which binary, categorical, and numeric predictions are combined, is quite different from traditional ways of aggregating feedback. So, new approaches must be developed.
In this paper, we develop a hierarchical aggregation model and extend state-of-the-art feature-based matrix factorization models to include feedback as a factor. To summarize, we make the following contributions:
(a) A generalized feature-based matrix factorization approach is adjusted and applied in our hierarchical feedback aggregation model.
(b) To evaluate the contribution of a node's own ratings to the aggregate score, we present a model which consists of two parts, i.e., the mean rating of the node and the mean rating of the node's universe. Moreover, the preceding models (detailed in Section 4) are used for relevance or weight estimation.
(c) To effectively evaluate the contribution of a node's child nodes to its aggregate score, a model in (14) is presented, where we take into account not only child nodes but also siblings and cousins (which are rarely considered in existing models for reputation feedback). It is a weighted mean of the aggregate scores AS(a_i) of the d child nodes. For each child, its contribution is controlled by two factors, i.e., the trust value of its ratings and the importance of its contribution.
(d) To illustrate the feasibility and efficiency of the proposed framework, we conduct comprehensive experiments. The experimental results show that the proposed framework is effective and efficient in the hierarchical aggregation of consumer feedback using consumer ratings.
The rest of this paper is organized as follows: Section 2 surveys related work on user preference, trust, and reputation management. Section 3 extends GLMix and considers a generalized feature-based matrix factorization (FBMF) model. Section 4 details the hierarchical aggregation model for reputation feedback. Section 5 discusses the experimental settings and results. Finally, Section 6 concludes this paper and outlines future work.

Related Work
The theme of user preference has been richly studied for recommender systems in various application scenarios, for example in content-based approaches [16,17] and collaborative filtering approaches [3,4,18,19]. To improve performance, [20,21] both combine multiple techniques to achieve more complex tasks in hybrid recommender systems. Matrix factorization techniques are the most widely used methods for predicting the missing ratings of a user-item rating matrix due to their accuracy and scalability [18,22-29]. In particular, feature-based matrix factorization techniques have been well studied in [30][31][32][33][34]. Moreover, some researchers have developed efficient tools such as SVDFeature and libFM [35,36]. Zhang et al. [37] presented a generalized linear mixed model (GLMix) for the LinkedIn job recommender system, where a scalable parallel blockwise coordinate descent algorithm was used. In this paper, we are also concerned with user preference, but we focus on aggregating user preference by a hierarchical aggregation model. We build our model upon GLMix to fit different prediction settings.
It is also common for aggregate consumer feedback to influence consumer purchasing behavior [2,38]. Floyd et al. [39] concluded that the volume of reviews, review valence, and the influence of reviewers have a strong influence on purchasing decisions. To measure aggregate consumer preferences, researchers have explored many ways of analyzing online product reviews. For instance, Ghose and Ipeirotis ranked reviews by a consumer-oriented mechanism or a manufacturer-oriented mechanism, which were based on review helpfulness and a review's expected effect on sales, respectively [40]. Xiao et al. [41] proposed an econometric preference measurement model, in which a modified ordered choice model (MOCM) was also presented to extract aggregate consumer preferences from online product reviews. Banic et al. [42] focused on opinion mining by means of sentiment analysis and presented a system for collecting, evaluating, and aggregating user opinions. Zhang et al. [43] proposed a feedback aggregation approach to rank products based on the quality of reviews, which was calculated using a review's credibility as measured by helpfulness votes, relevance to the product, and the posting date of the review. However, all of the above approaches consider only product reviews rather than user ratings.
There are also several studies on the development of trust and reputation management systems, which aim to evaluate the reputation of services based on consumer feedback [44]. To monitor the execution of composite services, Bianculli et al. [45] presented a generic and customizable reputation infrastructure that supports notifications upon changes in service reputation. In [46], Malik and Bouguettaya proposed a framework for establishing trust in service-oriented environments, where different ratings were aggregated to derive a service provider's reputation. Similarly, Wang et al. [47] proposed a reputation measure method for web services, which ensures the accuracy of the reputation measure through two phases, i.e., malicious rating detection and rating adjustment. The work in [48] employed subjective probability theory to evaluate trust for composite services. Different from our work, these works focus on reputation system construction.
Many methods have been proposed to measure aggregate consumer preferences, which can be grouped into three major approaches: survey-based, behavior-based, and online-review-based. Conjoint analysis, which depends strongly on survey data, was explored for preference measurement by Netzer et al. [49]. By collecting users' preference data from surveys or experiments, the survey-based approach can determine how people value the different features that constitute an individual product or service [50,51]. However, surveys are time consuming and costly. To deal with these challenges, some work takes consumers' behavioral data into account to infer aggregate consumer preferences. For example, Fader and Hardie [52] presented a discrete choice model to measure consumer preferences for selected product features. In [53], aggregate consumer preferences were estimated from transaction data and path data. Now that online product reviews are available and accessible, several studies have employed them to measure aggregate consumer preferences. For instance, Decker and Trusov proposed an econometric framework, which consisted of three models (i.e., Poisson regression, negative binomial regression, and latent-class Poisson regression models), to measure aggregate consumer preferences from online product reviews [54]. By analyzing the reviewers' knowledge and their opinion sentiment toward the target products, Li et al. [55] exploited a social intelligence mechanism for extracting and consolidating reviews, which could give enterprises insights for decisions on product portfolio design. Different from previous work, this work focuses on the hierarchical aggregation of consumer reputation feedback.
A complex network is a network that may exhibit properties such as self-organization, self-similarity, attractors, the small-world effect, or scale-freeness. There are abundant examples of systems composed of a large number of highly interconnected dynamical units, such as neural networks, biological and chemical systems, the Internet, and the World Wide Web. To capture the global properties of such systems, we usually model them as graphs whose nodes represent the dynamical units and whose links stand for the interactions between them [56]. In [57], the authors surveyed measurements capable of expressing the most relevant topological features of a complex network, which characterize its connectivity and strongly influence the dynamics of processes executed on it. In [58], the authors explored the toolkit used for studying complex systems, i.e., nonlinear dynamics, statistical physics, and network theory.
At the same time, software networks have attracted more and more attention from various fields of science and engineering [59]. In [60], the optimal software-defined network planning was investigated with multicontrollers, where an adaptive feedback control mechanism was proposed. In [61], the authors explored the community structure of a real complex software network and correlated this modularity information with the internal dynamical processes, which the network is designed to support. Pan et al. [62] presented a systematic approach to investigate the complex software systems by using the k-core theories of complex networks. Wood et al. [63] addressed communication networks through the use of software-defined networking and the use of virtualization, where a comprehensive SDN control plane was needed. In [64,65], the software key classes identification was addressed through the use of algorithms in complex networks.
Finally, a service network is a typical complex adaptive system, and we can reveal the mechanisms of its formation, evolution, and self-organization by means of the theories and methods of complex networks. For instance, in [66], the authors took advantage of complex network theory and existing networked software research to explore the basic characteristics of services and service networks, such as the "small world" and "scale-free" characteristics and the topology of the service network. Zhou and Wang [67] proposed SCAS (service clustering approach using structural metrics) to group services into different clusters, where the metric A2S (atomic service similarity) was utilized to characterize the atomic service similarity. To explore the needs of support tools and service provisioning environments, [68] introduced the architecture of the open-source SONATA system, a service programming, orchestration, and management framework, in which a development toolchain for virtualized network services can be fully integrated with a service platform and orchestration system. Correia et al. [69] proposed a hierarchical SDN-based vehicular architecture, which aims to improve performance when the connection with the central SDN controller is lost. Similarly, for services networks, we model user shopping or service purchasing as a three-stage decision-making process (i.e., provider selection, service or item category choice, and quantity decision for each category), where a generalized feature-based matrix factorization (FBMF) model is used. We also present a hierarchical aggregation model for consumer ratings, so that the true quality of offerings can be embodied. Finally, we show how to combine the above models to raise the aggregation precision. Since the work in [70] is most similar to our approach, we mainly detail the work of [70] in the experiments.

Preliminaries
In this section, we present a generalized feature-based matrix factorization approach, which can be adjusted and applied in our hierarchical feedback aggregation model. The basic notations used in this paper are shown in Table 1.
Generalized linear model (GLM) is widely used for statistical inference and response prediction problems. For example, in order to recommend relevant content to a user, a large number of web companies utilize logistic regression models to predict the probability of the user's clicking on an item (e.g., ad, news article, and job). In scenarios where the data is abundant, constructing a more fine-grained model focusing on user or item level would mostly contribute to more accurate prediction, since both the user's preferences on items and the item's specific attraction for users can be better captured. Some work combines ID-level regression coefficients with the global regression coefficients in a GLM setting [71], and such models are called generalized linear mixed models (GLMix) in the statistical literature.
In this paper, we extend GLMix and consider the generalized feature-based matrix factorization (FBMF) model given in (1). Here, $L(t)$ is the time-aware label matrix, where each element $l_{i,u}(t)$ indicates the label for an item $i$ and a user $u$ at timestamp $t$. Depending on the application, $l_{i,u}(t)$ can be either a real label or a binary label. When users explicitly express their opinions on products, $l_{i,u}(t)$ is a real label, often in the range [1, 5]; when predicting category purchase or product choice, $l_{i,u}(t)$ is a binary label. The original label matrix can be transformed into a numeric matrix $K(t)$ by means of the logit function or the logarithm function. We then decompose $K(t)$ as a product of $\Phi(t)$ and $\Psi(t)$, where $\Phi(t)$ and $\Psi(t)$ embody both explicit features and latent factors from items and users. Each element $k_{i,u}(t)$ of $K(t)$ can be formulated as in (2), where $\langle \cdot, \cdot \rangle$ denotes the inner product. In our model, we simply decompose each prediction into three components, i.e., global effects, observed item/user-specific effects, and latent item-user interactions.
Specifically, for global effects, $g_{i,u}(t)$ includes a set of features for $(i, u, t)$ and $C$ denotes a set of global coefficients, which are estimated but shared across all $(i, u, t)$ triples. For example, the weighted mean rating of the universe of a node and the universal relevance are such features. The second term (i.e., item/user-specific effects) is similar to the random coefficient model [72,73], which includes explicit features with item- or user-dependent coefficients. In our model, the contribution of a node's own ratings and consumer credibility are explicit item- and user-related features. Finally, the latent item-user interaction is designed to capture the remaining latent effects in terms of low-rank user and item factors.
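The three-part decomposition described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function and argument names (`fbmf_score`, `item_side_feats`, and so on) are hypothetical, and the exact form of the item/user-specific term in (2) is assumed to be a sum of two inner products.

```python
from typing import Sequence


def dot(x: Sequence[float], y: Sequence[float]) -> float:
    """Inner product <x, y> of two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a * b for a, b in zip(x, y))


def fbmf_score(
    global_feats: Sequence[float],     # g_{i,u}(t): features of the (i, u, t) triple
    global_coefs: Sequence[float],     # C: coefficients shared by all triples
    item_side_feats: Sequence[float],  # explicit features paired with item-dependent coefficients (assumed layout)
    item_side_coefs: Sequence[float],
    user_side_feats: Sequence[float],  # explicit features paired with user-dependent coefficients (assumed layout)
    user_side_coefs: Sequence[float],
    item_factors: Sequence[float],     # low-rank latent factors of the item
    user_factors: Sequence[float],     # low-rank latent factors of the user
) -> float:
    """Score k_{i,u}(t) as global + item/user-specific + latent interaction effects."""
    global_effect = dot(global_feats, global_coefs)
    specific_effect = dot(item_side_feats, item_side_coefs) + dot(user_side_feats, user_side_coefs)
    latent_effect = dot(item_factors, user_factors)
    return global_effect + specific_effect + latent_effect
```

The resulting score is then passed through a link function (logistic, softmax, or exponential) depending on which stage of the decision process is being modeled.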

Methodology
Table 1: Basic notations.
Notation | Description
 | Item latent factors, user latent factors
$\varsigma_u(t)$ | Probability of user $u$ selecting a category
$\xi_{s',u}(t)$ | Conditional probability of user $u$ purchasing $s'$
OR(a) | Contribution of a's own ratings
CR(a) | Contribution of a's child nodes
MR(a) | Weighted mean rating of node a
UR(a) | Weighted mean rating of the universe of node a
$R_{ai}$ | ith consumer rating of node a
$C^{ai}_u$ | Consumer credibility for $R_{ai}$ of node a
TV(a) | Trust value of ratings of node a
$TV_a$ | Trust votes of node a

To achieve more complex tasks or to mash up data from different data resources by using business process description languages, web services usually need to be composed as workflows (i.e., service processes). As shown in Figure 2, the process of constructing a service process can be simply divided into three stages, i.e., service provider selection, atomic service category choice, and quantity decision for each category of atomic service. In this section, we present an integrated model to produce the aggregation of feedback. Users interact with services from a marketplace where both atomic and composite services are available, refer to existing feedback, and provide feedback based on their own perception. Depending on the context, a service can independently receive direct feedback. Therefore, we aggregate the feedback of a composite service based on not only its direct feedback, but also the aggregate feedback of its components. Below, we detail the hierarchical aggregation method that provides an accurate evaluation of feedback.
Given a service $s'$ in service category $sc$, a user $u$, and a timestamp $t$, we use the following definitions: $SC^{sc}_u(t)$: user $u$ selects the service category $sc$ at time $t$; $S^{s'}_u(t)$: user $u$ selects the service $s'$ at time $t$; $Q^{s'}_u(t) = n$: user $u$'s selection quantity of $s'$ at $t$ is $n$. Thus, focusing on the service category $sc$, user $u$'s preferences can be calculated by the joint probability of choosing a certain quantity of service $s'$ in category $sc$; i.e.,

$$P\big(SC^{sc}_u(t),\, S^{s'}_u(t),\, Q^{s'}_u(t) = n\big) = P\big(SC^{sc}_u(t)\big)\, P\big(S^{s'}_u(t) \mid SC^{sc}_u(t)\big)\, P\big(Q^{s'}_u(t) = n \mid S^{s'}_u(t)\big). \tag{3}$$

Equation (3) can be regarded as a product of three conditional probabilities, which represent the preferences in the three service selection stages. By adopting different link functions in the previous FBMF formulation, these three preferences can be estimated by logistic, categorical, and quantity-based FBMF models.

Service Category Selection (C-FBMF).
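For a given service category $sc$, the probability that user $u$ selects it is the logistic probability

$$\varsigma_u(t) = P\big(SC^{sc}_u(t)\big) = \sigma\big(s^{(cate)}_u(t)\big), \tag{4}$$

where $\sigma(\cdot)$ is the sigmoid function, and $s^{(cate)}_u(t)$ denotes a service category preference score, factorized using (2), where there is only one general "item," i.e., the service category $sc$.

Atomic Service Choice (S-FBMF).
Next, we formulate the probability of selecting an atomic service within a service category as a multinomial distribution via a softmax formulation:

$$\xi_{s',u}(t) = P\big(S^{s'}_u(t) \mid SC^{sc}_u(t)\big) = \frac{\exp\big(s^{(atom)}_{s',u}(t)\big)}{\sum_{s'' \in sc} \exp\big(s^{(atom)}_{s'',u}(t)\big)}. \tag{5}$$

Similarly, the atomic service preference score $s^{(atom)}_{s',u}(t)$ is factorized by (2).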
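As a minimal sketch of the first two stages, the snippet below turns preference scores into a category-selection probability (sigmoid) and a distribution over the services of one category (softmax). The function names and the dictionary-based interface are assumptions made for illustration; the preference scores themselves are taken as given.

```python
import math
from typing import Dict


def category_probability(score: float) -> float:
    """C-FBMF: sigmoid of the category preference score."""
    # Branch on the sign so that exp() never overflows.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


def service_probabilities(scores: Dict[str, float]) -> Dict[str, float]:
    """S-FBMF: softmax over the preference scores of the services in one category."""
    if not scores:
        raise ValueError("a category must contain at least one service")
    top = max(scores.values())  # subtracting the maximum keeps exp() stable
    exps = {service: math.exp(value - top) for service, value in scores.items()}
    total = sum(exps.values())
    return {service: e / total for service, e in exps.items()}
```

Because the softmax normalizes over the services of a single category, the returned probabilities sum to one within that category, which is what makes them usable as conditional probabilities in the joint preference.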

Atomic Service Quantity Decision (Q-FBMF).
The quantity of choosing an atomic service $s'$ follows a shifted Poisson distribution:

$$P\big(Q^{s'}_u(t) = n \mid S^{s'}_u(t)\big) = \frac{\tau_{s',u}(t)^{\,n-1} \exp\big(-\tau_{s',u}(t)\big)}{(n-1)!}, \quad n = 1, 2, \ldots, \tag{6}$$

where $\tau_{s',u}(t) = \exp\big(s^{(quan)}_{s',u}(t)\big)$. Again, we apply (2) to factorize the atomic service quantity preference score $s^{(quan)}_{s',u}(t)$, and we obtain the conditional expectation of the atomic service quantity as

$$E\big[Q^{s'}_u(t) \mid S^{s'}_u(t)\big] = \tau_{s',u}(t) + 1, \tag{7}$$

which can be taken as an estimate of $Q^{s'}_u(t)$.

Consider the generalized hierarchy for service composition shown in Figure 3, based on the composite service decision process in Figure 2. Feedback aggregation is performed for every node at each level of the tree, starting from the bottom with the leaves. In this work, we combine all ratings for a particular node into a single 5-star score. In short, for a node at a higher level, the aggregate score involves not only its own ratings, but also contributions from the lower-level descendants.
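The quantity stage and the joint preference can be sketched as follows. The sketch assumes that the shifted Poisson distribution has support n = 1, 2, ... with rate equal to the exponential of the quantity preference score; the function names are hypothetical.

```python
import math


def quantity_pmf(n: int, score: float) -> float:
    """Q-FBMF: P(Q = n | service chosen) for a Poisson shifted to start at n = 1."""
    if n < 1:
        return 0.0  # a chosen service is assumed to be taken at least once
    tau = math.exp(score)
    return tau ** (n - 1) * math.exp(-tau) / math.factorial(n - 1)


def expected_quantity(score: float) -> float:
    """Conditional expectation of the quantity: the Poisson mean plus the shift of 1."""
    return math.exp(score) + 1.0


def joint_preference(p_category: float, p_service: float, n: int, score: float) -> float:
    """Joint probability of picking the category, the service, and quantity n."""
    return p_category * p_service * quantity_pmf(n, score)
```

For example, with a quantity score of 0 the rate is 1, so the expected quantity is 2 and a quantity of exactly one has probability exp(-1).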
For a node a, its aggregate score AS(a) is calculated by (8), where OR(a) denotes the contribution of a's own ratings, CR(a) represents the contribution of its child nodes, and $\beta$ is a system parameter. If a has no child nodes, then $\beta = 1$, and vice versa.
We evaluate the contribution of a node's own ratings by (9), which consists of two parts, i.e., the mean rating of the node (MR(a)) and the mean rating of the node's universe (UR(a)). If a has few ratings, the existing ratings of its similar nodes (e.g., other instances of a) are used, as it is likely that a's ratings will be analogous to the ratings of similar nodes. So, (9) is a trade-off between MR(a) and UR(a): it is a weighted mean such that nodes with fewer ratings are dominated by the mean rating across similar nodes, while nodes with more ratings are mostly dominated by their own mean rating.
We use (10) to calculate a node's mean rating, which is a weighted mean of the k ratings received by the node. In (10), $R_{ai}$ denotes a rating, and its weight comes from (5). The weight indicates the utility of a service as perceived by the user. $C^{ai}_u$ represents the credibility of the user u who makes the rating and adjusts the rating accordingly. In practice, some users may try to drive the rating score up or down. By adjusting the contribution of each rating based on the respective user credibility weight, we can lower the influence of such fake users.
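The following sketch illustrates the two ideas in this paragraph. `mean_rating` follows the verbal description of (10), weighting each rating by the product of its relevance and the rater's credibility. `own_rating_contribution` is not the paper's (9): it is a generic shrinkage estimate, assumed here only because it reproduces the described behavior, and both `support` and `prior_strength` are hypothetical parameters.

```python
from typing import Sequence


def mean_rating(
    ratings: Sequence[float],
    relevances: Sequence[float],
    credibilities: Sequence[float],
) -> float:
    """Weighted mean of ratings; each weight is relevance times rater credibility."""
    if not (len(ratings) == len(relevances) == len(credibilities)):
        raise ValueError("all inputs must have the same length")
    weights = [r * c for r, c in zip(relevances, credibilities)]
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one rating needs a positive weight")
    return sum(w * x for w, x in zip(weights, ratings)) / total


def own_rating_contribution(
    node_mean: float,            # MR(a)
    universe_mean: float,        # UR(a)
    support: float,              # amount of evidence for the node (assumed: rating count or trust value)
    prior_strength: float = 5.0, # hypothetical constant, not taken from the paper
) -> float:
    """Assumed shrinkage form: little evidence pulls the result toward the universe mean."""
    if support < 0 or prior_strength <= 0:
        raise ValueError("support must be >= 0 and prior_strength must be > 0")
    return (support * node_mean + prior_strength * universe_mean) / (support + prior_strength)
```

A rating from a user with zero credibility drops out of the mean entirely, and a node with no evidence of its own simply inherits the mean rating of its universe.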
Equation (11) is used to evaluate the mean rating of a node's universe. Generally speaking, the universe refers to the set of nodes similar to this node. In this work, we consider just two levels of similarity: siblings and cousins. As shown in (11), a service node a with service category sc may have m siblings and n cousins with $k_1$ and $k_2$ ratings, respectively. The m siblings could be instances of a, which can independently receive direct feedback. The n cousins, however, might come from different service categories, or even from different service providers. $\delta_1$ and $\delta_2$ are the sibling similarity weight and the cousin similarity weight, respectively. Since sibling nodes have a higher degree of similarity than cousin nodes, $\delta_1$ may be greater than $\delta_2$.
In (8), CR(a) represents the contribution of the d child nodes to a's aggregate score. We use (14) to calculate it, which is a weighted mean of the aggregate scores $AS(a_i)$ of the d child nodes. For each child $a_i$, its contribution is controlled by two factors, i.e., the trust value of its ratings and the importance of its contribution, which are denoted as $TV(a_i)$ and $w(a, a_i)$, respectively. $w(a, a_i)$ can be decided by $a_i$'s age, functionality, frequency of usage, etc. From (14), we can conclude that all of a node's descendant nodes contribute to its aggregate score.
We define the trust value by (15), which is an arithmetic mean and consists of two parts, i.e., a node's own trust votes $TV_a$ and the trust values $TV(a_i)$ of its d child nodes. The trust value of a node is a measure of consumer confidence in its ratings and can be used as a replacement for the number of ratings for a service.
A node's own trust votes are obtained by summing the products of the k feedback relevance values $\xi_{a,u}(t)$ and the respective consumer credibility values $C^{ai}_u$ received by the node.
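The bottom-up aggregation can be sketched as a recursion over the service tree. The sketch below is written from the verbal descriptions only and rests on several assumptions: the aggregate score is taken to be the convex combination beta * OR(a) + (1 - beta) * CR(a), the trust value is taken to be the mean of the node's own trust votes and its children's trust values, and the `Node` class, the default `beta`, and the zero-trust fallback are all hypothetical.

```python
from typing import List, Optional, Sequence, Tuple


class Node:
    """A service node with its own-rating contribution, trust votes, and weighted children."""

    def __init__(
        self,
        own_contribution: float,  # OR(a)
        trust_votes: float,       # TV_a
        children: Optional[List[Tuple["Node", float]]] = None,  # (child a_i, importance w(a, a_i))
    ) -> None:
        self.own_contribution = own_contribution
        self.trust_votes = trust_votes
        self.children = list(children) if children else []


def own_trust_votes(relevances: Sequence[float], credibilities: Sequence[float]) -> float:
    """Sum of relevance times credibility over the ratings a node received."""
    if len(relevances) != len(credibilities):
        raise ValueError("inputs must have the same length")
    return sum(r * c for r, c in zip(relevances, credibilities))


def trust_value(node: Node) -> float:
    """Arithmetic mean of the node's own trust votes and its children's trust values."""
    child_values = [trust_value(child) for child, _ in node.children]
    return (node.trust_votes + sum(child_values)) / (1 + len(child_values))


def aggregate_score(node: Node, beta: float = 0.5) -> float:
    """Assumed form: beta * OR(a) + (1 - beta) * CR(a), with beta = 1 for leaves."""
    if not node.children:
        return node.own_contribution
    weights = [trust_value(child) * w for child, w in node.children]
    total = sum(weights)
    if total <= 0:
        return node.own_contribution  # assumed fallback when no child carries any trust
    child_part = sum(
        wt * aggregate_score(child, beta) for wt, (child, _) in zip(weights, node.children)
    ) / total
    return beta * node.own_contribution + (1.0 - beta) * child_part
```

Because each child's score is itself computed recursively, every descendant influences the score of the root, with its influence scaled by the trust values and importance weights along the path.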

Experiments
In this section, we conduct experiments to evaluate our approach.

Datasets.
We compare our FBMF with the method detailed in [70] on multiple public real-world datasets, which were extracted from Amazon.com by McAuley et al. [74]. The datasets contain product reviews (i.e., ratings, text, and helpfulness votes) and product metadata. Specifically, the metadata includes price, title, a list of also viewed products, and a list of also bought products. We preprocess all datasets so that each user has rated at least four products. Table 2 details the statistics of our five datasets, i.e., Baby, Office Products, Pet Supplies, Electronics, and Sports and Outdoors. Figure 4 shows the number of rated products in each dataset.
All experiments are implemented in Java. The hardware environment is a machine with an Intel® Core™ i5 CPU 760 (2.80 GHz) and 4 GB of RAM running Windows 7 (64-bit version).

Relevance Estimation.
In Section 4, we use (5) to model input relevance, i.e., the utility of a service as perceived by the consumer. On Amazon, each review is accompanied by "N people found this helpful" along with Yes and No buttons. Many online malls similarly allow customers to upvote or downvote posted reviews, which gives an indication of their relevance and can be formulated as in (17) [70] (for simplicity, we call this method IRURe). Rel is a weighted mean of the initial relevance (IRe, the former part of (17)) and the universal relevance (URe, the final part of (17)) of a review, where Us denotes the upvotes on a review, Ts is the total votes on a review, and $Ts_{max}$ is the maximum total votes across all reviews in the universe.
In the next section, we conduct several groups of experiments to evaluate the effectiveness and robustness of our approach.

Experimental Results.
Both our model FBMF (in (8)) and IRURe (detailed in [70]) can produce an aggregate score for a node. A higher aggregate score is supposed to mean a better-selling product or a more popular service, but is that really the case?
Actually, it is really difficult to evaluate the true quality of a product due to the subjectivity in the process. To deal with this problem, many researchers try to evaluate the effectiveness of a product ranking system using the sales rank feature of products [39], where the relative rank of a product in a given category is indicated by the amount of its sales. In our experiments, for the five datasets (i.e., Baby, Office Products, Pet Supplies, Electronics, and Sports and Outdoors), we choose the top five aggregate scores, respectively.
Then, for each dataset, we make a pairwise comparison of the true relative sales ranks of products with the ranking order generated by the mentioned models. Through experiments, we analyze how well the outputs of the models match the true rank orders; i.e., a higher aggregate score should translate into a better (smaller) sales rank.
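The pairwise comparison can be sketched as a simple hit-rate computation. This is an illustrative sketch rather than the evaluation code used in the experiments: the function name, the dictionary inputs, and the choice to skip tied pairs are assumptions.

```python
from itertools import combinations
from typing import Dict


def pairwise_hit_rate(scores: Dict[str, float], sales_rank: Dict[str, int]) -> float:
    """Fraction of product pairs whose score order agrees with their sales-rank order."""
    hits = 0
    total = 0
    for a, b in combinations(sorted(scores), 2):
        if scores[a] == scores[b] or sales_rank[a] == sales_rank[b]:
            continue  # assumption: ties carry no ordering information and are skipped
        total += 1
        a_scores_higher = scores[a] > scores[b]
        a_sells_better = sales_rank[a] < sales_rank[b]  # a smaller sales rank is better
        if a_scores_higher == a_sells_better:
            hits += 1
    if total == 0:
        raise ValueError("no comparable product pairs")
    return hits / total
```

For instance, with scores 4.5, 4.0, and 3.0 for three products whose sales ranks are 10, 200, and 50, two of the three pairs are ordered consistently, so the hit rate is 2/3.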
In our experiments, we use the sales rank values in the metadata, which were extracted from Amazon.com by McAuley et al. [74]. Tables 3-7 show the experimental results for the five datasets, respectively. In those tables, the first column lists the IDs of the two compared products. The second and third columns give the aggregate scores obtained by IRURe and FBMF, respectively. For simplicity, all aggregate scores are normalized into the range of zero to five. The corresponding sales ranks of the two products are presented in column 4. The two rightmost columns show the accuracy of IRURe and FBMF in capturing the true rank ordering of the products.
As we can see from Tables 3-7, for each product, the aggregate score calculated by our model is greater than the corresponding value obtained by IRURe. This is attributed to our relevance model, which is detailed in Section 4. The results across the five datasets show that the pairwise orderings generated by FBMF capture the relative ranking of the products at least as well as those generated by IRURe. For example, on the Baby dataset, IRURe missed five pairwise orderings, but FBMF missed only two. In particular, on Electronics and Pet Supplies, FBMF matches every listed pairwise ordering.
In each dataset, there are tens of thousands of product reviews, so we cannot list all product pairs in a table. For simplicity, only the five products with the top five aggregate scores are displayed in Tables 3-7. However, for each dataset, we performed all pairwise comparisons, covering every product that has both reviews and a sales rank. Figure 5 shows the hit rates across the five datasets. As shown in Figure 5, FBMF has a higher hit rate than IRURe in each dataset. In particular, in Electronics, FBMF reaches a hit rate of 93.56%. The results for FBMF vs. IRURe reconfirm that FBMF is able to capture the true relative order, although IRURe also has this capability in most cases.

Conclusions
Consumer feedback, for example, product reviews, is an important source of information for customers to support their buying decisions. Though product reviews are helpful for customers, aggregate responses from the participants indicated that current rating systems also have their weaknesses, especially when the number of reviews is large. It is an important but difficult task to develop a new feedback mechanism and to manage feedback aggregation. In this paper, we propose a hierarchical aggregation model for reputation feedback, which is based on a generalized feature-based matrix factorization model. This model aims to aggregate consumer feedback from large-scale sale data in order to support personalized and scalable recommendation and demand-forecasting systems. We conduct several groups of experiments to evaluate the efficiency and robustness of our approach. Experiments show that FBMF performs well. Currently, we mainly consider ratings. Our future work is to investigate how to incorporate the information of "also viewed products" and "also bought products" into our approach.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Mathematical Problems in Engineering