In social networks, similar users are assumed to prefer similar items, so finding users similar to a target user plays an important role in most collaborative filtering methods. Existing collaborative filtering methods use user ratings of items to search for similar users. Nowadays, the Internet produces abundant social information, such as user profiles, social relationships, behaviors, and interests. User ratings of items alone are not sufficient to search for similar users and recommend wanted items. In this paper, we propose a new collaborative filtering method using social information fusion. Our method first uses social information fusion to search for similar users and then uses these similar users to update the user ratings of items for recommendation. Experiments show that our method outperforms existing methods based on user ratings of items and that using social information fusion to search for similar users is a viable approach for collaborative filtering in recommender systems.
Incorporating social information into collaborative filtering has proved to be an effective way to improve the performance of the recommender system and alleviate the data sparseness problem caused by the cold start [
To address the above issues, we propose a new collaborative filtering method that uses social information fusion to search for similar users and then updates the user ratings of items using those similar users, which alleviates the problem of data sparsity. Finally, we recommend items based on the updated user ratings of items.
The remainder of this paper is organized as follows. In Section
The user-based collaborative filtering recommendation algorithm assumes the interest between similar users (neighbors) is similar [
To alleviate this problem, literature [
Our study finds that every piece of social information reflects an aspect of the user. This paper proposes a novel method based on social information fusion instead of the method based on user ratings of items to search neighbors.
Based on the experimental data provided by KDD CUP 2012 track1, this paper divides a user’s social information into personal description information (Profile), relationship information (Follow), behavior information (Action), and interest information (Interest). Therefore, the social information fusion is expressed as
The similarity between the users is expressed as
The description of each piece of information and its similarity calculation are as follows.
Personal description information includes gender (sex), age (age), and label (tag). The similarity of personal description information is defined as
Relationship information contains the following information (followees) and the fans information (followers). Relationship information similarity is expressed as formula (
Following information and fans information are mutual; for example, user
Behavior information includes comment number (Comment), pointing frequency (At), and forwarding frequency (Retweet). Similarity of behavior information is expressed as follows, and
The above three types of data have the same form and are therefore evaluated in the same way. Taking the comment number as an example, we create a comment matrix over the users. In the comment matrix, each row holds the number of comments a user gave to other users, and each column holds the number of comments a user received. User
Experimental data provide the keywords and the corresponding weights of keywords, both extracted from user’s microblog comments, and classify all the keywords. According to this information, we could calculate the similarity of the user’s interest information.
As is shown in Figure
The relationship of user’s interest.
The calculation process is as follows.
Set keywords set as
Set the user-classification weight which we are going to calculate as
Regard
Then, the similarity calculation of interest information between two users is as follows:
User-based collaborative filtering recommendation creates a rating matrix based on user ratings of items and predicts the possible rating of the target user in the unrated item according to the rating matrix. The recommendation sequence is then generated based on the predicted rating. The sparseness of the rating matrix in the actual data is one of the key factors affecting the accuracy of the entire recommendation. The matrix filling technique [
The experimental data record whether a user likes a target item, and each item carries several keyword attributes. In the data, “1” means like and “-1” means dislike. We use “1” or “-1” as the user’s rating of an item and “0” for an unrated item.
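The encoding just described can be illustrated with a minimal sketch. The function name `build_rating_matrix`, the record layout `(user, item, label)`, and the toy identifiers are assumptions made for illustration; they are not taken from the KDD CUP data format.

```python
def build_rating_matrix(records, users, items):
    """Return a users x items matrix of 1 (like), -1 (dislike), 0 (unrated)."""
    user_index = {user: row for row, user in enumerate(users)}
    item_index = {item: col for col, item in enumerate(items)}
    # Every cell starts as 0, the value used for unrated items.
    matrix = [[0] * len(items) for _ in users]
    for user, item, label in records:
        if label not in (1, -1):
            raise ValueError(f"unexpected label {label!r}")
        matrix[user_index[user]][item_index[item]] = label
    return matrix


# Toy data (hypothetical): two users, five items, three "like" records.
users = ["u1", "u2"]
items = ["item1", "item2", "item3", "item4", "item5"]
records = [("u1", "item1", 1), ("u1", "item2", 1), ("u2", "item3", 1)]
rating_matrix = build_rating_matrix(records, users, items)
```

With these toy records the two rows have no commonly rated item, which is the sparsity situation the matrix filling step is meant to address.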
As is shown in Figure
The initial rating of the item.
| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User | 1 | 1 | 0 | 0 | 0 |
| User | 0 | 0 | 1 | 0 | 0 |
User ratings of items.
The matrix is sparse, and the two users have no commonly rated items. However, there is a connection between the items they like. Taking Item 1 as an example, we divide the user
The processed rating of the item.
| | Item 1 | Item 2 | Item 3 | Item 4 | Item 5 |
|---|---|---|---|---|---|
| User | 3/2 | 4/3 | 1/6 | 1/6 | 1/2 |
| User | 1/4 | 0 | 1 | 1/4 | 1/2 |
This matrix filling process can pass the rating through the implicit relationship between the items to other related items, showing an implicit rating. At the same time, since some ratings of items are the same (1 or -1), personal preferences cannot be highlighted. Through this process, they can be distinguished.
Regard the sets of items as
Create an array,
Select
Select
If
If there is any comment in
If there is any comment in
Add original ratings to
Finally, represent the values in
According to the methods in Section
This section evaluates the ideas presented in this paper through two sets of experiments: the effect of the user’s relationship information on finding similar users, and the collaborative filtering method based on the matrix filling algorithm.
The experimental data in this paper are derived from the data provided in the KDD CUP 2012 track1. The data were collected from Tencent Weibo over a period of time and cover many kinds of users’ social information. These include user information (gender, age, number of microblogs, and tags), relationship information, behavior information (forwarding information, pointing information-@, and rating information), keyword information (keywords, weights, and classifications), and items accepted and rejected by users and their ratings. Because the content of the data is encrypted, screening the data by content is inconvenient. We screened the data as follows.
Randomly select 20 users whose age information is between 15 and 40 years from all users and put them in the queue.
Dequeue a user from the queue, use the user as an experimental user, and put the friends he follows into the queue.
Repeat the second step until the number of experimental users reaches the target number.
After the above procedure is completed, we get 10001 experimental users. The related information includes 846077 pieces of item rating information, 821666 pieces of relationship information, 381208 pieces of behavior information, 16267430 pieces of keywords information, and 363 keywords’ classification information.
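The screening steps above amount to a breadth-first expansion over the follow graph. The sketch below is one possible reading, not the authors' code: the names `pick_seeds` and `sample_users`, the dictionary layouts, and the rule of skipping users that were already selected are assumptions (the text does not say how repeated users are handled).

```python
import random
from collections import deque


def pick_seeds(ages, k, rng):
    """Randomly choose k users whose age is between 15 and 40 (inclusive)."""
    eligible = sorted(user for user, age in ages.items() if 15 <= age <= 40)
    return rng.sample(eligible, k)


def sample_users(follows, seeds, target):
    """Dequeue users one by one and enqueue the users each of them follows."""
    queue = deque(seeds)
    selected, seen = [], set()
    while queue and len(selected) < target:
        user = queue.popleft()
        if user in seen:  # assumption: a user is selected at most once
            continue
        seen.add(user)
        selected.append(user)
        queue.extend(follows.get(user, []))
    return selected


# Toy follow graph (hypothetical): user -> list of followed users.
follows = {"a": ["b", "c"], "b": ["a", "d"], "c": [], "d": ["e"], "e": []}
experimental_users = sample_users(follows, ["a"], target=4)
seeds = pick_seeds({"a": 20, "b": 50, "c": 33, "d": 15}, 2, random.Random(0))
```

Because the queue is first-in, first-out, users close to the seeds in the follow graph are selected first, which keeps the sampled group densely connected.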
First, we form every pair of the 10001 users, which gives 50005000 user pairs (similarity computing groups). For each pair of users, according to the calculation process described in this paper, age similarity, gender similarity, label similarity, following similarity, fan similarity, forwarding similarity, pointing similarity, comment similarity, rating similarity, and interest similarity are calculated, respectively, as shown in Figures
The comparison of age, gender, and label similarity.
The comparison of fans and following similarity.
The comparison of comment, pointing, and forwarding similarity.
The comparison of rating and interest similarity.
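As a quick check of the number of similarity computing groups, the count of unordered pairs among 10001 users can be reproduced directly. The variable names and the four-user toy list are illustrative only.

```python
import math
from itertools import combinations

n_users = 10001
# Number of unordered user pairs: n * (n - 1) / 2.
n_pairs = math.comb(n_users, 2)

# The same pairing on a toy list of four (hypothetical) users.
toy_pairs = list(combinations(["u1", "u2", "u3", "u4"], 2))
```

Here `n_pairs` equals 50005000, matching the figure quoted for the experiment.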
As can be seen from the data in Figure
According to the similarity information calculated above, we use the personal description information, relationship information, behavior information, and interest information to perform the selection calculation of similar users and then find the fusion parameters in each piece of information. The evaluation criteria of similar users are based on the user’s own followers. The description is as follows: select one user
In this formula,
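For readers who want a concrete reference point, the following sketch computes mean average precision (MAP) over Top-N lists in its conventional form. It is an assumption that the evaluation here follows this standard definition; in particular, normalizing by `min(len(relevant), top_n)` and using the accounts linked to a user in the follow data as the relevant set are illustrative choices, and all names are hypothetical.

```python
def average_precision(ranked, relevant, top_n):
    """Average precision of the first top_n candidates against a relevant set."""
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked[:top_n], start=1):
        if candidate in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    denominator = min(len(relevant), top_n)
    return precision_sum / denominator if denominator else 0.0


def mean_average_precision(rankings, ground_truth, top_n):
    """Mean of the per-user average precision values."""
    scores = [
        average_precision(ranked, ground_truth[user], top_n)
        for user, ranked in rankings.items()
    ]
    return sum(scores) / len(scores) if scores else 0.0


# Toy example: candidate similar users per user, and their linked accounts.
rankings = {"a": ["b", "x", "c", "y"], "b": ["x", "y"]}
ground_truth = {"a": {"b", "c", "d"}, "b": {"a"}}
map_value = mean_average_precision(rankings, ground_truth, top_n=4)
```

For user "a" the hits fall at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 3 = 5/9; user "b" has no hits, and the mean over both users is 5/18.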
Taking behavior information as an example, according to the fusion calculation methods,
The MAP results’ transformation of behavior information.
The horizontal axis represents the coefficient. From top to bottom, they are the coefficient of point information, forwarding information, and comment information. As shown in the graph, finding similar users based on behavior information is mainly affected by the similarity of the forwarding information. Each peak is generated when the forwarding information coefficient has a larger value, and each trough is generated when the forwarding information coefficient is the smallest.
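A coefficient sweep of this kind can be sketched as an exhaustive search over a coarse grid. The step size of 0.1, the constraint that the three coefficients sum to 1, the lower bound of 0.1 per coefficient, and the function names are all assumptions; `toy_score` stands in for the MAP evaluation, which is not reproduced here.

```python
def coefficient_grid(steps=10):
    """All triples of positive integers (a, b, c) with a + b + c == steps."""
    return [
        (a, b, steps - a - b)
        for a in range(1, steps - 1)
        for b in range(1, steps - a)
    ]


def best_coefficients(score, steps=10):
    """Return the coefficient triple (as fractions of 1) that maximizes score."""
    best = max(coefficient_grid(steps), key=score)
    return tuple(value / steps for value in best)


def toy_score(triple):
    """Placeholder objective (hypothetical); a real run would return the MAP."""
    pointing, forwarding, comment = triple
    return -(abs(pointing - 2) + abs(forwarding - 5) + abs(comment - 3))


grid = coefficient_grid()
best = best_coefficients(toy_score)
```

Working in integer tenths avoids floating-point drift when checking that the coefficients sum to 1, and with three coefficients the grid has only 36 points, so exhaustive evaluation is cheap.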
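The MAP reaches its maximum when the value on the X-axis is “253”, that is, when the coefficients of pointing, forwarding, and comment information are 0.2, 0.5, and 0.3, respectively. So the fusion formula for behavior information is expressed as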
In the same way, we give the weight of personal description information and relationship information, as shown in Table
The coefficients.
| Fusion formula | Coefficients |
|---|---|
| Personal description information | label information: 0.5; age information: 0.25; gender information: 0.25 |
| Relationship information | following information: 0.5; fans information: 0.5 |
| Behavior information | comments information: 0.3; pointing information: 0.2; forwarding information: 0.5 |
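The coefficients in the table above can be applied as in the following sketch. It assumes that each fusion formula is a linear weighted sum of the component similarities, which is the usual reading of "coefficient" but is not confirmed by the text available here; the dictionary keys and the function name `fuse` are hypothetical.

```python
import math

# Coefficients taken from the table; the key names are illustrative.
PROFILE_WEIGHTS = {"tag": 0.5, "age": 0.25, "sex": 0.25}
RELATIONSHIP_WEIGHTS = {"followee": 0.5, "follower": 0.5}
BEHAVIOR_WEIGHTS = {"comment": 0.3, "at": 0.2, "retweet": 0.5}


def fuse(similarities, weights):
    """Weighted sum of component similarities (assumed linear fusion)."""
    if not math.isclose(sum(weights.values()), 1.0):
        raise ValueError("coefficients are expected to sum to 1")
    return sum(weights[name] * similarities[name] for name in weights)


# Toy component similarities for one pair of users (hypothetical values).
behavior_similarity = fuse(
    {"comment": 1.0, "at": 0.0, "retweet": 0.5}, BEHAVIOR_WEIGHTS
)
profile_similarity = fuse({"tag": 0.2, "age": 1.0, "sex": 1.0}, PROFILE_WEIGHTS)
```

In this toy case the behavior similarity is 0.3 × 1.0 + 0.2 × 0.0 + 0.5 × 0.5 = 0.55, which shows how strongly the forwarding component drives the fused value.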
Next, we conduct experiments to find similar users according to the combined personal description information, relationship information, behavior information, interest information, and rating information. In the experiment, the Top-N values were taken from 5 to 40, and the average MAP value of the experimental users is shown in Figure
The MAP results of each attribute.
As can be seen from the figure, as Top-N grows, finding similar users according to social information performs better than the other methods. In contrast, the MAP value calculated from the rating information is low, so it is not suitable to find similar users based only on this information. The behavior information performs better when the value of Top-N is lower.
The experimental data can show that, due to the sparse user ratings, implicit ratings are not enough to give reasonable similar users. Similar users based on relationship information and behavior information are better than rating information. According to the idea of fusion and through a large amount of data calculation, a similar user calculation formula
The comparison of final experiment result.
The experimental data provided a record of 845,727 user ratings of items involving 3,775 related items. According to the method of finding neighbor users in Section
The proportion of rating matrix sparsity.
| Rating matrix | Proportion of nonzero entries before processing | Proportion of nonzero entries after processing |
|---|---|---|
| The original rating matrix | 0.044 | 0.397 |
| The positive-ratings rating matrix | 0.018 | 0.174 |
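The proportion of nonzero entries reported above is a simple density measure. As an illustration, the sketch below applies it to the two-user example matrices given earlier in the paper (the initial and the processed ratings); the function names are hypothetical.

```python
from fractions import Fraction as F


def nonzero_proportion(matrix):
    """Share of matrix cells that hold a nonzero rating."""
    total = sum(len(row) for row in matrix)
    nonzero = sum(1 for row in matrix for value in row if value != 0)
    return nonzero / total if total else 0.0


def positive_only(matrix):
    """Keep positive ratings and reset every other cell to 0."""
    return [[value if value > 0 else 0 for value in row] for row in matrix]


# Example matrices from the initial and processed rating tables.
initial = [[1, 1, 0, 0, 0], [0, 0, 1, 0, 0]]
processed = [
    [F(3, 2), F(4, 3), F(1, 6), F(1, 6), F(1, 2)],
    [F(1, 4), 0, 1, F(1, 4), F(1, 2)],
]
before = nonzero_proportion(initial)
after = nonzero_proportion(processed)
```

On this small example the proportion of nonzero entries rises from 0.3 to 0.9, the same direction of change as in the full experiment.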
According to the processed rating matrix, we repredict the current user’s ratings of the item according to the formula in Section
(1) Recommending directly without matrix sparsity processing.
(2) Recommending after matrix sparsity processing.
(3) Only considering the positive evaluation of the user, getting the rating matrix, and then performing the recommendation after matrix sparsity processing. Figure
The experimental results of item recommendation.
According to the experimental results, we can get the following:
(1) Recommendation after sparsity processing of the rating matrix performs better than recommendation without it. When the value of Top-k is 5, the two are basically the same; as Top-k grows, the gap widens and is relatively obvious when Top-k is between 30 and 45. According to the result, when the matrix sparsity processing is not performed, the items with more original ratings of “1” cannot be selected as the recommendation result.
(2) Negative ratings have a certain influence on the recommendation effect. The ratings include positive ratings with a value greater than zero and negative ratings with a value less than zero, with a value of zero indicating that the user has not rated. In the data, we found that the proportion of negative ratings is larger than that of positive ratings. After the rating vector processing, more of the rating values are expressed as negative numbers. In order to reflect the impact of negative ratings, we removed the negative ratings, considered only the positive ratings, and conducted the recommendation experiments again. As can be seen from Figure
In this paper, based on the data set of KDD CUP 2012 track1, we study the users’ social information and propose a new collaborative filtering method based on social information fusion. Experiments show that our new method outperforms the method based on user ratings of items and that social information fusion is a useful feature for recommender systems. In the future, we will study how to extract more effective social information fusion features using deep learning methods.
The data is a public dataset that anyone can download from https://www.kaggle.com/c/kddcup2012-track1. In Section
The authors declare that they have no conflicts of interest.
This work was supported by the National Natural Science Foundation of China (61672040) and the North China University of Technology Startup Fund.