This paper addresses the problems of similarity calculation in traditional nearest-neighbor collaborative filtering recommendation algorithms, especially their failure to describe dynamic user preferences. To tackle user interest drift, a new hybrid similarity calculation model is proposed. The model consists of two parts: the first uses function fitting to describe users’ rating behaviors and rating preferences, and the second employs the Random Forest algorithm to take user attribute features into account. The two parts are then combined into a hybrid similarity calculation model for user recommendation. Experimental results show that, for data sets of different sizes, the model’s prediction precision is higher than that of traditional recommendation algorithms.
Traditional collaborative filtering (CF) algorithms usually calculate the similarity between users or items from the user-item rating matrix; based on the calculated similarity, they select the nearest neighbors and compute predicted scores to generate recommendation lists. The similarity calculation therefore determines the precision and quality of the recommendations produced by heuristic CF algorithms. However, traditional heuristic CF algorithms suffer from a range of problems in similarity calculation, such as the failure to detect changes in user interest: because they compute similarity directly from rating statistics, they consider only user ratings and centered ratings, and ignore other factors that influence rating, such as user attributes, time weight, and user rating habits.
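The heuristic pipeline described above (similarity, nearest neighbors, predicted score) can be sketched as follows. This is a minimal illustration using the standard Pearson correlation and a mean-centered weighted average; the toy ratings, the function names, and the choice of `k` are assumptions made for the example and are not taken from the paper.

```python
from math import sqrt

# Toy user-item ratings (hypothetical data): user -> {item: rating}
ratings = {
    "u1": {"i1": 5, "i2": 3, "i3": 4},
    "u2": {"i1": 4, "i2": 2, "i3": 5, "i4": 4},
    "u3": {"i1": 1, "i2": 5, "i4": 2},
}


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def pearson(u, v):
    """Pearson correlation computed over the items co-rated by users u and v."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0  # too few co-rated items to correlate
    mu = mean(ratings[u][i] for i in common)
    mv = mean(ratings[v][i] for i in common)
    num = sum((ratings[u][i] - mu) * (ratings[v][i] - mv) for i in common)
    den_u = sqrt(sum((ratings[u][i] - mu) ** 2 for i in common))
    den_v = sqrt(sum((ratings[v][i] - mv) ** 2 for i in common))
    den = den_u * den_v
    return num / den if den else 0.0


def predict(u, item, k=2):
    """Mean-centered weighted average over the k most similar users who rated item."""
    neighbors = [(pearson(u, v), v) for v in ratings if v != u and item in ratings[v]]
    neighbors = sorted(neighbors, reverse=True)[:k]
    base = mean(ratings[u].values())
    norm = sum(abs(s) for s, _ in neighbors)
    if norm == 0:
        return base  # no informative neighbor: fall back to the user's mean
    offset = sum(s * (ratings[v][item] - mean(ratings[v].values())) for s, v in neighbors)
    return base + offset / norm


if __name__ == "__main__":
    print(pearson("u1", "u2"), predict("u1", "i4"))
```

Note that the similarity here depends only on the co-rated values, which is exactly the limitation the paragraph points out: rating time, user attributes, and rating habits play no role.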
In order to solve the problems of similarity calculation in traditional heuristic CF recommendation and improve its performance, Luo et al. [
Starting from different perspectives, the studies above aimed to strengthen the association between users and items in order to improve the similarity between users or items and obtain the optimal nearest neighbor set, thereby improving recommendation accuracy and quality. However, when strengthening this association, additional factors that affect it can also be taken into account, such as the demographic characteristics of users and the time decay caused by the time effect of ratings. Considering user attribute features is particularly effective when dealing with the user cold start problem.
Therefore, this paper proposes a new similarity calculation method, the RITUA algorithm. The RITUA algorithm consists of two parts. The first is the user rating-interest similarity, which considers the similarity of user ratings and interests as well as how both change under the constraints of rating time and the confidence coefficient between users. The second is the user attribute similarity, which takes into account the influence of user attribute features on recommendation and is calculated after the weight of each attribute feature has been obtained. Finally, the RITUA algorithm combines the two parts linearly. The experimental results show that the proposed algorithm obtains better prediction accuracy than the traditional methods.
Although recommender systems have been studied extensively in recent years, some common problems remain, such as data sparsity, cold start, and user interest drift. To deal with these problems and improve recommendation precision and accuracy, researchers have taken many aspects into account, including basic user attribute features and the time and place at which user behavior occurs, and corresponding lines of research have emerged.
The Demographic Recommender System (DRS) is an important class of recommender systems. Demographic characteristics can be used to identify a user’s type and preferences; the system can sort users according to their attribute features and generate recommendations from the sorting results. DRS plays a strong supporting role in dealing with user cold start and data sparsity. Many existing studies have shown that user attribute features can improve recommendation accuracy. Luo et al. [
As recommender systems research has deepened, many researchers have begun to incorporate contextual information in order to obtain better recommendations and improve recommendation quality. Among the kinds of contextual information, time information is relatively easy to collect, and it is of significant value for research on improving the temporal diversity of recommender systems, which has become a hot topic in current studies [
For relatively sparse data, and with the aim of solving the problem of user interest drift, this paper proposes the RITUA algorithm. It builds on the traditional similarity calculation and introduces factors that influence users’ rating behaviors, such as user attribute characteristics and the time decay of ratings. The RITUA algorithm consists of two parts: the rating-interest similarity and the user attribute similarity.
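Since the two parts are combined linearly, the overall structure can be sketched as a weighted sum. The function below is only an illustration: the name `hybrid_similarity`, the parameter names, and the default weights are assumptions, and whether the paper constrains the two weights to sum to 1 is not assumed here.

```python
def hybrid_similarity(sim_rating_interest, sim_attribute, alpha=0.5, beta=0.5):
    """Linear blend of the rating-interest similarity and the attribute similarity.

    alpha and beta are illustrative defaults, not the paper's tuned values.
    """
    if alpha < 0 or beta < 0:
        raise ValueError("weights must be non-negative")
    return alpha * sim_rating_interest + beta * sim_attribute


if __name__ == "__main__":
    # Hypothetical partial similarities for one pair of users.
    print(hybrid_similarity(0.8, 0.4, alpha=0.7, beta=0.3))
```

Keeping the two partial similarities separate until this last step makes it straightforward to search over the weights experimentally.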
The rating-interest similarity is composed of the rating similarity and the interest similarity, which mainly consider two aspects: users’ preferences for items and users’ rating habits. On top of these two aspects, the effect of the time decay of ratings is introduced, and the confidence coefficient between users is introduced in combination with the fluctuation factor proposed in literature [
In e-commerce systems, rating or voting is generally used to obtain a user’s direct preference for items. Assume that the degree of a user’s preference for items is classified into 5 levels, which is
User-item rating matrix.
          Item 1   Item 2   Item 3   Item 4
User 1
User 2
User 3
User 4
Table
Equation (
Every user has their own rating habits. For instance, some easygoing users who do not fuss over trifles always tend to give high scores, while some rigorous users who pay close attention to details are likely to give low scores; because they are stricter in scoring, they do not give high scores easily. Hence, describing user rating habits helps to improve prediction accuracy. For the user rating habits and the inherent attributes of the item, Koren [
Therefore, within the rating range for items, a user who tends to score highly usually gives a high score to an object he/she likes and, even for an object he/she does not like, will not give a low score; the converse holds for a user who tends to score low. The average score given by a user therefore reflects his/her interest and rating habits. Similarly, based on literature [
Equation (
Generally speaking, treating user behaviors that occurred at different times equally means that the effect of time is never quantified. The time factor reflects the tendency of user interest to drift: the closer a rating is to the present time, the more it contributes to the recommendation, and vice versa. On this basis, some studies have used linear and nonlinear functions to quantify how rating behaviors change over time.
In the literature [
Changes of Ebbinghaus forgetting curve.
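To make the idea of time decay concrete, the sketch below uses an exponential weight, a form commonly associated with the Ebbinghaus forgetting curve. The paper's own fitted function is not reproduced here; the function name `time_weight`, the day-based time unit, and the `decay_rate` value are assumptions for illustration only.

```python
from math import exp


def time_weight(rating_time, now, decay_rate=0.01):
    """Exponential decay weight in (0, 1].

    rating_time and now are timestamps in days (assumed unit);
    decay_rate is a hypothetical tuning parameter.
    """
    elapsed = max(0.0, now - rating_time)  # guard against future timestamps
    return exp(-decay_rate * elapsed)


if __name__ == "__main__":
    now = 100.0
    for t in (0.0, 50.0, 90.0, 100.0):
        print(t, round(time_weight(t, now), 4))
```

A rating made today keeps full weight, and older ratings contribute progressively less, so recent behavior dominates the similarity when interest drifts.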
When the user data is extremely sparse and the number of co-rated items is very small, chance plays a large role in the similarity calculation. Li et al. [
Equation (
After the confidence coefficient is taken into account, the adjusted equation for the user rating-interest similarity becomes:
Considering the similarity of user attributes can improve prediction accuracy on the one hand and alleviate the new-user cold start problem on the other; that is, when no rating data is available, user attribute features can be used to build models and give recommendations. As for the description of the similarity of user attributes, literature [
For single user attribute, it is expressed as
It indicates that when user
In (
Sections
In (
The description of RITUA similarity algorithm is in Algorithm
Algorithm
Input:
Algorithm
Therefore, from the description in Algorithm
Considering the openness and authority of the data sets, and because our simulation experiment is based on the rating matrix, we chose two data sets, Movielens100k and Netflix, for experimental analysis and comparison. The process is as follows.
The Movielens100k data set is a film rating data set provided by GroupLens Research. It contains 100,000 ratings from 943 users for 1682 movies, where each user has rated at least 20 movies, and the rating interval is
User-item rating matrix.
                      Movielens100k   Netflix
Users                 943             4861
Items                 1682            5080
Ratings               100,000         387,939
Rating scale
Sparseness of data
Changing tendency of the number of user-rated items (descending order). Panels: ML100K data set; Netflix data set.
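As a worked illustration of how sparse the two rating matrices are, the sketch below applies the common definition of sparsity, 1 − ratings / (users × items), to the user, item, and rating counts reported for each data set. The definition is an assumption here (the paper's own formula is not shown), and the function name is illustrative.

```python
def sparsity(num_users, num_items, num_ratings):
    """Fraction of empty cells in the user-item rating matrix."""
    return 1.0 - num_ratings / (num_users * num_items)


# Counts as reported in the text for each data set.
ml100k = sparsity(943, 1682, 100_000)
netflix = sparsity(4861, 5080, 387_939)

if __name__ == "__main__":
    print(f"ML100k: {ml100k:.4f}, Netflix: {netflix:.4f}")
```

Under this definition both matrices are more than 90% empty, with the cleaned Netflix subset being the sparser of the two.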
In the ML100k data set, there are only 4 user attribute features: gender, age, occupation, and zip code.
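To show how such attributes could enter a weighted attribute similarity, the sketch below scores each attribute by exact match and combines the matches with per-attribute weights. This is an assumed, simplified form and not the paper's equation: the weights are hypothetical placeholders (not Random Forest outputs), and bucketing age into an `age_group` field is likewise an assumption.

```python
def attribute_similarity(user_a, user_b, weights):
    """Weighted share of attributes on which the two users agree (0 to 1)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    matched = sum(w for attr, w in weights.items() if user_a[attr] == user_b[attr])
    return matched / total


# Hypothetical weights and user profiles, for illustration only.
weights = {"gender": 0.1, "age_group": 0.4, "occupation": 0.3, "zip_code": 0.2}
alice = {"gender": "F", "age_group": "20-29", "occupation": "student", "zip_code": "55105"}
bob = {"gender": "M", "age_group": "20-29", "occupation": "student", "zip_code": "10003"}

if __name__ == "__main__":
    print(attribute_similarity(alice, bob, weights))
```

Because this score needs no ratings at all, it remains available for a brand-new user, which is why attribute similarity helps with cold start.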
The Netflix data set is a subset of the original Netflix Game data. After proper data cleaning, the data set contains 387,939 ratings from 4861 users for 5080 objects, where each user has rated at least 20 objects, and the rating interval is
The sparseness of the data set is
The Netflix data set contains no user attribute feature data. Therefore, during cleaning, this paper randomly generates three user attributes for Netflix by simulation, following the features of the user attribute data in ML100k: gender, age, and occupation. The range of the age attribute is
Generally speaking, metrics such as MAE (mean absolute error) and RMSE (root mean squared error) are used to evaluate prediction precision in recommender systems. After comparison, RMSE is used as the evaluation metric in this paper. The equation is
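Under the standard definitions, RMSE is the square root of the mean squared difference between predicted and actual ratings, and MAE is the mean absolute difference. A minimal Python sketch of both follows; the function and argument names are illustrative and are not taken from the paper.

```python
from math import sqrt


def rmse(predicted, actual):
    """Root mean squared error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, actual)) / len(actual))


def mae(predicted, actual):
    """Mean absolute error between two equal-length rating sequences."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Hypothetical predictions against held-out ratings.
    print(rmse([3.5, 4.0, 2.0], [4, 4, 1]), mae([3.5, 4.0, 2.0], [4, 4, 1]))
```

Because the errors are squared before averaging, RMSE penalizes large individual errors more heavily than MAE; a lower value means better prediction precision for both.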
From (
Random forests are an ensemble learning method that can analyze feature data with complicated interactions. The method is robust even under a certain amount of data noise, and it is efficient in feature learning and analysis. Its variable importance measure can serve as a feature selection tool for high-dimensional data. In recent years, it has been widely used in various kinds of prediction, feature selection, and outlier detection [
Therefore, we obtain the weight value of each user attribute feature with the Random Forest algorithm on the ML100k and Netflix data sets. The experimental results are shown in Figures
Ranking of the weight value of user attributes feature (ML100k).
Ranking of the weight value of user attributes feature (Netflix).
On ML100k data set, from Figure
The illustration parts of Figures
In order to test the relatively optimal weight values of each attribute in (age, gender, occupation, and zip code) on ML100k and in (age, gender, and occupation) on Netflix, we carry out several sets of comparative experiments, and the experimental results are shown in Figures
Experiment comparison of weight values of different user attributes (ML100k).
Experiment comparison of weight values of different user attributes (Netflix).
According to (
Experimental results with different alpha and beta (ML100k).
Experimental results with different alpha and beta (Netflix).
From results shown by Figures
In order to verify the validity of the algorithm proposed in this paper, we compare it with other similarity measures, including the Pearson similarity, the adjusted cosine similarity (Acosine), the PIP [
Experimental comparison of different similarity algorithms (ML100k).
Experimental comparison of different similarity algorithms (Netflix).
From Figure
Based on ML100k data set, the paper chooses 20%, 40%, 60%, and 80% of the data set, respectively. Neighbors
Comparison of results produced by different algorithms on data sets of different sizes (ML100k).
From Figure
Aiming at some problems in traditional similarity calculation, this paper proposes a new similarity calculation model. The model describes user rating preference, user rating habits, and the time factor. Furthermore, user attribute features are taken into account for their influence on user ratings, the role each attribute feature plays in recommendation is studied, and the Random Forest algorithm is used to calculate the weight value of each attribute. The final experimental results show that, compared with other similarity measures, the proposed approach improves recommendation precision significantly and still performs better on sparse data. A limitation of the experiments is that the data sets contain relatively few user attributes, so the calculated feature weights of the user attributes do not differ markedly; part of the user attribute data is private and not easy to obtain, which inevitably limits the experiments.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This project is supported by the Fundamental Research Funds for the Central Universities of Central South University (no. 2017zzts623) and by the Hunan Provincial 2011 Collaborative Innovation Center for Development and Utilization of the Financial and Economic Big Data Property.