Collaborative Filtering Recommendation Algorithm Based on User Attributes and Item Score

To solve the cold start and data sparseness problems of the traditional collaborative filtering recommendation algorithm, a collaborative filtering recommendation algorithm based on user attributes and item scoring is proposed. Firstly, to improve the credibility of user similarity and explore the potential interests of users, a new user rating similarity calculation method is constructed by introducing confidence, item popularity, and Pearson weighting. Secondly, a user attribute similarity measurement method is constructed by introducing cultural distance, age attribute similarity, and user label similarity. Finally, user rating similarity and user attribute similarity are weighted to form a new similarity measurement model. Through a simulation comparison between the proposed collaborative filtering recommendation algorithm and the traditional recommendation algorithm, our results show that the proposed algorithm can effectively improve the accuracy and diversity of recommendations and effectively alleviate the problem of data sparseness.


Introduction
With the popularization of the Internet and the development of big data, the problem of information overload is becoming increasingly prominent. Resource recommendation, as an effective method of information filtering [1,2], has high business value and research significance. The collaborative filtering recommendation algorithm [3,4] is the most commonly used and successful method in recommendation systems. However, it still has a series of problems in specific applications, such as cold start [5], data sparseness [6,7], and scalability [8]. Many researchers at home and abroad have conducted in-depth and fruitful studies on data sparseness, using different methods to improve recommendation quality to different degrees. Gao et al. [9] proposed two kinds of penalty algorithms for popular items to solve the problem of inaccurate similarity caused by popular items. Kaleli [10] used score similarity and score uncertainty difference information to search for target users with matching preferences. Suryakant and Mahara [11] linearly combined the cosine similarity, the Jaccard coefficient, and other similarity calculation results to improve the prediction accuracy. The above algorithms focus on users' evaluations of items. Although they improve the recommendation accuracy to a certain extent, they neglect the influence of user attribute information on the recommendation results; this influence is mainly reflected in the fact that different users with similar attributes may have similar interests and preferences.
Authors in [12] designed indicators such as similarity of interest tendency and confidence and proposed a recommendation algorithm based on user rating and attribute similarity, which alleviated the problem of data sparseness. Authors in [13] combined user attributes and confidence to calculate similarity, which alleviated the data sparseness problem to a certain extent. Authors in [14] proposed a collaborative filtering algorithm based on user attributes to solve the problems of the traditional algorithm. The above studies ignored the influence of unpopular items on recommendation quality, making it difficult to discover the potential interests of users.
To ease these problems, this paper proposes a recommendation algorithm called the user attributes and item score collaborative filtering recommendation algorithm (UAI-CF). Our main contributions can be summarized as follows: (1) The introduction of item popularity into the weighted Pearson correlation coefficient improves the recommendation probability of unpopular items and excavates the potential interests of users, and an improved confidence calculation method raises the credibility of inter-user similarity and the recommendation accuracy. (2) Cultural distance is introduced to construct the user attribute model, and a collaborative filtering algorithm based on user attributes and item ratings is proposed to effectively alleviate the data sparseness problem. (3) Simulation experiments on the MovieLens dataset show that, compared with the traditional algorithm, the proposed algorithm can effectively improve the accuracy and coverage of recommendations.

Classical Collaborative Filtering Algorithm.
Classical collaborative filtering algorithms are divided into user-based methods, item-based methods, and model-based methods [15][16][17]. At present, user-based collaborative filtering has been widely studied, and the specific steps are divided into user modeling, finding the nearest neighbor, and prediction and recommendation.

User Modeling.
Firstly, the user-item scoring matrix R = {r_ij}_{m×n} is established using the users' rating information for the items. r_ij is the rating of user i for item j, and the rating value is proportional to the degree of the user's interest preference. Then, according to the users' attribute information, the user attribute matrix A = {a_uv}_{m×s} is established. Each value of the matrix is 1 or 0, representing whether the user has the corresponding attribute.
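The user modeling step above can be sketched in Python as follows; the function and variable names are illustrative, not from the paper.

```python
# Sketch of user modeling: build the user-item rating matrix R (m x n)
# and the binary user attribute matrix A (m x s) from raw tuples.
def build_matrices(m, n, s, ratings, attributes):
    """ratings: (user, item, score) tuples; attributes: (user, attr) tuples."""
    R = [[0] * n for _ in range(m)]   # r_ij = 0 means "not yet rated"
    A = [[0] * s for _ in range(m)]   # a_uv = 1 iff user u has attribute v
    for u, i, score in ratings:
        R[u][i] = score
    for u, v in attributes:
        A[u][v] = 1
    return R, A
```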

Find the Nearest Neighbor Set.
Use the similarity calculation method to compute the similarity between users, and generate the nearest neighbor set of users with the highest similarity to the target user.
There are three common similarity calculation methods: cosine similarity, modified cosine similarity, and Pearson similarity [18]. The Pearson similarity is calculated as

Sim(i, j) = \frac{\sum_{x \in I_{ij}} (r_{ix} - \bar{r}_i)(r_{jx} - \bar{r}_j)}{\sqrt{\sum_{x \in I_{ij}} (r_{ix} - \bar{r}_i)^2} \sqrt{\sum_{x \in I_{ij}} (r_{jx} - \bar{r}_j)^2}},   (1)

where i and j represent users, x represents an item, I_ij represents the set of items rated jointly by users i and j, r_ix represents the rating of user i on item x, and \bar{r}_i is the average rating of user i over all items.
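A minimal Python sketch of the Pearson similarity over co-rated items, following the symbol definitions above (ratings are kept as item-to-score dictionaries for illustration):

```python
def pearson_sim(ri, rj):
    """Pearson similarity between users i and j over jointly rated items.
    ri, rj: dicts mapping item -> rating; returns 0.0 when undefined."""
    common = set(ri) & set(rj)            # I_ij: jointly rated items
    if not common:
        return 0.0
    mi = sum(ri.values()) / len(ri)       # mean rating of user i over all items
    mj = sum(rj.values()) / len(rj)
    num = sum((ri[x] - mi) * (rj[x] - mj) for x in common)
    di = sum((ri[x] - mi) ** 2 for x in common) ** 0.5
    dj = sum((rj[x] - mj) ** 2 for x in common) ** 0.5
    if di == 0 or dj == 0:
        return 0.0
    return num / (di * dj)
```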

Forecast Rating.
According to the user similarity, the nearest neighbor set is constructed. Formula (2) is used to predict the target user's rating of an unevaluated item and make recommendations:

P_{ix} = \bar{r}_i + \frac{\sum_{j \in N_i} Sim(i, j)(r_{jx} - \bar{r}_j)}{\sum_{j \in N_i} |Sim(i, j)|},   (2)

where P_ix represents the predicted score of user i on item x, N_i represents the set of the nearest neighbors of user i, and Sim(i, j) represents the similarity between users i and j.
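The mean-centered weighted prediction can be sketched as below; the `(sim, ratings, mean)` neighbor tuples are an illustrative data layout, not the paper's.

```python
def predict(ri_mean, neighbors, item):
    """Neighbor-weighted, mean-centered prediction of user i's score on `item`.
    neighbors: list of (sim, rj_dict, rj_mean) tuples for each j in N_i."""
    num = den = 0.0
    for sim, rj, rj_mean in neighbors:
        if item in rj:                      # neighbor j has rated the item
            num += sim * (rj[item] - rj_mean)
            den += abs(sim)
    return ri_mean if den == 0 else ri_mean + num / den
```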

Pearson Weighting Coefficient.
The weighting coefficient is derived from the Pearson correlation coefficient, and the measurement method is shown as formula (3), where r_ix and r_jx, respectively, represent the ratings of users i and j on a jointly evaluated item x, and r_iy and r_jy, respectively, represent the ratings of users i and j on a separately evaluated item y.

Confidence Degree.
Confidence is used to measure the credibility of the similarity between users and is usually measured by the Jaccard function:

S(i, j) = \frac{|I_i \cap I_j|}{|I_i \cup I_j|},   (4)

where I_i represents the collection of items evaluated by user i, |I_i ∩ I_j| represents the number of items rated jointly by the two users, and |I_i ∪ I_j| represents the number of items in the union of both users' evaluated items. The confidence ranges from 0 to 1.
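The Jaccard confidence of formula (4) in a few lines of Python:

```python
def jaccard_confidence(Ii, Ij):
    """S(i, j) = |Ii ∩ Ij| / |Ii ∪ Ij|, in [0, 1].
    Ii, Ij: iterables of item ids rated by users i and j."""
    Ii, Ij = set(Ii), set(Ij)
    union = Ii | Ij
    return len(Ii & Ij) / len(union) if union else 0.0
```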

Item Popularity.
The popularity degree measures how popular an item is. A larger value indicates that the item has been evaluated fewer times, so the target users may have more potential interest in it. The calculation is shown as formula (5), where N(x) represents the number of times item x has been evaluated, and the popularity of an item lies between 0 and 1.
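The paper's exact formula (5) did not survive extraction; the sketch below is only one plausible inverse-log form that matches the stated properties (value in (0, 1], larger when the item has fewer ratings) and is labeled as an assumption.

```python
import math

def popularity_weight(n_x):
    """Assumed sketch of an item-popularity weight C(x): inverse log of the
    rating count n_x = N(x). Returns 1.0 for a once-rated item and decreases
    as the item becomes more popular. NOT the paper's exact formula (5)."""
    return 1.0 / math.log2(1 + n_x)
```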

Cultural Distance.
Since geographic location information is a distinctive factor in resource recommendation, and culture is a collective mental programming formed by groups in different regions that distinguishes them from other groups, cultural distance is useful here. Cultural distance refers to the degree of similarity or difference between two cultures and is a relatively standardized method for quantifying cultural difference. Generally, cultural distance is used to represent the degree of cultural difference between countries [19], measured as

CD(o, p) = \frac{1}{n} \sum_{q=1}^{n} \frac{(d_{qo} - d_{qp})^2}{V_q},   (6)

where CD(o, p) represents the cultural distance between countries o and p, q indexes the cultural dimensions, d_qo represents the value of country o in the qth dimension, V_q represents the variance in the qth dimension, and n represents the number of cultural dimensions. Hofstede's four cultural dimensions are used: power distance, uncertainty avoidance, individualism/collectivism, and masculinity/femininity, which determine the theoretical value range of CD.
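A direct Python transcription of the Kogut-Singh-style cultural distance defined above, with countries represented as lists of dimension scores:

```python
def cultural_distance(do, dp, variances):
    """CD(o, p) = (1/n) * sum_q ((d_qo - d_qp)^2 / V_q).
    do, dp: per-dimension scores of countries o and p;
    variances: per-dimension variance V_q across all countries."""
    n = len(do)
    return sum((do[q] - dp[q]) ** 2 / variances[q] for q in range(n)) / n
```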

An Approach Based on User Attributes and Item Ratings
3.1. Symbol Definition. In the recommendation system, it is assumed that there are m users and n items, and each user has at most s attributes. The users' evaluations of items are represented by the matrix R_{m×n}, and the users' attribute information is represented by the matrix A_{m×s}. The user attribute matrix A is analyzed to determine whether a user is a new user. If so, the similarity between users is calculated using the user attribute similarity Sim_Attr; otherwise, the user-item rating matrix R is analyzed, and the similarity is calculated using Sim, which is based on both user attribute and item scoring similarity. According to the similarity between users, the nearest neighbor set N_i is obtained, the values P_ix of the target user's unrated items are predicted, and the final item recommendation scheme is obtained from the predicted values.
For convenient reading, the important symbols are explained in Table 1.

Similarity Calculation Based on Item Ratings.
The popularity is introduced into the weighted Pearson similarity calculation to increase the influence of unpopular items on the recommendation results and fully tap the potential interests of users.

Confidence Calculation.
Although the traditional confidence calculation formula (4) can measure the credibility of the similarity between users, it has some errors. Suppose S(i, j) = S(i, k) and each pair has jointly rated only one item: because of differences in rating habits, users i and j may give that item exactly the same rating while users i and k give it merely similar ratings. The similarity supported by identical ratings should be more credible than that supported by merely similar ratings, which contradicts the equal confidence assigned to the two cases.
Considering that, when the proportion of jointly evaluated items is the same, the greater the proportion of items with the same rating, the more credible the similarity between users, a new confidence measure function is constructed by introducing the ratio of the number of items with the same rating to the number of jointly rated items. The calculation is shown as formula (7), where F{·} is an indicator function {true, false} → {1, 0} and Σ_{x∈I_ij} F{r_ix = r_jx} represents the number of items rated identically by users i and j in their common item set.
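The exact combination in formula (7) is not shown in this extraction; the sketch below assumes one natural reading of the text, scaling the Jaccard confidence by the same-rating ratio.

```python
def improved_confidence(ri, rj):
    """Assumed sketch of the improved confidence IS(i, j): Jaccard confidence
    multiplied by the fraction of co-rated items with identical ratings.
    ri, rj: dicts mapping item -> rating. NOT the paper's exact formula (7)."""
    Ii, Ij = set(ri), set(rj)
    common = Ii & Ij
    if not common:
        return 0.0
    jaccard = len(common) / len(Ii | Ij)
    same_ratio = sum(1 for x in common if ri[x] == rj[x]) / len(common)
    return jaccard * same_ratio
```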

Similarity Calculation Based on Item Popularity.
Popular items have abundant user evaluation information, which cannot reflect the differences in users' interests. On the contrary, unpopular items, with less evaluation information, can better reflect users' interest preferences. Therefore, the influence of unpopular items on recommendation results should be increased. Formula (5) is introduced into the weighted Pearson similarity calculation formula (1) to obtain the improved similarity measurement formula (8), where W(i, j) is the Pearson weighting coefficient and C(x) represents the popularity of item x. The larger C(x) is, the greater the contribution of the unpopular item to the similarity, so the influence of unpopular items on the recommendation results is increased.

3.2.3. The Score Similarity Fusing Confidence and Popularity.
According to the modified similarity calculation method and the confidence, the user similarity measurement formula (9) based on user ratings is obtained. This formula can improve the accuracy of user rating similarity to a certain extent and can tap users' potential interests and hobbies.

Table 1: Explanation of important symbols.

Symbol | Meaning
C(x) | The popularity of the item
CD(o, p) | The cultural distance between countries
IS(i, j) | Improved confidence
Sim_CP(i, j) | Pearson similarity with the introduction of popularity
Sim_R(i, j) | Rating similarity of users
Sim_Age(i, j) | Age similarity of users
Sim_La(i, j) | Location similarity of users
Sim_L(i, j) | Tag similarity of users
Sim_Attr(i, j) | Attribute similarity of users
Sim(i, j) | Similarity based on user attributes and ratings
P_ix | The user's predicted score for the item

Similarity Calculation of User Attributes.
A user profile generally includes age, location, gender, occupation, and other basic attribute information, which to a certain extent reflects the user's interest preferences. Users with similar attributes may have similar interest preferences.

Age Similarity of Users.
Users of different ages have different interests and hobbies. For example, teenagers like comic books, middle-aged people like books on success, and elderly people enjoy classic books. The smaller the age gap between users is, the more similar their interests and hobbies are; when the age gap exceeds a certain range, the impact of age on similarity becomes smaller [15]. The user age similarity is calculated as formula (10), where a_i,age and a_j,age represent the ages of the users and α is the threshold value dividing the age range, whose specific value is obtained experimentally.

Location Similarity of Users.
Users' interests are affected by cultural groups and hierarchies. Users in different countries form distinctive groups owing to differences in region, race, religion, language, and other factors, and cultural distance can be used to measure the difference between such users. The cultural distances between user location attributes are normalized to obtain the user location similarity function (11), where o_i represents the country of user i and CD_min and CD_max, respectively, represent the minimum and maximum cultural distances in all the data.
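Only the normalization is stated in the text; the sketch below assumes the natural min-max form in which a smaller cultural distance yields a higher similarity.

```python
def location_similarity(cd, cd_min, cd_max):
    """Assumed sketch of Sim_La(i, j): min-max normalize the cultural distance
    cd = CD(o_i, o_j) over the dataset range [cd_min, cd_max] and invert it,
    so identical cultures score 1.0. NOT the paper's exact formula (11)."""
    if cd_max == cd_min:
        return 1.0                      # all pairs equally distant
    return 1.0 - (cd - cd_min) / (cd_max - cd_min)
```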

Tag Similarity of Users.
Users of the same gender may have common interests; for example, males may prefer finance and economics programmes while females may prefer soap operas. Users of the same profession generally have common interests and hobbies; for example, painters like paintings and collectors like collections. The similarity of users' tag attributes is calculated as formula (12), where l represents the number of identical tag attributes of users i and j and s represents the total number of user tags. Formula (13) computes the user's overall attribute similarity from the age, location, and tag attribute similarities, where W_a represents the weight of attribute similarity Sim_a, with a value range of [0, 1].
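From the definitions above, the tag similarity reads as l/s and formula (13) as a weighted sum; the sketch below assumes the weights sum to 1.

```python
def tag_similarity(tags_i, tags_j, s):
    """Sim_L(i, j) = l / s: shared tag attributes over the total tag count s."""
    return len(set(tags_i) & set(tags_j)) / s

def attribute_similarity(sims, weights):
    """Sketch of formula (13): weighted sum of the age, location, and tag
    similarities; the weights W_a are assumed to lie in [0, 1] and sum to 1."""
    return sum(w * sim for w, sim in zip(weights, sims))
```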

Similarity Calculation Combining User Attributes and Item Ratings.
The improved similarity calculation combines the similarity based on item ratings with the similarity based on user attributes. Firstly, the similarity based on item scores is calculated. Secondly, the similarity based on user attributes is calculated. Then, the two similarities are fused as in formula (14). The algorithm can alleviate the problems of cold start and data sparseness. For new users, the similarity between users is calculated using the user attribute matrix, which effectively alleviates the cold start problem caused by new users entering the system. As the amount of user rating data increases, user attributes are fused with user rating similarity to strengthen the similarity between users, alleviating the problem of low inter-user similarity caused by data sparseness.

Complexity Analysis.
Algorithm 1, used in this study to calculate the scoring similarity, has the same complexity as the traditional similarity algorithm (Pearson similarity); beyond that, each pair of users requires only 3 weightings and 2 additions for the attribute similarity, so the extra computation costs very little. When the user rating similarity is calculated, the weighted correlation coefficient formula is multiplied by the improved confidence and the popularity.

Dataset.
The classic MovieLens dataset was used for the experiment; this movie evaluation dataset is provided by the GroupLens Research Group of the University of Minnesota. It contains 100,000 ratings of 1682 movies from 943 users, where each rating is an integer between 1 and 5, the rating value is proportional to the degree of the user's interest and preference, and the sparseness of the dataset is 93.7%. The ML-100K dataset provides 5 training sets with corresponding test sets, from which one pair is randomly selected for the simulation experiments.

Evaluation Indicators.
To better evaluate the algorithm, MAE, precision, recall, and coverage are used as measurement indicators.

Mean Absolute Error (MAE).
Since the prediction accuracy is inversely proportional to the mean absolute error (MAE), the following formula is used to measure the recommendation accuracy:

MAE = \frac{1}{|T|} \sum_{(i, x) \in T} |p_{ix} - r_{ix}|,

where T is the test set, p_ix is the predicted score of user i on item x calculated by formula (2), and r_ix is the user's true score for the item.

Precision.
Precision indicates how many of the recommended items are correctly recommended; a higher value means higher recommendation accuracy. It is calculated as

Precision = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |R(u)|},

where R(u) is the set of items recommended to user u and T(u) is the set of the user's actual items in the test set.

Recall.
Recall indicates how many of the relevant items are successfully recommended; a higher value indicates a more complete recommendation result. It is calculated as

Recall = \frac{\sum_{u} |R(u) \cap T(u)|}{\sum_{u} |T(u)|}.

Coverage.
Coverage describes the long-tail mining ability of the recommendation system, that is, how broadly the recommended items span the catalogue. The larger the coverage value is, the more evenly the recommended items are distributed. It is calculated as

Coverage = \frac{|\bigcup_{u} R(u)|}{n},

where R(u) is the set of items recommended to user u and n is the total number of items in the dataset.
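The four evaluation indicators above can be computed with a short Python sketch; the pooled (micro-averaged) form of precision and recall is an implementation choice, since the paper does not state its averaging.

```python
def mae(pred, true):
    """Mean absolute error over aligned predicted/true score lists."""
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(pred)

def precision_recall(recommended, actual):
    """Pooled precision = |R ∩ T| / |R| and recall = |R ∩ T| / |T|,
    where recommended/actual are per-user item lists."""
    hit = sum(len(set(r) & set(t)) for r, t in zip(recommended, actual))
    rec = sum(len(r) for r in recommended)
    act = sum(len(t) for t in actual)
    return hit / rec, hit / act

def coverage(recommended, n_items):
    """Fraction of the n_items catalogue appearing in any recommendation list."""
    return len(set().union(*recommended)) / n_items
```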

Result Analysis.
The threshold α is set to each of the values 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10 in turn, and the corresponding MAE values are calculated. On the ML-100K dataset, the MAE is minimal when the value is 5; therefore, the threshold is set to 5. The final similarity calculation method in this study, recorded as UAI-CF, is compared with the traditional collaborative filtering algorithm (Pearson) and the collaborative filtering algorithm based on user features and similarity confidence (UAS-CF) proposed in [15]. The three algorithms are compared on the four aspects of mean absolute error, precision, recall, and coverage to verify their recommendation quality.
Algorithm 1.
Input: target user, user-item rating matrix, and user attribute matrix.
Output: predicted score matrix of target users.
Step 1: establish user-item rating matrix by using user rating information. Calculate the confidence and similarity based on popularity degree by using formulas (7) and (8).
Step 2: integrate the confidence degree and similarity based on popularity degree, and obtain the user rating similarity by using formula (9).
Step 3: calculate age similarity, location similarity, and user tag similarity between users by using user attribute information.
Step 4: substitute the obtained age similarity, location similarity, and user tag similarity into formula (13) to obtain the final similarity of user attributes.
Step 5: calculate the item rating similarity and the final similarity of user attributes by using formula (14) to obtain the similarity of the final user and the nearest neighbor set of the target user.
Step 6: use formula (2) to predict the value of the unrated items of the target user, and get the predicted rating matrix according to the predicted value.
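Steps 5 and 6 above hinge on fusing the two similarities and selecting the nearest neighbors; a hedged sketch follows, where the convex-combination weight `lam` is an assumption since the fusion weights of formula (14) are not shown.

```python
def fused_similarity(sim_r, sim_attr, lam=0.5):
    """Assumed sketch of formula (14): convex combination of the rating-based
    similarity Sim_R and the attribute-based similarity Sim_Attr."""
    return lam * sim_r + (1 - lam) * sim_attr

def nearest_neighbors(target, sims, k):
    """The k users most similar to `target` (excluding the target itself).
    sims: dict mapping user id -> fused similarity with the target."""
    ranked = sorted((u for u in sims if u != target),
                    key=lambda u: sims[u], reverse=True)
    return ranked[:k]
```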

Scientific Programming
To compare the MAE of the proposed algorithm with that of the traditional collaborative filtering algorithm (Pearson similarity) and UAS-CF, the number of neighbors of the target user is taken as the variable and increased from 5 to 50 in steps of 5. The change in the MAE values of the three algorithms is shown in Figure 1. It can be seen from Figure 1 that, as the number of neighbor users increases, the MAE of the Pearson algorithm decreases gradually, while the MAE of UAS-CF and UAI-CF first increases and then levels off.
This is because some recommended movies are not liked by the target users; once the number of neighbors grows beyond a certain point, the recommended movies are primarily determined by the users most similar to the target users. The MAE of UAI-CF takes its minimum at the smallest neighbor number, indicating that the algorithm in this study can alleviate the cold start problem. The changes in the precision values of the three algorithms are shown in Figure 2: as the number of neighbor users increases, the precision of the three algorithms decreases gradually, and the UAI-CF algorithm outperforms the other two. Figure 3 shows the change in the recall values of the three algorithms as the number of neighbors increases: the recall of the three algorithms increases gradually with the number of neighbors, and the UAI-CF algorithm is better than the other two. Figure 4 shows that the coverage of the three algorithms increases with the number of neighbor users, because more neighbors bring more candidate movies to recommend. Compared with [15] and the traditional algorithm, the coverage of the UAI-CF algorithm clearly increases faster, so it is superior to the other two algorithms in terms of the long-tail effect.
To sum up, the improved UAI-CF algorithm in this study is better than the comparison algorithms on all four aspects of mean absolute error, precision, recall, and coverage and is more practical.

Conclusion and Prospect
This study proposes a recommendation algorithm based on user attributes and item ratings, which improves the traditional collaborative filtering algorithm. In the proposed model, user evaluation information on items is collected, item popularity and inter-user confidence are considered, and the similarity between users is calculated by integrating user attributes. The obtained results validate the proposal: it improves the recommendation accuracy to a certain extent and mitigates the effects of cold start and data sparseness on the recommendation results. As future work, we will try to integrate item attributes and introduce a more accurate model training method to further explore the relationship between users and items and improve the quality of recommendations.

Data Availability
The data used to support the findings of this study are available from the corresponding website: https://grouplens.org/datasets/movielens/100k/.

Conflicts of Interest
The authors declare that they have no conflicts of interest.