S²NMF: Information Self-Enhancement Self-Supervised Nonnegative Matrix Factorization for Recommendation

Nonnegative matrix factorization (NMF), which constrains all elements of the factors to be nonnegative while achieving nonlinear dimensionality reduction, is an effective method for solving recommendation system problems. However, the recommendation performance of NMF models relies heavily on initialization, and the user-item interaction information is often very sparse. Moreover, in many real-world applications, most models learn recommendation models under the supervised learning paradigm, while supervised information about the data is often difficult to obtain, which makes a large number of existing supervised learning models inapplicable. To address this problem, we propose an information self-enhancement self-supervised NMF model (S²NMF) for recommendation. Specifically, the model builds on the matrix factorization idea and introduces a self-supervised learning mechanism into the NMF model to enhance the information in sparse data, yielding an easily extensible self-supervised NMF model. Furthermore, a corresponding gradient descent optimization algorithm is proposed, and the complexity of the algorithm is analysed. Extensive experimental results show that the proposed S²NMF outperforms the compared methods.


Introduction
In the age of information explosion, information overload has become a central issue faced by society. Recommender systems play a vital role in solving this problem, as they help determine what information to provide to individual consumers and allow online users to quickly find personalized information that suits their needs [1]. Recently, recommender systems have become ubiquitous on e-commerce platforms, such as Amazon for book recommendations, http://Last.com/ for music recommendations, Netflix for movie recommendations, and CiteULike for references.
The main recommendation methods include collaborative filtering recommendation [2,3], content-based recommendation [4], knowledge-based recommendation [5], and social network-based recommendation [6]. Collaborative filtering recommendation generally adopts the nearest neighbor technique: it calculates the distance between users from their historical preference information, predicts the degree to which a target user prefers a specific product from the weighted ratings of the nearest neighbor users, and makes recommendations to the target user according to this preference degree. The greatest advantage of collaborative filtering is that it has no special requirements on the recommended objects and can deal with unstructured complex objects, such as music and movies. Content-based recommendation is the continuation and development of information filtering technology. It makes recommendations based on the content information of the items, without relying on users' comments on the items, and it typically uses machine learning methods to extract user interest information from the feature description of the content. Knowledge-based recommendation can be regarded as a kind of inference technology, which is not based on the needs and preferences of users. Knowledge-based approaches differ markedly depending on the functional knowledge they use. Social network-based recommendation has previously been mostly a domain-based approach. First, the social network of raters is explored, and the scores of raters are aggregated to calculate the predicted scores. Then, the raters' neighbors are found.
Learning high-quality user and item representations from interaction data is the core idea of collaborative recommendation. In early studies, such as matrix factorization (MF) [7,8], a single ID of each user (or item) is projected into an embedding vector. Subsequent studies [9] enriched single IDs with interaction histories to learn better representations. In particular, nonnegative matrix factorization (NMF) [10,11], a well-known dimensionality reduction method in data representation, has also been successfully applied to recommender system problems in recent years [12,13]. Although NMF can be used for any nonnegative rating matrix (e.g., ratings from 1 to 5), its greatest interpretability advantage arises when users have a mechanism to specify a liking for an item but no mechanism to specify a dislike. Such matrices include unary rating matrices or matrices in which nonnegative entries correspond to activity frequencies. These datasets are also referred to as implicit feedback datasets.
However, the NMF model is essentially a nonconvex optimization problem, and its sensitivity to initialization is unavoidable; i.e., the recommendation performance of the NMF model depends heavily on the initialization, and a poor initialization matrix can significantly degrade the recommendation performance. In addition, a general recommendation system uses only historical user-item interaction information (explicit or implicit feedback) as input, which poses two problems. First, in real-world scenarios, information about user-item interactions is often sparse. For example, a movie app may contain tens of thousands of movies, yet the average number of movies rated by a user may be only a few dozen. Using such a small amount of observed data to predict a large amount of unknown information can greatly increase the risk of overfitting. Second, for newly added users or items, the system does not have their historical interaction information, so it cannot accurately model and recommend them. This situation is called the cold start problem.
Moreover, most existing models learn recommendation models in a supervised learning paradigm [14][15][16], where the supervised signals are derived from observed user-item interactions. However, the observed interactions are extremely sparse compared to the entire interaction space [8,17], which makes it insufficient to learn quality representations. Moreover, in many cases, supervised information about the data is difficult to obtain, making a large number of existing models for supervised learning inapplicable.
Accordingly, this paper introduces a self-supervised learning mechanism based on the matrix factorization idea and the NMF model, and proposes an easily extensible self-supervised nonnegative matrix factorization recommendation framework, S²NMF, together with a corresponding gradient descent optimization algorithm whose complexity is analysed. Extensive experimental results show that the proposed S²NMF outperforms the comparison algorithms. The main contributions can be summarized as follows:
(i) Based on the idea of matrix factorization, a self-supervised learning mechanism is introduced on the basis of the NMF model to realize the information enhancement of sparse data
(ii) A self-supervised nonnegative matrix factorization recommendation model, S²NMF, is proposed, together with a corresponding gradient descent optimization algorithm, and the complexity of the algorithm is analysed
(iii) Extensive experimental results demonstrate that the proposed S²NMF achieves superior recommendation performance in comparison with the comparison algorithms
The rest of this paper is organized as follows. Section 2 defines the recommendation problem and the notation. Section 3 presents the classical NMF model, the proposed S²NMF, and the corresponding optimization algorithm. Section 4 analyses the performance of S²NMF, including the experimental results and the parameter sensitivity. Finally, Section 5 concludes this work.

Materials and Methods
This section provides a detailed description of the proposed model S²NMF and its optimization algorithm, and gives the pseudocode of the optimization algorithm and its time complexity analysis.
2.1. Problem Definition. Suppose there are $M$ users $U = \{u_1, u_2, \cdots, u_M\}$ and $N$ items $V = \{v_1, v_2, \cdots, v_N\}$. Let the scoring matrix be $R \in \mathbb{R}^{M \times N}$, where $R_{ij}$ is the rating of user $i$ for item $j$. If the rating is unknown, we set $R_{ij} = \text{unk}$.
There are usually two ways to construct the user-item interaction matrix $Y \in \mathbb{R}^{M \times N}$:

$$Y_{ij} = \begin{cases} 1, & \text{if } R_{ij} \neq \text{unk}, \\ 0, & \text{otherwise}, \end{cases} \tag{1}$$

$$Y_{ij} = \begin{cases} R_{ij}, & \text{if } R_{ij} \neq \text{unk}, \\ 0, & \text{otherwise}. \end{cases} \tag{2}$$

Wireless Communications and Mobile Computing
Based on literature [18], most researchers treat all observed ratings as equal to 1 and choose Formula (1) to construct the interaction matrix $Y$. In this paper, we choose Formula (2) to construct the interaction matrix $Y$, so that the rating $R_{ij}$ of user $u_i$ for item $v_j$ is retained in $Y$. Explicit ratings are complicated to use directly for recommendation. Here, we express the user's degree of preference for an item by Formula (2) and, following the implicit feedback convention, mark each unknown rating as 0, meaning that no preference has been expressed. Usually, recommendation is formulated as the problem of estimating the rating of each unobserved entry in $Y$.
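As a minimal sketch of the two constructions referred to as Formula (1) and Formula (2), the following NumPy snippet builds $Y$ from a small rating matrix. The function name `build_interaction_matrix`, the `keep_ratings` flag, and the use of `np.nan` to encode the unknown rating `unk` are illustrative assumptions and are not taken from the original description.

```python
import numpy as np


def build_interaction_matrix(R, keep_ratings=True):
    """Build the interaction matrix Y from a rating matrix R.

    Assumption: unknown ratings ("unk") are encoded as np.nan in R.
    keep_ratings=False follows Formula (1): every observed rating becomes 1.
    keep_ratings=True follows Formula (2): observed ratings are kept as-is.
    In both cases unknown ratings are marked as 0.
    """
    observed = ~np.isnan(R)
    if keep_ratings:
        return np.where(observed, R, 0.0)
    return observed.astype(float)


# Toy example: 2 users, 3 items, three observed ratings.
R = np.array([[5.0, np.nan, 3.0],
              [np.nan, 1.0, np.nan]])
Y_binary = build_interaction_matrix(R, keep_ratings=False)
Y_rating = build_interaction_matrix(R, keep_ratings=True)
```

Under this encoding the two variants differ only in whether the magnitude of an observed rating is preserved; the sparsity pattern of $Y$ is identical.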
To better formalize the mathematical process of this work, the detailed notation is shown in Table 1. The tasks of a recommendation system can be divided into three types: scoring prediction, top-N prediction, and click prediction. The input and output of the proposed S²NMF model framework are summarized as follows.
(i) Input. The observed user-item interaction matrix $R$
(ii) Output. The predicted user-item interaction matrix $\hat{R}$

Model Framework
3.1. Classical NMF Model. Nonnegative matrix factorization (NMF), proposed by Lee and Seung in Nature in 1999 [19], can achieve nonlinear data dimensionality reduction and has strong interpretability. With the extensive attention of researchers, NMF has gradually become a mature and reliable multidimensional data processing model that is widely used in research fields such as recommendation systems, pattern recognition, signal processing, computer vision, and network science [20]. In addition, it can reveal latent feature-to-feature relationships quite accurately and can also be used for related tasks, such as node importance identification [21,22], link prediction [23][24][25], and evolutionary analysis [26,27]. In recent years, many researchers have applied NMF to recommender systems [14,17,28], which effectively improves the accuracy and efficiency of personalized recommendation results. Normally, the user-item data are represented as a matrix $R \in \mathbb{R}^{M \times N}$. The matrix $R$ can represent the interaction characteristics of users and items, such as a rating matrix or a click-through rate matrix. NMF decomposes the matrix $R$ into two nonnegative matrices and optimizes them iteratively such that $R \approx WH^T$, where $W \in \mathbb{R}_+^{M \times K}$, $H \in \mathbb{R}_+^{N \times K}$, and $K$ is the predetermined number of hidden features. Usually, $W$ denotes the basis matrix, while $H$ denotes the data in the reduced feature space, also called the combined coefficient matrix of the basis. In fact, $W_{ik}$ can denote the probability that user $i$ likes topic $k$, and $H_{jk}$ can denote the propensity of topic $k$ to include item $j$. So there is reason to believe that $\hat{R}_{ij} = \sum_k W_{ik} H_{jk}$ can represent the probability that user $i$ likes item $j$. Then, how can $\hat{R} = WH^T$ be made as close to $R$ as possible?
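This involves the construction of the NMF model and the optimization process of solving it. The goal is to make $WH^T$ as close to $R$ as possible, and it may be assumed that the residual $R_{ij} - \sum_k W_{ik} H_{jk}$ obeys a Gaussian distribution with mean 0 and variance $\sigma^2$:

$$R_{ij} - \sum_k W_{ik} H_{jk} \sim \mathcal{N}(0, \sigma^2). \tag{3}$$

Assuming that the residuals $R_{ij} - \sum_k W_{ik} H_{jk}$ are independently and identically distributed, the likelihood function can be obtained from the Gaussian probability density function as

$$L(W, H) = \prod_{i,j} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left( R_{ij} - \sum_k W_{ik} H_{jk} \right)^2}{2\sigma^2} \right). \tag{4}$$

Maximizing the likelihood can be transformed into maximizing the log-likelihood, which is

$$\ln L(W, H) = -\frac{1}{2\sigma^2} \sum_{i,j} \left( R_{ij} - \sum_k W_{ik} H_{jk} \right)^2 + c. \tag{5}$$

In Equation (5), $c$ denotes a constant. Since $\sigma$ is also a constant, maximizing the log-likelihood is equivalent to minimizing the squared Euclidean distance

$$\min_{W \geq 0,\, H \geq 0} \left\| R - WH^T \right\|_F^2. \tag{6}$$

Similarly, if $R_{ij}$ is assumed to obey a Poisson distribution with mean $\sum_k W_{ik} H_{jk}$, its log-likelihood function is

$$\ln L(W, H) = \sum_{i,j} \left( R_{ij} \ln (WH^T)_{ij} - (WH^T)_{ij} - \ln R_{ij}! \right). \tag{7}$$

This corresponds to the K-L divergence, also called the K-L distance. Maximizing the log-likelihood then translates into minimizing the K-L distance, which can be expressed as

$$\min_{W \geq 0,\, H \geq 0} \sum_{i,j} \left( R_{ij} \ln \frac{R_{ij}}{(WH^T)_{ij}} - R_{ij} + (WH^T)_{ij} \right). \tag{8}$$

To optimize Equations (6) and (8), [15] proposed corresponding update rules based on the gradient descent approach. For Equation (6), the update rule is

$$W_{ik} \leftarrow W_{ik} \frac{(RH)_{ik}}{(WH^TH)_{ik}}, \qquad H_{jk} \leftarrow H_{jk} \frac{(R^TW)_{jk}}{(HW^TW)_{jk}}. \tag{9}$$

For Equation (8), the update rule is

$$W_{ik} \leftarrow W_{ik} \frac{\sum_j H_{jk} R_{ij} / (WH^T)_{ij}}{\sum_j H_{jk}}, \qquad H_{jk} \leftarrow H_{jk} \frac{\sum_i W_{ik} R_{ij} / (WH^T)_{ij}}{\sum_i W_{ik}}. \tag{10}$$

By iteratively applying the update rules in Equations (9) and (10), locally optimal $W$ and $H$ are obtained. Then, by reconstructing $R$ from $W$ and $H$, we obtain the completed matrix $\hat{R}$, which is

$$\hat{R} = WH^T. \tag{11}$$

There are many variants of NMF methods [20], among which the most commonly used is symmetric NMF (SymNMF) [29,30]. SymNMF decomposes the observation matrix $A$ into two identical matrices, $A \approx HH^T$. SymNMF inherits the advantages of NMF; in addition, the observed matrix $A$ can encode the similarity between data points, and the model has fewer parameters. In 2013, Wang and Zhang [20] performed a systematic review of various extensions of NMF and classified NMF methods into the following: basic NMF, constrained NMF, structured NMF, and generalized NMF. In recent years, NMF-related models have been widely used by many researchers for graph image processing [31][32][33], complex network analysis [21,34,35], and recommendation systems.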
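The multiplicative update scheme for the squared-error NMF objective discussed in this subsection can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function names, the random initialization, the fixed iteration count, and the small constant `eps` that guards against division by zero are all assumptions made here.

```python
import numpy as np


def nmf_multiplicative(R, K, n_iter=200, eps=1e-10, seed=0):
    """Approximate R (M x N, nonnegative) by W @ H.T with W, H >= 0.

    Uses multiplicative updates for the squared Frobenius error.
    eps is an assumed numerical safeguard, not part of the update rule.
    """
    rng = np.random.default_rng(seed)
    M, N = R.shape
    W = rng.random((M, K)) + eps   # random nonnegative initialization
    H = rng.random((N, K)) + eps
    for _ in range(n_iter):
        # W_ik <- W_ik * (R H)_ik / (W H^T H)_ik
        W *= (R @ H) / (W @ (H.T @ H) + eps)
        # H_jk <- H_jk * (R^T W)_jk / (H W^T W)_jk
        H *= (R.T @ W) / (H @ (W.T @ W) + eps)
    return W, H


def frobenius_error(R, W, H):
    """Squared Frobenius reconstruction error ||R - W H^T||_F^2."""
    return float(np.linalg.norm(R - W @ H.T, "fro") ** 2)


# Toy usage: factorize a small random nonnegative matrix with K = 2.
R_demo = np.random.default_rng(1).random((6, 5))
W_demo, H_demo = nmf_multiplicative(R_demo, K=2)
R_hat = W_demo @ H_demo.T  # reconstructed (completed) matrix
```

Because the factors are only rescaled by nonnegative ratios, nonnegativity is preserved automatically; different values of `seed` generally lead to different local solutions, which is the initialization sensitivity noted in the text.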

S²NMF Model.
The proposed S²NMF model framework is shown in Figure 1. First, the super similarity matrix $S$ is constructed by taking the scoring matrix $R$ as input. Second, NMF is repeated $B$ times with random initializations, and the $B$ resulting dimensionality reduction representations are resolved. Third, $B$ indicator matrices are obtained by the resolving strategy, and a new super similarity matrix $S$ is obtained by combining and reconstructing them. The above randomized matrix factorization process is repeated until convergence to obtain the predicted scoring matrix. The S²NMF model framework proposed in this paper is an intelligent recommendation model with self-enhancement of information, which can be built on different types of NMF models and fuses self-supervised information. Due to space limitations, this paper takes the classical NMF [19] as an example to introduce it in detail.
As mentioned earlier, NMF requires solving a nonconvex optimization problem that is sensitive to the initialization of variables. We therefore propose self-supervised NMF (S²NMF), the details of which are shown in Figure 1. By exploiting this sensitivity of NMF, the model can gradually improve the recommendation performance without relying on any additional information. First, based on the classical NMF model, $R$ is decomposed into two nonnegative matrices $W^0$ and $H^0$. Following the NMF basics introduced in the previous subsection, we assume that the errors $R_{ij} - (W^0 {H^0}^T)_{ij}$ obey a Gaussian distribution. Then, the model optimization problem at time $t$ can be constructed as

$$\min_{W^0 \geq 0,\, H^0 \geq 0} \left\| R - W^0 {H^0}^T \right\|_F^2, \tag{12}$$

where $R \in \{0, 1\}^{M \times N}$ is the scoring matrix, $W^0$ is the $M \times K$ basis matrix, and $H^0$ is the $N \times K$ combined coefficient matrix. Since the NMF factorization has some randomness, the factorization operation is repeated randomly $B$ times in this paper, where $K$ represents the number of hidden features and the number of communities. In terms of physical meaning, this means that users can be clustered into $K$ groups. The factorization results are then resolved to obtain the community indication matrix for each user.
Considering that the community indication matrix $C_b$ is more discriminative than the scoring matrix $R$, this paper constructs a super similarity matrix $S$ from the $B$ community indication matrices, where $\alpha_m$ is the $m$-th element of the weight vector $\alpha \in \mathbb{R}^{B \times 1}$. This weight vector is mainly used to balance the contribution of each community partition. The obtained super similarity matrix $S$ can be resolved as a recommendation indicator matrix. Again, using the super similarity matrix $S$ as input, a new community indication matrix can be obtained by the NMF model. This process is repeated until the stopping criterion is met or the maximum number of iterations is reached, in order to obtain better recommendation results. We represent the above process as a constrained optimization model:

$$\min \sum_{b=1}^{B} \alpha_b \left\| S - W_b H_b^T \right\|_F^2 \quad \text{s.t.} \quad W_b \geq 0,\ H_b \geq 0,\ \alpha^T \mathbf{1} = 1,\ \alpha \geq 0, \tag{15}$$

where $\mathbf{1} \in \mathbb{R}^{B \times 1}$ denotes the all-ones vector. Clearly, when minimizing Equation (15), a better pair $W_b$ and $H_b$ results in a smaller $\| S - W_b H_b^T \|_F^2$, and accordingly, a larger weighting factor $\alpha_b$ is assigned. Thus, the value of $\alpha_b$ can measure the quality of $W_b$ and $H_b$, and by resolving $H_b$, $S$ can be constructed. Note that $\alpha$ is also subject to a nonnegativity constraint.
Equation (15) imposes an implicit weighted $L_2$ norm on $\alpha$. This may lead to a rather sparse solution when optimizing Equation (15); that is, most elements of $\alpha$ are equal to or close to zero. Since our goal is to combine the contributions of multiple clusterings, an extremely sparse $\alpha$ is not a desirable choice. For this reason, a hyperparameter $\tau$ is introduced to control the distribution of the entries of $\alpha$, and the final model is rewritten as

$$\min \sum_{b=1}^{B} \alpha_b^{\tau} \left\| S - W_b H_b^T \right\|_F^2 \quad \text{s.t.} \quad W_b \geq 0,\ H_b \geq 0,\ \alpha^T \mathbf{1} = 1,\ \alpha \geq 0, \tag{16}$$

where $\tau \in (1, +\infty)$. When $\tau$ is close to 1, only a few elements of $\alpha$ are effectively nonzero. When $\tau$ tends to $+\infty$, minimizing Equation (16) assigns nearly equal weights to all elements of $\alpha$. Therefore, $\tau$ should be neither too large nor too small. In this paper, we empirically set $\tau$ to 2.
By solving the equation, the final community indication matrices $C_b$ ($b \in [1, B]$) can be obtained. Meanwhile, a better super similarity matrix $S$ and contribution vector $\alpha$ are determined.

Model Optimization Algorithm.
To solve the objective Equation (16), an alternating iteration strategy is proposed in this paper. First, with $S$ fixed and multiple random nonnegative initialization matrices $W^0$ and $H^0$, solving the objective equation for $W$, $H$, and $\alpha$, we obtain the following:

$$\min_{W_b,\, H_b,\, \alpha} \sum_{b=1}^{B} \alpha_b^{\tau} \left\| S - W_b H_b^T \right\|_F^2 \quad \text{s.t.} \quad W_b \geq 0,\ H_b \geq 0,\ \alpha^T \mathbf{1} = 1,\ \alpha \geq 0. \tag{17}$$

In this paper, a simple and effective criterion for the adaptive termination of Algorithm 1 is proposed, and the pseudocode is given in Algorithm 1. It is reasonable to assume that in the first few iterations, the community detection of all partitions gradually improves and the consensus between them also increases. When the maximum consensus is reached, the consensus among them remains at a high level and may even decrease or fluctuate due to the randomness of variable initialization in the iterations. Therefore, we use the degree of agreement between the different partitions to construct the stopping criterion.

Figure 1: Schematic of S²NMF model.
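For the objective equation of the S²NMF model (see Equation (17)), the derivation of the update rule is similar to that of NMF and can be found in [15]. Since this objective equation has multiple parameters $W$, $H$, and $\alpha$ to be optimized, it is a nonconvex optimization problem in these parameters jointly.
Based on the gradient descent approach, each parameter can only be optimized while the other parameters are fixed. Similar to Equation (9), it is easy to obtain the update rules for $W_{ik}$ and $H_{jk}$ of the $b$-th factorization as

$$(W_b)_{ik} \leftarrow (W_b)_{ik} \frac{(S H_b)_{ik}}{(W_b H_b^T H_b)_{ik}}, \tag{18}$$

$$(H_b)_{jk} \leftarrow (H_b)_{jk} \frac{(S^T W_b)_{jk}}{(H_b W_b^T W_b)_{jk}}. \tag{19}$$

Unlike the classical NMF, S²NMF must also optimize $\alpha$. Fixing the parameters $W$ and $H$, the objective equation can be rewritten as

$$\min_{\alpha} \sum_{b=1}^{B} \alpha_b^{\tau} e_b \quad \text{s.t.} \quad \sum_{b=1}^{B} \alpha_b = 1,\ \alpha_b \geq 0, \tag{20}$$

where $e_b = \| S - W_b H_b^T \|_F^2$. Then, the Lagrangian function of Equation (20) can be expressed as

$$\mathcal{L}(\alpha, \lambda) = \sum_{b=1}^{B} \alpha_b^{\tau} e_b - \lambda \left( \sum_{b=1}^{B} \alpha_b - 1 \right). \tag{21}$$

Taking the first-order partial derivative with respect to $\alpha_b$ and setting it to 0 yields the following:

$$\alpha_b = \left( \frac{\lambda}{\tau e_b} \right)^{\frac{1}{\tau - 1}}. \tag{22}$$

Since $\sum_{b=1}^{B} \alpha_b = 1$, $\lambda$ can be expressed as

$$\lambda = \left( \sum_{m=1}^{B} (\tau e_m)^{\frac{1}{1 - \tau}} \right)^{1 - \tau}. \tag{23}$$

Then, substituting $\lambda$ into Equation (22), we obtain $\alpha$ as

$$\alpha_b = \frac{e_b^{\frac{1}{1 - \tau}}}{\sum_{m=1}^{B} e_m^{\frac{1}{1 - \tau}}}. \tag{24}$$

Clearly, the numerator and denominator of Equation (24) are greater than 0. Hence $\alpha_b > 0$ always holds, which satisfies the nonnegative constraint on $\alpha_b$. The solution in Equation (24) satisfies the KKT (Karush-Kuhn-Tucker) conditions of Equation (20). Moreover, since Equation (20) is a convex problem, Equation (24) is a globally optimal solution of Equation (20).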
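To make the role of $\tau$ in the weight update concrete, the following sketch computes the weights under the assumption that the $\alpha$-subproblem has the form $\min_\alpha \sum_b \alpha_b^{\tau} e_b$ subject to $\sum_b \alpha_b = 1$ and $\alpha_b \geq 0$, whose stationary point is $\alpha_b \propto e_b^{1/(1-\tau)}$. Here $e_b$ stands for the reconstruction error of the $b$-th factorization; the symbol `errors` and the function name `combination_weights` are hypothetical and introduced only for this illustration.

```python
def combination_weights(errors, tau=2.0):
    """Closed-form weights alpha_b proportional to e_b ** (1 / (1 - tau)).

    errors: positive reconstruction errors e_b, one per factorization
            (assumed strictly positive so the power is well defined).
    tau:    hyperparameter in (1, +inf); the paper sets it to 2.
    """
    if tau <= 1.0:
        raise ValueError("tau must be greater than 1")
    if any(e <= 0 for e in errors):
        raise ValueError("errors must be strictly positive")
    power = 1.0 / (1.0 - tau)
    scores = [e ** power for e in errors]
    total = sum(scores)
    return [s / total for s in scores]


# With tau = 2 the weights are proportional to 1 / e_b,
# so the factorization with the smallest error gets the largest weight.
weights = combination_weights([1.0, 2.0, 4.0], tau=2.0)

# A very large tau pushes the weights toward the uniform distribution.
nearly_uniform = combination_weights([1.0, 2.0, 4.0], tau=1000.0)
```

The two calls illustrate the behaviour described for $\tau$: values near 1 concentrate the weight on the best factorization, while large values flatten the weights.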
In summary, the detailed optimization process of the S²NMF objective Equation (16) is summarized in Algorithm 1, where the pseudocode is given. The algorithm stops iterating if the maximum change of the variables between two adjacent iterations is less than 0.001.

Algorithm 1: Optimization algorithm of S²NMF objective Equation (16).

The computational complexity of constructing $S$ is $O(MNK)$. Therefore, the complexity of each iteration of Algorithm 1 is $O(MNKBN_{\text{iter}})$.
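The threshold-based stopping rule mentioned above can be sketched as follows. This is one possible reading of the rule, assuming that the "maximum change of variables" means the largest absolute entrywise difference across all tracked matrices between two consecutive iterations; the helper names `max_change` and `has_converged` are hypothetical.

```python
import numpy as np


def max_change(prev_vars, curr_vars):
    """Largest absolute entrywise change across all tracked variables."""
    return max(float(np.max(np.abs(curr - prev)))
               for prev, curr in zip(prev_vars, curr_vars))


def has_converged(prev_vars, curr_vars, tol=1e-3):
    """Stop when the maximum change between adjacent iterations is below tol."""
    return max_change(prev_vars, curr_vars) < tol


# Toy check with a single 2 x 2 variable.
previous = [np.zeros((2, 2))]
small_step = [np.full((2, 2), 5e-4)]                 # change below 0.001
large_step = [np.array([[0.0, 0.0], [0.0, 2e-3]])]   # one entry moves by 0.002
stop_small = has_converged(previous, small_step)
stop_large = has_converged(previous, large_step)
```

In an alternating scheme such a check would typically be evaluated once per outer iteration, after all factor matrices and the weight vector have been updated.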

Results and Discussion
To verify the effectiveness of the S²NMF recommendation model proposed in this paper, this section presents comparison experiments on several standard data sets, covering the experimental setup, the experimental results, and a discussion of parameter sensitivity. The computer configuration used in our experiments is as follows: CPU: I5 6500, graphics card: Sotai 1600, and memory: 16 GB.

Comparison Algorithm
(i) ItemPop. This method ranks items by their popularity, measured by the number of interactions they have. It is a nonpersonalized method, and its performance is usually used as a benchmark for personalized methods
(ii) ItemKNN [36]. This is a standard item-based collaborative filtering method, used commercially by Amazon
(iii) BPR [37]. This is a generic personalized ranking recommendation algorithm derived from a Bayesian analysis of the problem, based on the maximum a posteriori estimate
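As an illustration of the simplest baseline in the list above, a popularity ranking can be sketched in a few lines. The input format (a list of user-item pairs), the alphabetical tie-breaking rule, and the function name `item_pop_ranking` are assumptions made for this example; the baseline implementation used in the experiments may differ, for instance by filtering out items a user has already interacted with.

```python
from collections import Counter


def item_pop_ranking(interactions, top_n=10):
    """Rank items by how many interactions they have (nonpersonalized).

    interactions: iterable of (user_id, item_id) pairs.
    Ties are broken by item id so the ranking is deterministic.
    """
    counts = Counter(item for _, item in interactions)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return ranked[:top_n]


# Toy interaction log: item "a" has 3 interactions, "c" has 2, "b" has 1.
log = [("u1", "a"), ("u2", "a"), ("u1", "b"),
       ("u3", "c"), ("u2", "c"), ("u3", "a")]
top_items = item_pop_ranking(log, top_n=2)
```

Every user receives the same list, which is why this method serves as a reference point for personalized models.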

Evaluation Indicators.
To comprehensively evaluate the effectiveness of the proposed model, the experiments in this section use five evaluation metrics: Recall, MRR, NDCG, Hit, and Precision.
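For reference, the five metrics can be computed for a single user's top-$k$ list as in the sketch below. It uses common textbook definitions with binary relevance (precision divided by $k$, MRR as the reciprocal rank of the first hit, NDCG with a $\log_2$ discount); these definitions and the function name `topk_metrics` are assumptions, since the exact formulas used in the experiments are not stated here.

```python
import math


def topk_metrics(ranked_items, relevant_items, k=10):
    """Recall, MRR, NDCG, Hit, and Precision at k for one user.

    ranked_items:   recommended items, best first.
    relevant_items: held-out items the user actually interacted with.
    Binary relevance is assumed throughout.
    """
    relevant = set(relevant_items)
    hits = [1 if item in relevant else 0 for item in ranked_items[:k]]
    n_hits = sum(hits)

    precision = n_hits / k
    recall = n_hits / len(relevant) if relevant else 0.0
    hit = 1.0 if n_hits > 0 else 0.0

    # Reciprocal rank of the first relevant item (0 if none is found).
    mrr = 0.0
    for rank, is_hit in enumerate(hits, start=1):
        if is_hit:
            mrr = 1.0 / rank
            break

    # Discounted cumulative gain, normalized by the best achievable value.
    dcg = sum(is_hit / math.log2(rank + 1)
              for rank, is_hit in enumerate(hits, start=1))
    ideal = sum(1.0 / math.log2(rank + 1)
                for rank in range(1, min(len(relevant), k) + 1))
    ndcg = dcg / ideal if ideal > 0 else 0.0

    return {"recall": recall, "mrr": mrr, "ndcg": ndcg,
            "hit": hit, "precision": precision}


# Toy example: two of the three relevant items appear in the top 4.
scores = topk_metrics(["a", "b", "c", "d"], {"b", "d", "z"}, k=4)
```

Dataset-level scores are then usually obtained by averaging these per-user values over all test users.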
These metrics examine the recommendation accuracy of the algorithm. The results for the data sets in Table 2, under three comparison methods and five evaluation metrics, are given in this subsection and are discussed and analysed in detail.
In Figure 2, the histogram of comparison results for the data set ML-100K is given. Specifically, the horizontal coordinate represents the type of evaluation metric, and the vertical coordinate represents the values of the five evaluation metrics, calculated by recommending to each user the top 10 items ranked by predicted rating. In addition, the four colours represent the results of the four different models. As seen from the figure, the results of the proposed S²NMF are all higher than those of the other three commonly used benchmark methods.
In Figure 3, a histogram of the comparison results for the data set ML-1M is given. Again, the horizontal coordinate represents the type of evaluation metric, and the vertical coordinate represents the values of the five evaluation metrics, calculated by recommending to each user the top 10 items ranked by predicted rating. The figure shows that the results of the proposed S²NMF are higher than those of the other three commonly used benchmark algorithms.
In Figure 4, a histogram of the comparison results for the Amusic dataset is given. Again, the horizontal coordinate represents the type of evaluation metric, and the vertical coordinate represents the values of the five evaluation metrics, calculated by recommending to each user the top 10 items ranked by predicted rating. The figure shows that the results of the proposed S²NMF are all higher than those of the other three commonly used benchmark algorithms.
In Figure 5, a histogram of the comparison results for the Amovie dataset is given. Again, the horizontal coordinate represents the type of evaluation metric, and the vertical coordinate represents the values of the five evaluation metrics, calculated by recommending to each user the top 10 items ranked by predicted rating. The figure shows that the results of the proposed S²NMF are all higher than those of the other three commonly used benchmark algorithms.
Figure 6 shows the sensitivity analysis of the parameters for the four data sets with respect to Algorithm 1. The horizontal coordinate represents the number of iterations, and the vertical coordinate represents the value of the hit indicator. From the figure, it can be seen that on the four data sets the proposed S²NMF is relatively sensitive to the number of iterations when it is less than 4, while the results change little when it is greater than 4 and become relatively stable as the number of iterations increases. Accordingly, the S²NMF model can use a relatively small number of iterations, namely 4, to effectively reduce the computational cost without affecting the model performance.

Conclusions
Based on the matrix factorization idea, this paper introduced a self-supervised learning mechanism on top of the NMF model to achieve information enhancement of sparse data, proposed an easily extensible self-supervised nonnegative matrix factorization recommendation framework, S²NMF, further proposed a corresponding gradient descent optimization algorithm, and analysed the complexity of the algorithm. Extensive experimental results showed that the proposed S²NMF outperforms the comparison algorithms. In terms of contributions, this paper addresses the sparse data problem of user-item interaction, enhances the interpretability of the recommendation model through the matrix factorization idea, and introduces a self-supervised learning mechanism to realize the information enhancement of sparse data. However, determining the number of hidden features automatically remains an open problem. Likewise, exploring deep hidden features and extending the model to large-scale application scenarios remain open problems of considerable research significance in the field of recommendation systems.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that there is no conflict of interest regarding the publication of this paper.