Collaborative Filtering Recommendation Using Nonnegative Matrix Factorization in GPU-Accelerated Spark Platform

Nonnegative matrix factorization (NMF) has been introduced as an efficient way to reduce the complexity of data compression, with the capability of extracting highly interpretable parts from data sets, and it has been applied to various fields, such as recommendations, image analysis, and text clustering. However, as the size of the matrix increases, the processing speed of nonnegative matrix factorization becomes very slow. To solve this problem, this paper proposes a GPU-based parallel algorithm for NMF on the Spark platform, which makes full use of the advantages of the in-memory computation mode and GPU acceleration. The new GPU-accelerated NMF on the Spark platform is evaluated in a 4-node heterogeneous Spark cluster on Google Compute Engine, with each node configured with an NVIDIA K80 CUDA device, and experimental results indicate that it is competitive in terms of computational time against the existing solutions on a variety of matrix orders. Furthermore, a GPU-accelerated NMF-based parallel collaborative filtering (CF) algorithm is also proposed, utilizing the advantages of data dimensionality reduction and feature extraction of NMF, as well as the multicore parallel computing mode of CUDA. Using real MovieLens data sets, experimental results show that the parallelized NMF-based collaborative filtering on the Spark platform outperforms traditional user-based and item-based CF, with a higher processing speed and higher recommendation accuracy.


Introduction
In recent years, the scale of data has grown exponentially. Globally, it is becoming a trend to research and develop big data technology, use big data to promote economic development, improve social governance, and improve government services and regulatory capabilities. How to effectively extract knowledge from big data, understand and analyze it, and finally make predictions are current popular research topics.
As an important mathematical tool for big data processing, nonnegative matrix factorization is a matrix decomposition approach that decomposes a nonnegative matrix into two low-rank matrices constrained to have nonnegative elements [1,2]. This results in a reduced representation of the original data that can be seen either as a feature extraction or as a dimensionality reduction technique. The widespread usage of NMF is due to its ability to provide new insights and relevant information about the complex latent relationships in experimental data sets.
Since Lee and Seung's Nature paper [1], NMF has been extensively studied and has found a great many applications in science and engineering. It has become an important mathematical method in machine learning and data mining and has been widely used in feature extraction, image analysis [3], audio processing [4], recommendation systems [5,6], pattern recognition, data clustering [7], topic modeling [8], text mining [9], bioinformatics [10], and so forth. Unlike other factorization methods (e.g., PCA, ICA, SVD, and VQ), NMF can be interpreted as a parts-based representation of the data because only additive combinations are allowed. In contrast to PCA and ICA, NMF strictly requires that the entries of both resulting matrices be nonnegative. Such a constraint is very meaningful in many applications in which the data representation is purely additive; for instance, the user ratings of e-commerce websites are usually nonnegative values, and image pixels are nonnegative values. The main problem of NMF is that the original matrix is usually a high-order matrix, which makes the computational complexity very high. Therefore, parallel algorithms for NMF have gradually attracted more attention, and some parallel NMF algorithms have been proposed. Although the parallelization of NMF can improve the computational efficiency to a certain extent, parallel algorithms should be matched to the machine hardware architecture and should have strong scalability, that is, the ability to effectively utilize increased processor resources.
Accelerating HPC applications is currently under extensive research using new hardware technologies such as recent Central Processing Units (CPUs) with multiple processor cores for parallel computing, Graphics Processing Units (GPUs) that process huge data blocks in parallel, and hybrid CPU/GPU computing, which is a very common solution for HPC. GPUs are getting more attention than other HPC accelerators due to their high computation power, strong performance, functionality, and low price. The modern GPU is not only a powerful graphics engine but also a highly parallel programmable processor featuring peak arithmetic and memory bandwidth [11]. GPUs are now used to accelerate graphics and some general applications with high data parallelism (GPGPU) due to the availability of Application Programming Interfaces (APIs), such as Compute Unified Device Architecture (CUDA) and Open Computing Language (OpenCL).
Spark is a distributed in-memory computation framework that was proposed by the AMPLab of the University of California, Berkeley, in 2009 and is based on processing large amounts of data in memory [12,13]. It supports four programming languages: Scala, Java, Python, and R. Resilient Distributed Datasets (RDD) is a new concept proposed by Spark for data collections. RDDs support coarse-grained write operations [14]. Spark caches a particular RDD in memory, and the next operation can read directly from memory. The data is not written to disk, saving a lot of disk I/O overhead. Experimental performance evaluation confirmed that Spark's performance is dozens or even 100 times higher than that of Hadoop, which relies on the MapReduce model [15,16] and stores data in a distributed file system called HDFS rather than in memory.
Currently, some parallel approaches for nonnegative matrix factorization have been proposed, for example, high-performance approaches using message passing interface (MPI) [17], GPU-accelerated approaches [9,18], and Hadoop-based MapReduce approaches [10,19]. These approaches mainly utilize the multicore characteristics of the system, and there is still the potential to improve performance by utilizing memory, CPU, and GPU resources together.
Meanwhile, collaborative filtering is a method of making automatic predictions (filtering) about the interests of a user by collecting preferences from a group of users (collaborating). The underlying assumption of the collaborative filtering approach is that if a person A has the same opinion as a person B on an issue, A is more likely to have B's opinion on a different issue than that of a randomly chosen person.
By calculating the similarity between the preference data of two users, we obtain the similarity between the two users. Although traditional collaborative filtering is extremely successful in recommendation systems, as the data increases, the recommendation algorithms are confronted with various problems, such as scalability problems, cold start problems, and matrix sparseness problems.
In order to solve the above problems, this paper proposes a combined Spark-based and GPU-based acceleration model that takes advantage of both GPU and in-memory computing to obtain a highly scalable parallel NMF algorithm. Furthermore, this paper proposes a collaborative filtering method based on NMF, which is parallelized and migrated to the Spark platform equipped with GPU. This effectively improves the calculation efficiency, so that recommendation results can be updated faster and are more accurate. The experimental results show that the parallelization of NMF-based collaborative filtering effectively improves the calculation efficiency and accuracy. The rest of the paper is organized as follows. Section 2 surveys related work. Section 3 introduces the mathematical fundamentals of NMF and describes the general parallel principle of NMF. Section 4 describes the architecture of the GPU-accelerated Spark platform and presents GPU-accelerated NMF on Spark. Section 5 introduces the collaborative filtering algorithm based on NMF. Section 6 presents performance evaluation results of GPU-accelerated NMF on Spark and NMF-based collaborative filtering, which is followed by conclusions in Section 7.

Hybrid Big Data Processing Model.
Due to the diversity of hardware equipment in high-performance computing systems, mixing different computing modes, such as hybrid CPU/GPU and CPU/FPGA, has become a major direction for dealing with complex real-world applications. From the model point of view, there are the hybrid MapReduce/CUDA model [20], the hybrid MapReduce/MIC model [21], the hybrid OpenMP/MPI model [22], and so forth. In recent years, the practical experience of academia and industry has shown that computing platforms based on heterogeneous CPU/GPU systems have great development potential and have attracted more and more attention [11,23].

Parallel Nonnegative Matrix Factorization.
To handle large data sets through nonnegative matrix factorization, there are three main directions. The first class of algorithms is called online NMF algorithms [6,24,25], which are the oldest approach for dealing with high-dimensional data processing through NMF. The second class is known as distributed NMF algorithms, which distribute data over a network so that several small-scale computations can be performed concurrently. The third class of algorithms is called compressed NMF algorithms [26,27], which perform structured random compression to project the data onto lower-dimensional manifolds. In this paper, we only focus on distributed NMF.

Scientific Programming
Nonnegative matrix factorization is usually solved through alternate iteration [2], which makes it suitable for parallelization. Three aspects that restrict the scalability of the parallel NMF algorithm are listed as follows: synchronization between computing processes, data loading and data transmission, and parallel granularity division. At present, some parallelization algorithms have been proposed to accelerate nonnegative matrix factorization.
Janecek et al. used linear algebra toolkits such as BLAS, LAPACK, and ARPACK to implement multithreaded programs on a single computer to perform efficient NMF [28]. Lopes and Ribeiro implemented a GPU-based machine learning library named GPUMLib, which contains an implementation of GPU-accelerated NMF [29]. Kysenko et al. also applied GPU-accelerated NMF to text mining [9]. Battenberg and Wessel implemented a parallel NMF, using the characteristics of a shared memory multicore system based on OpenMP and a many-core GPU based on CUDA technology, and applied it to audio signal processing, but it can only work for a single node [30]. A parallel NMF based on the combination of MPI and GPU is implemented in [18], and it is used for biological sequence comparison. Tang et al. proposed a hybrid parallel hierarchical NMF algorithm based on OpenMP and MPI [31].
Using high-performance computing software packages, such as ParMETIS, ScaLAPACK, and HPSEPS, one can develop parallel NMF algorithms based on MPI/OpenMP/GPU; however, such algorithms are ultimately not suitable for practical big data processing in the Internet era. It is more suitable to develop a parallel NMF algorithm on top of an open-source big data processing framework such as Hadoop or Spark, so that it fits Internet big data processing. In [10], Liao et al. realized distributed NMF based on MapReduce for biological data processing. Sun et al. realized large-scale NMF based on MapReduce in [32], and Liu et al. also proposed a distributed NMF based on MapReduce for processing large-scale web data using the Hadoop streaming method [19]. In our previous work [33], we proposed a parallel NMF algorithm on the Spark platform, which makes full use of the advantages of the in-memory computation mode.

Parallel and Distributed Collaborative Filtering.
Many e-commerce companies have already incorporated recommendation systems into their services, for example, product recommendations by Amazon (http://www.amazon.com) and Taobao (http://www.taobao.com) and movie recommendations by Netflix (http://www.netflix.com). The implementations and algorithms of collaborative filtering for recommendation systems face several challenges. The first is the size of the processed data sets. The second comes from the sparseness of the rating matrix, which means that each user rates only a relatively small number of items. As the amount and complexity of data increase, collaborative filtering is confronted with the problem of low efficiency. Thus, a highly efficient collaborative filtering algorithm is needed.
On the other hand, these challenges of collaborative filtering have been well taken care of by matrix factorization (MF). Matrix factorization methods have recently received greater exposure as unsupervised learning methods for latent variable decomposition and dimensionality reduction [44]. It is a powerful technique to find the hidden structure behind the data.
To sum up, high-dimensional NMF is time-consuming, and there is an urgent need for high-performance parallelization solutions. At present, there is no flexible distributed processing framework that considers both the memory computing mode and GPU technologies for NMF at the same time. Considering Spark distributed processing framework and combining the powerful computing advantages of GPU and large-capacity memory, large-scale NMF parallel algorithm would enable the algorithm to be easily adapted to the processing of Internet big data. To the best of our knowledge, it is the first NMF-based collaborative filtering implementation that is parallelized and migrated to the Spark platform equipped with GPU. Experimental results validate that the parallelization of NMF-based collaborative filtering on Spark platform effectively improves the calculation efficiency and accuracy.

Nonnegative Matrix Factorization.
Nonnegative matrix factorization seeks to approximate a nonnegative n × m matrix V (in this context, a matrix is called nonnegative if all of its elements are nonnegative) by a product V ≈ WH of nonnegative matrices W and H of dimensions n × r and r × m, respectively, with a given and typically low maximal rank r. Usually, r is chosen to satisfy $r \ll \min\{m, n\}$ such that WH can be thought of as a compressed form of the original data. It forms the basis of unsupervised learning and data reduction algorithms with applications to image recognition, speech recognition, data mining, collaborative filtering, and so forth.

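NMF is able to represent a large input data set as the linear combination of a reduced collection of elements named factors. In this way, W contains the reduced set of r factors, and H stores the coefficients of the linear combination of such factors which rebuilds V. NMF iteratively modifies W and H until their product approximates V. Such modifications, composed of matrix products and other algebraic operations, are derived from minimizing a cost function that describes the distance between WH and V. Lee and Seung presented two NMF algorithms based on multiplicative update rules whose objective functions are Square of Euclidean Distance (SED) and Generalized Kullback-Leibler Divergence (GKLD), respectively [1,2]:
$$\mathrm{SED}(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} - (WH)_{ij} \right)^2, \qquad \mathrm{GKLD}(V \,\|\, WH) = \sum_{i,j} \left( V_{ij} \log \frac{V_{ij}}{(WH)_{ij}} - V_{ij} + (WH)_{ij} \right).$$
Then, the objective of NMF is converted to optimize the following:
$$\min_{W, H} D(V \,\|\, WH) \quad \text{subject to } W \ge 0,\ H \ge 0,$$
where D is either SED or GKLD. In this paper, we define SED as the objective function, so we have $\min \|V - WH\|_F^2$, which leads to the updating rules for matrices H and W:
$$H \leftarrow H \circ \frac{W^T V}{W^T W H}, \tag{3}$$
$$W \leftarrow W \circ \frac{V H^T}{W H H^T}, \tag{4}$$
where $\circ$ and the fraction bar denote element-wise multiplication and division, respectively.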
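The multiplicative updating rules for H and W under the SED objective can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper names (`matmul`, `multiply_by_ratio`, `nmf`), the random initialization, and the small constant `EPS` that guards against division by zero are not taken from the paper, which uses SVD-based initialization and GPU kernels.

```python
import random

EPS = 1e-9  # assumption: small constant that avoids division by zero


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Dense product of two matrices stored as lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def multiply_by_ratio(M, num, den):
    """Element-wise M * num / den, the core of both update rules."""
    return [[m * x / (y + EPS) for m, x, y in zip(rm, rx, ry)]
            for rm, rx, ry in zip(M, num, den)]


def update_H(V, W, H):
    # H <- H .* (W^T V) ./ (W^T W H)
    Wt = transpose(W)
    return multiply_by_ratio(H, matmul(Wt, V), matmul(matmul(Wt, W), H))


def update_W(V, W, H):
    # W <- W .* (V H^T) ./ (W H H^T)
    Ht = transpose(H)
    return multiply_by_ratio(W, matmul(V, Ht), matmul(W, matmul(H, Ht)))


def sed(V, W, H):
    """Squared Euclidean distance between V and WH."""
    WH = matmul(W, H)
    return sum((v - x) ** 2 for rv, rx in zip(V, WH) for v, x in zip(rv, rx))


def nmf(V, r, iterations=100, seed=0):
    """Alternately update H and W, starting from a random positive guess."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(r)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(r)]
    for _ in range(iterations):
        H = update_H(V, W, H)
        W = update_W(V, W, H)
    return W, H
```

Because both rules only multiply and divide nonnegative quantities, W and H stay nonnegative throughout, which is why no explicit projection step is needed.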

Parallel Nonnegative Matrix Factorization.
Before describing our experimental study, we briefly introduce the main existing parallel techniques of NMF. By analyzing equations (1) and (2), we can get the basic principle of the iterative calculation of NMF in a parallel manner. Matrix operations are performed in blocks. The block-based parallel updating rules for matrices H and W over multiple processes are shown in Figure 1, and the size of b_m can be adjusted according to the hardware configurations. At the time of initialization, initial W and H are produced. Because SVD-based initialization has been proven to be more effective for iteration of H and W [50], we generate initial W and H by the method of SVD. As you see, the size of matrix W is n × r, the size of the matrix block V_j is n × b_m, and the size of the matrix block H_j is r × b_m, and finally the updated matrix block H_j is obtained. As shown in Figure 1(
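The block-based update of H can be illustrated with a small sketch: V and H are cut into column blocks V_j and H_j of width b_m, each block is updated against the full W, and the updated blocks are concatenated. The function names and the constant `eps` are assumptions made for illustration; in the paper each block would be handled by a separate process or GPU rather than a Python loop.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def col_block(A, start, width):
    """Columns start .. start+width-1 of A (the last block may be narrower)."""
    return [row[start:start + width] for row in A]


def update_H_block(Wt, WtW, V_j, H_j, eps=1e-9):
    """Update one r x b_m block: H_j .* (W^T V_j) ./ (W^T W H_j)."""
    num = matmul(Wt, V_j)
    den = matmul(WtW, H_j)
    return [[h * x / (y + eps) for h, x, y in zip(rh, rx, ry)]
            for rh, rx, ry in zip(H_j, num, den)]


def update_H_blockwise(V, W, H, b_m):
    Wt = transpose(W)
    WtW = matmul(Wt, W)  # r x r, shared by every block
    m = len(V[0])
    # Each block depends only on W, V_j and H_j, so the iterations of this
    # loop are independent and could run on different workers.
    blocks = [update_H_block(Wt, WtW, col_block(V, s, b_m), col_block(H, s, b_m))
              for s in range(0, m, b_m)]
    # Concatenate the updated column blocks back into an r x m matrix.
    return [sum((blk[i] for blk in blocks), []) for i in range(len(H))]
```

Since no block reads another block's columns, the result does not depend on b_m; only the amount of data each worker touches changes.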

Spark.
Conceptually, Apache Spark is an open-source in-memory data analytics cluster computing framework [12,13]. As a MapReduce-like cluster computing engine, Spark possesses the same good characteristics as MapReduce, such as scalability and fault tolerance.
The main abstraction of Spark is the RDD, which makes Spark well suited to iterative jobs, including the PageRank algorithm and the K-means algorithm. RDDs are unique to Spark and thus differentiate Spark from conventional MapReduce engines. In addition, on the basis of RDDs, applications on Spark can keep data in memory across queries and automatically reconstruct data lost during failures. An RDD is a read-only data collection, which can be either a file stored in an external storage system, such as HDFS, or a derived data set generated from other RDDs. An RDD stores information such as its partitions and a set of dependencies on parent RDDs called the lineage. With the help of the lineage, Spark recovers lost data quickly and effectively. Spark shows great performance in processing iterative computation because it can reuse intermediate results and keep data in memory across multiple parallel operations.

GPU-Accelerated Spark Platform.
Modern GPUs are now capable of general computing. Due to the popularity of CUDA on NVIDIA GPUs, which can be considered as a C/C++ extension, we will mostly follow CUDA terminology to introduce GPU computing. Current generations of GPUs are used as accelerators of CPUs, and data are transferred between CPUs and GPUs through PCI-E buses. NVIDIA GPU programming is generally supported by the NVIDIA CUDA environment. A program on the host (CPU) can call a GPU to execute CUDA functions called kernels.
A GPU is a multicore processor designed for parallelizable, computationally intensive tasks. It has very high computational processing power and data throughput. In scientific research and practical applications, the parallelizable computing task modules with little logical processing are often moved to the GPU for execution, which usually yields a large performance improvement.
However, Spark cluster will slow down when processing extremely large-scale data sets, especially when the node number is not very high. At the same time, more and more developers use GPUs for parallel computing to obtain high throughput and performance. Combining Spark with GPU, the mixed architecture is quickly becoming an emerging technology, which embeds the GPU into Spark, implements CPU/GPU integration, and builds an efficient heterogeneous parallel system.
In the CPU/GPU heterogeneous parallel cluster, CUDA-based GPU acceleration technology is used, and the Spark computing tasks are accelerated by GPU. The basic idea is that a part of the operations on Spark RDDs is transferred to the GPU cores. The GPU code execution flow is as follows: (1) data is copied from main memory to GPU global memory; (2) the GPU is driven by CPU instructions; (3) the GPU processes the data in parallel on its cores; and (4) the GPU returns results to main memory. According to this idea and combined with the Spark workflow, the GPU code is encapsulated, and then the data is transmitted between the Spark Worker and the GPU. The basic principle of Spark-GPU fusion is shown in Figure 2.
From the perspective of programming language, GPU programs are usually developed in C/C++, while the Spark platform uses the Java language for program development. Java's JNI (Java Native Interface) technology provides a solution to bridge the GPU and Spark through code encapsulation, implementing interfaces for the Worker to call. Several JNI tools for GPU programming can be used. For example, JCuda (http://www.jcuda.org) is a development kit that provides bindings to the CUDA runtime and currently includes multiple packages such as JCublas, JCufft, JCurand, JCusparse, JCusolver, JCudpp, JNpp, and JCudnn. It makes it convenient to write GPU programs in the Java language. Other user-defined GPU programs written in C/C++ can also be called after being packaged into Java functions.
For the developers, a bidirectional transmission channel between the main memory and the GPU global memory should be established. If the operation of the RDD is transferred to the GPU core, high-speed data transmission between the main memory and the GPU global memory is required, which is also implemented by function encapsulations, as is demonstrated in Figure 2.

GPU-Accelerated NMF.
As we demonstrated the matrix iterative process in equations (3) and (4) and Figure 1, the main principle of GPU-based parallel NMF is presented in Figure 3. The basic idea of GPU-based parallel NMF is to design several kernel functions to implement the update rules for matrices H and W. H and W are blockwise-transferred. In Figure 3, circled operations denote CUDA kernels, and ".*" and "./" denote the point-wise matrix operations of multiplication and division, respectively. Most of the matrix operations can be implemented using the cuBLAS and cuSPARSE libraries, together with two self-defined operations, dot multiplication and dot division. In order to reduce the ...
Then, iteratively perform the above four stages. Caching data in memory is much faster than reading it from the file system in each iteration. When the convergence condition is reached, the matrix updating is terminated, and the results are then written to HDFS.

Classic Collaborative Filtering Algorithm.
Collaborative filtering recommendation algorithms can be divided into two categories: user-based CF and item-based CF. The recommendation process based on collaborative filtering can be described as three stages:
Stage 1: Collect user preferences. After preprocessing the user behavior data, according to different behavior analysis methods, one can choose grouping or weighting to obtain a "user-item" preference matrix V whose size is n × m, where n is the number of users, m is the number of items, and matrix element v_ij denotes the i-th user's preference for the j-th item, which is generally a floating point number in the range [1, 5] or a binary value of 0 or 1. The value highly depends on the content of the item. If the item is a commodity in e-commerce, the value indicates whether the user purchased it or not. Sometimes, it means whether the user watched it or not, whether the user likes or dislikes it, or whether the interest is high or low.
Stage 2: Discovery of similar users or items. In the "user-item" preference matrix, a user's preferences for all items are used as a vector to calculate the similarity between users to obtain the similarity matrix sim. For a specific user u, the similarity values between u and the remaining n − 1 users in the system are sorted in descending order; the k users with the largest similarity values are selected to form the nearest neighbor user set N = {n_1, n_2, ..., n_k}. For the item-based CF, all users' preferences for an item are regarded as a vector to calculate the similarity between items. Generally, there are three common methods for calculating similarity: Euclidean distance, Pearson correlation coefficient, and Cosine similarity.
This paper uses the Pearson correlation coefficient as an example [44]. The reason why we choose the Pearson correlation coefficient is that, different from the Euclidean distance, the Pearson correlation coefficient is able ...
$$sim(u_1, u_2) = \frac{\sum_{j} (v_{u_1 j} - \bar{u}_1)(v_{u_2 j} - \bar{u}_2)}{\sqrt{\sum_{j} (v_{u_1 j} - \bar{u}_1)^2}\,\sqrt{\sum_{j} (v_{u_2 j} - \bar{u}_2)^2}}$$
where u_1 and u_2 denote two users, sim(u_1, u_2) is the similarity of users u_1 and u_2, v_{u_1 j} and v_{u_2 j} are the ratings of the j-th item given by users u_1 and u_2, and $\bar{u}_1$ and $\bar{u}_2$ are the average ratings of users u_1 and u_2.
Stage 3: Generate the prediction matrix and Top-N recommendation results. Using the scores given by the nearest neighbors on the item, the user's score on the specific item is calculated through the weighted average of the similarities. Suppose that user u's nearest neighbor set is N = {n_1, n_2, ..., n_k}; user u's prediction score for an item i is denoted as v'_{ui}, which is shown in the following equation:
$$v'_{ui} = \bar{u} + \frac{\sum_{r \in N} sim(u, r)\,(v_{ri} - \bar{r})}{\sum_{r \in N} |sim(u, r)|}$$
where $\bar{u}$ is the average rating of items by user u, sim(u, r) is the similarity between user u and user r, v_{ri} is the rating of item i by user r, and $\bar{r}$ is the average rating of items by user r. Then, sort the items that user u did not score or purchase according to the predicted score, obtain the Top-N items as the recommendation data set, and recommend them to user u.
Fragment of the GPU update listing for a matrix block (denoted data):
    X ← gpu_multiply(W^T, V)
    WW ← gpu_multiply(W^T, W)
    Y ← gpu_multiply(WW, data)
    data ← gpu_dot_multiply(data, X)
    result ← gpu_dot_divide(data, Y)
    return result
    end function
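Stages 2 and 3 of user-based CF can be sketched as follows. This is a minimal illustration under stated assumptions: a rating of 0 marks an unrated item, each user's mean is taken over that user's rated items, the Pearson sums run over co-rated items only, and the prediction is normalized by the sum of absolute similarities. The function names are hypothetical.

```python
from math import sqrt


def mean_rating(row):
    """Average over the items the user actually rated (0 means unrated)."""
    rated = [v for v in row if v != 0]
    return sum(rated) / len(rated) if rated else 0.0


def pearson(row1, row2):
    """Pearson correlation of two users over their co-rated items."""
    m1, m2 = mean_rating(row1), mean_rating(row2)
    common = [(a - m1, b - m2) for a, b in zip(row1, row2) if a != 0 and b != 0]
    num = sum(x * y for x, y in common)
    den = sqrt(sum(x * x for x, _ in common)) * sqrt(sum(y * y for _, y in common))
    return num / den if den > 0 else 0.0


def predict(V, u, i, k):
    """Predicted rating of user u for item i from the k most similar users."""
    sims = sorted(((pearson(V[u], V[r]), r) for r in range(len(V)) if r != u),
                  reverse=True)
    # Neighbours that did not rate item i cannot contribute to the average.
    neighbours = [(s, r) for s, r in sims[:k] if V[r][i] != 0]
    den = sum(abs(s) for s, _ in neighbours)
    if den == 0:
        return mean_rating(V[u])
    num = sum(s * (V[r][i] - mean_rating(V[r])) for s, r in neighbours)
    return mean_rating(V[u]) + num / den
```

Sorting the unrated items of a user by `predict` and keeping the first N entries would then give the Top-N list described in Stage 3.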

Collaborative Filtering Algorithm Based on NMF.
The collaborative filtering algorithm based on NMF proposed in this paper can be divided into two processes: matrix factorization with dimensionality reduction and collaborative filtering.
(1) Matrix factorization and dimension reduction
Step 1: Using GPU-based NMF, the large-scale user preference matrix V is approximated by the product of two matrices W and H. The base matrix W stands for the item feature matrix, which contains the reduced set of r factors (r is the rank in NMF), and the projection matrix H stands for the user feature matrix, which stores the coefficients of the linear combination of the r factors.
Step 2: According to matrix W, the projection vector of the target user u i 's rating vector corresponding to the base matrix W can be calculated and denoted as h i . However, choosing a suitable number of latent factors will have an impact on the effect of NMF. In this paper, in order to improve the collaborative filtering-based recommendation, we need to select the optimal rank r for NMF. According to the cophenetic correlation coefficient [51], we repeat NMF several times per rank and calculate how similar the results are and, in other words, how stable the identified clusters are, given that the initial seed is random. We choose the highest r before the cophenetic coefficient drops.
(2) Collaborative filtering
Step 1: For the GPU-accelerated user similarity calculation, each user is assigned a thread in the CUDA programming model, a kernel function is designed to calculate the Pearson correlation coefficient, and the similarity between h_i and each column of the projection matrix H is calculated in parallel to obtain the user similarity matrix sim.
Step 2: Top k users with the highest value of similarity form the nearest neighbor set N for user u i .
Step 3: Use the neighbors of u i in the nearest neighbor set N and the corresponding original scores in V to perform weighted calculation to generate the score prediction matrix p.
Step 4: Sort and get Top-N recommendation result using the prediction matrix p.
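Steps 1 and 2 of the collaborative filtering process can be sketched in the reduced space. The sketch assumes, as in the description above, that column j of the projection matrix H is the r-dimensional latent vector of user j; the function names are hypothetical, and the sequential loop over columns stands in for the one-thread-per-user CUDA kernel.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two dense vectors of equal length."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x)) * sqrt(sum((b - my) ** 2 for b in y))
    return num / den if den > 0 else 0.0


def latent_neighbors(H, i, k):
    """Return the k users most similar to user i as (user, similarity) pairs."""
    users = list(zip(*H))  # column j of H is the latent vector of user j
    h_i = users[i]
    # Each similarity is independent of the others; on the GPU every
    # comparison would be computed by its own thread.
    sims = [(pearson(h_i, h_j), j) for j, h_j in enumerate(users) if j != i]
    sims.sort(reverse=True)
    return [(j, s) for s, j in sims[:k]]
```

Working on columns of length r instead of full rating vectors is what makes the similarity step cheaper than in classic user-based CF, since r is much smaller than the number of items.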

Experiment Setups.
For our experiments, we have used four n1-standard-4 instances of Google Compute Engine, and each instance is configured with 4 vCPUs, 15 GB memory, and a 100 GB SSD in the asia-east1 region. Each instance is also configured with an NVIDIA K80 GPU with 2496 CUDA cores and 12 GB global memory. In the 4-node cluster, 64-bit Ubuntu 16.04 LTS is installed, and other software packages include Hadoop 2.7, Spark 2.3, JDK 1.8, and CUDA 9.0.

Data Set.
The MovieLens data set (https://grouplens.org/datasets/movielens/) provided by the GroupLens research group is used in the experiment. It contains the scores of 130,642 movies scored by 7,120 users. We randomly select data sets of different sizes for testing. Each user has rated at least 20 movies, the ratings range from 1 to 5, and the higher the rating, the more satisfied the user was. In the experiment, the movie ratings are converted into a scoring matrix. If a user does not rate a movie, the corresponding matrix element value is 0; thus the scoring matrix is a typical sparse matrix. In our experiment, the randomly selected data set V is further divided into a training set VT and a test set T by splitting the nonzero elements, with 70% of them used as the training set and the other 30% as the test set, making sure that matrix VT and matrix T have the same size as matrix V.
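The split of the nonzero elements can be sketched as below. The function name, the fixed seed, and the use of a uniform shuffle are assumptions for illustration; the text only states that 70% of the nonzero ratings go to VT and 30% to T, and that both keep the shape of V.

```python
import random


def split_ratings(V, train_fraction=0.7, seed=0):
    """Split the nonzero entries of V into two matrices of the same shape."""
    rng = random.Random(seed)
    nonzero = [(i, j) for i, row in enumerate(V)
               for j, v in enumerate(row) if v != 0]
    rng.shuffle(nonzero)
    cut = round(train_fraction * len(nonzero))
    VT = [[0] * len(row) for row in V]  # training matrix
    T = [[0] * len(row) for row in V]   # test matrix
    for idx, (i, j) in enumerate(nonzero):
        target = VT if idx < cut else T
        target[i][j] = V[i][j]
    return VT, T
```

Every rating lands in exactly one of the two matrices, so VT + T reproduces V and entries held out for testing appear as unrated (0) during training.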

Baseline Algorithms
Serial NMF. The serial NMF algorithm is performed in a single thread using CPU only. According to equations (3) and (4), the method of alternately updating W and H is used to obtain the decomposition results by performing multiple iterations.
GPU-based NMF. The GPU-based NMF algorithm is also performed in a single thread but with the support of one GPU device. As shown in Figure 3, the alternate updating of W and H is accelerated by GPU and implemented using the cuBLAS and cuSPARSE libraries, together with two self-defined operations, dot multiplication and dot division.
Spark-based NMF without GPU support. For this algorithm, NMF is computed in a Spark cluster, and each node has no GPU device. Similar to Algorithm 1, in the two stages of rddH.mapPartition and rddW.mapPartition, there is no GPU support for the updating of H and W, and only the CPU is used for matrix operations in each iteration.

Evaluation Metrics for Recommendation.
In order to accurately measure the performance of algorithms, in addition to the running time, the accuracies of prediction scores and recommendation results are also considered. In this paper, we use root mean square error (RMSE) and mean absolute error (MAE) to measure the accuracy of prediction scores. For the measurement of the accuracy of recommended results, the accuracy rate (Precision) and the recall rate (Recall) are generally used for measuring, together with F-measure for comprehensive consideration of contradictions between the two indicators.
RMSE measures the accuracy of predictions based on the root mean square error between the predicted score and the actual score. The smaller the value of RMSE, the more accurate the prediction result and the higher the quality of the recommendation algorithm. The prediction score set p = {p_1, p_2, ..., p_n} is obtained through training, and the actual user preference score set T = {t_1, t_2, ..., t_n} is in the test set. Therefore, RMSE is defined as
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (p_i - t_i)^2}.$$
MAE measures the accuracy of predictions based on the average deviation between the predicted score and the actual score, which is defined as
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} |p_i - t_i|.$$
In our experiments, we use three evaluation metrics to evaluate the recommendation performance: Precision, Recall, and F-measure. Among the items that have never been purchased or rated, the N items with the highest predicted ratings are selected to form the Top-N recommendation list. We define R_u as the set of items recommended for user u and define T_u as the set of items actually liked by user u in the test set. Precision is the proportion of relevant items among the recommended items. Simply speaking, it is the recommendation hit rate (a hit means that the recommended item has a score in the test set and the score exceeds a certain threshold). We define U as the set of all users, and Precision is defined as
$$\mathrm{Precision} = \frac{\sum_{u \in U} |R_u \cap T_u|}{\sum_{u \in U} |R_u|}.$$
Recall is the ratio of the correctly recommended items to all items actually liked by the users in the test set, which is defined as
$$\mathrm{Recall} = \frac{\sum_{u \in U} |R_u \cap T_u|}{\sum_{u \in U} |T_u|}.$$
F-measure is the harmonic average of Precision and Recall, which is defined as follows:
$$F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}.$$
In this paper, the training data VT is factorized by NMF, and then Top-N recommendation results are generated according to the algorithms in Subsection 5.2. The test data T is only used for calculating the various recommendation evaluation metrics, such as RMSE, MAE, Precision, and Recall, without projecting the test data into the latent space created by the training data.
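The evaluation metrics can be computed with a few helper functions. This sketch assumes the standard definitions: RMSE and MAE over paired predicted and actual scores, and Precision and Recall pooled over all users from the recommended sets R_u and the liked sets T_u. The function names and the dictionary layout (user id mapped to a set of item ids) are illustrative choices.

```python
from math import sqrt


def rmse(p, t):
    """Root mean square error between predicted and actual scores."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, t)) / len(p))


def mae(p, t):
    """Mean absolute error between predicted and actual scores."""
    return sum(abs(a - b) for a, b in zip(p, t)) / len(p)


def precision_recall_f(recommended, liked):
    """Pooled Precision, Recall and F-measure over all users.

    recommended[u] is the set R_u; liked[u] is the set T_u.
    """
    hits = sum(len(recommended[u] & liked.get(u, set())) for u in recommended)
    n_rec = sum(len(items) for items in recommended.values())
    n_liked = sum(len(items) for items in liked.values())
    precision = hits / n_rec if n_rec else 0.0
    recall = hits / n_liked if n_liked else 0.0
    total = precision + recall
    f = 2 * precision * recall / total if total else 0.0
    return precision, recall, f
```

Lengthening the Top-N list tends to raise Recall and lower Precision, which is the tension the F-measure is meant to balance.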

Result Analysis of NMF.
In the experiments, we conducted performance evaluations using four algorithms: (i) Serial NMF, (ii) GPU-based NMF, (iii) Spark-based NMF without GPU support, and (iv) Spark-based NMF with GPU support, which is proposed in this paper and developed on the Spark-GPU fusion platform. We designed three performance comparisons to validate the newly proposed algorithm. We selected several typical matrix dimensions and fixed the number of iterations at 100.
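For orientation, a serial baseline of the kind listed as algorithm (i) might look like the following NumPy sketch. It assumes the common multiplicative update rules for the Frobenius-norm objective, which this section does not restate; the function name `nmf_serial`, the random initialization, and the small constant `eps` are likewise assumptions, not the authors' implementation.

```python
import numpy as np

def nmf_serial(V, r, iterations=100, eps=1e-9, seed=0):
    """Factorize a nonnegative n x m matrix V into W (n x r) and H (r x m).

    Uses multiplicative updates (an assumption for this sketch); eps guards
    against division by zero.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, r)) + eps
    H = rng.random((r, m)) + eps
    for _ in range(iterations):
        # Update H with W fixed, then W with the new H fixed.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ (H @ H.T) + eps)
    return W, H

# Small random matrix for illustration; the experiments use far larger ones.
V_demo = np.random.default_rng(1).random((80, 160))
W_demo, H_demo = nmf_serial(V_demo, r=5, iterations=100)
```

Because every update only multiplies nonnegative quantities, W and H stay nonnegative throughout the 100 iterations.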

Performance of GPU Speedup.
We performed GPU-based NMF on a single node and varied the matrix dimensions as shown in Figure 4. We measured the computation time and then performed the serial NMF on the same node in order to calculate the GPU speedup and validate the effectiveness of GPU acceleration. The speedup is defined as the ratio of the computation time of the single-node serial method to the computation time of the single-node GPU method; that is, $\mathrm{Speedup} = T_{\mathrm{serial}} / T_{\mathrm{gpu\_parallel}}$. The speedup varies with the matrix dimensions, and the maximum speedup obtained for the GPU over the CPU is 45x.
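The speedup ratio defined above, and the "percentage of time saved" figures quoted in the following subsections, can be computed with small helpers such as the ones below. The helper names are hypothetical, and the timings in the example are invented to illustrate the arithmetic; they are not measurements from the experiments.

```python
import time

def measure_seconds(fn, *args, repeats=3):
    """Return the best wall-clock time in seconds over several runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def speedup(t_serial, t_gpu_parallel):
    """Speedup = T_serial / T_gpu_parallel."""
    if t_gpu_parallel <= 0:
        raise ValueError("parallel time must be positive")
    return t_serial / t_gpu_parallel

def time_saved_percent(t_baseline, t_new):
    """Percentage of the baseline time saved by the new method."""
    return 100.0 * (t_baseline - t_new) / t_baseline

# Hypothetical timings: a 90 s serial run against a 2 s GPU run is a 45x speedup.
example_speedup = speedup(90.0, 2.0)
```

Taking the best of several repeats is one common way to reduce timing noise; whether the reported times are single runs or averages is not stated here.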

Performance of NMF on Spark.
In this evaluation, we started the Spark cluster and varied the number of worker nodes from 1 to 4. We used matrix dimensions of 800 * 800, 800 * 1600, 800 * 3200, 1600 * 1600, 800 * 6400, and 1600 * 3200 and measured the computation time of NMF on the Spark platform; the results are shown in Figure 5. When the number of nodes is 4, we set the number of Spark executors to 16, and, as the matrix dimensions increase, the advantage of using 4 nodes becomes more and more obvious. Compared with the 3-node Spark platform, the 4-node platform saves about 50% of the computation time.

Performance of NMF on Spark with GPU Support.
In the last evaluation, we started the Spark cluster with 4 nodes, used matrix dimensions of 6400 * 6400, 3200 * 25600, 6400 * 12800, and 6400 * 25600, and compared NMF with GPU support against NMF without GPU support. As can be seen from Figure 6, on the 4-node Spark platform, the computation time of NMF with GPU is smaller than that of NMF without GPU. When the size of the matrix is 6400 * 25600, NMF on Spark with GPU support saves about 10.8% of the time. NMF on the GPU-accelerated Spark platform therefore executes more efficiently.
Because of the mathematical foundations of NMF and the blockwise parallel principle, data are frequently distributed to and collected from all executors, so the communication cost of NMF on Spark is very high. Compared with these data distributions and collections, the execution of the mapPartition function takes much less time thanks to GPU acceleration. From the perspective of time analysis, communication and data exchange are the bottlenecks of the parallel NMF algorithm. NMF on the GPU-accelerated Spark platform therefore still has great potential for improvement.
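The following NumPy sketch illustrates why a blockwise NMF iteration involves repeated collection and redistribution of data. It is a single-process simulation under stated assumptions: V and W are partitioned by rows, multiplicative updates are used, and each list element stands in for the data held by one executor. The partitioning actually used on the Spark platform may differ, and the function names are hypothetical.

```python
import numpy as np

def update_H_blockwise(V_blocks, W_blocks, H, eps=1e-9):
    """One multiplicative update of H computed from row blocks of V and W."""
    # Local step: each block computes small partial products on its own data.
    partial_WtV = [W_b.T @ V_b for V_b, W_b in zip(V_blocks, W_blocks)]  # r x m each
    partial_WtW = [W_b.T @ W_b for W_b in W_blocks]                      # r x r each
    # Collection step: the partial products must be summed in one place.
    WtV = sum(partial_WtV)
    WtW = sum(partial_WtW)
    # The updated H then has to be sent back to every block (distribution).
    return H * WtV / (WtW @ H + eps)

def update_W_blockwise(V_blocks, W_blocks, H, eps=1e-9):
    """Update each row block of W locally, given the freshly distributed H."""
    HHt = H @ H.T
    return [W_b * (V_b @ H.T) / (W_b @ HHt + eps)
            for V_b, W_b in zip(V_blocks, W_blocks)]

# Toy example with 4 simulated executors.
rng = np.random.default_rng(0)
V = rng.random((12, 8))
W = rng.random((12, 3))
H = rng.random((3, 8))
V_blocks = np.array_split(V, 4)
W_blocks = np.array_split(W, 4)
H_new = update_H_blockwise(V_blocks, W_blocks, H)
W_new_blocks = update_W_blockwise(V_blocks, W_blocks, H_new)
```

In this layout the matrix products inside each block are the part that a GPU can accelerate, while the summation and the redistribution of H happen once per iteration regardless of how fast the local products are, which is consistent with communication being the bottleneck.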

Result Analysis of Collaborative Filtering.
In the experiments, we compared the performance of three algorithms: traditional user-based CF, traditional item-based CF, and the NMF-based CF proposed in this paper. The matrix sizes used for testing are 400 * 800, 400 * 1600, 800 * 1600, 800 * 3200, and 1200 * 3200. The number of iterations is 100, and we select 50 items for each user as the Top-50 recommendation list.

Comparison of Score Prediction Accuracy.
To compare the prediction accuracy, we compared the RMSE and MAE of the three algorithms, as shown in Figures 7(a) and 7(b), respectively. For all five score matrix sizes, NMF-CF is significantly better than User-CF and Item-CF in terms of both RMSE and MAE. NMF-CF has the smallest RMSE and MAE, while the Item-CF algorithm has the largest prediction error and the worst prediction quality. When the size of the matrix is 400 * 800, the RMSE of NMF-CF is 31.64% lower than that of the Item-CF algorithm, and the MAE of NMF-CF is 28.5% lower.

Comparison of Recommendation Accuracy.
The Precision, Recall, and F-measure of the three algorithms are shown in Figures 8(a)-8(c), respectively. Across the five score matrix sizes, all indicators decline for all three algorithms as the matrix grows. NMF-CF is superior to the User-CF and Item-CF algorithms in all three indicators, which shows that the quality of CF recommendations based on NMF is the best. However, as the matrix size n × m increases, and especially as m increases, the number of items grows while we still recommend only 50 items per user. When Precision and Recall are calculated against the 30% test set, the hit rate of the recommended items is therefore lower, and the advantage of NMF becomes smaller and smaller. When the matrix size is 1600 * 3200, the results of the NMF-CF and User-CF algorithms are almost the same. The recommendation quality of the Item-CF algorithm is always the worst.

Comparison of Running Time for Recommendation.
First, we evaluated the running time of the NMF-CF algorithm alone under two conditions on the Spark platform: (i) CPU-based NMF-CF (only 1 node used) and (ii) CPU + GPU-based NMF-CF (4 nodes used). The running times for the five matrix sizes are shown in Figure 9. It can be seen from the figure that the computation time for NMF is significantly reduced when GPU acceleration is adopted. As the matrix size becomes larger, the parallel efficiency becomes higher and the acceleration effect becomes better. When the matrix size is 1600 * 3200, the running time of the CPU + GPU-based approach is 44.8% lower than that of the CPU-based approach, which also demonstrates the acceleration that the GPU provides for NMF-CF.
Then, the running times of the three algorithms were compared on the 4-node Spark platform with GPU support. All three algorithms use the GPU to calculate the Pearson correlation coefficient, and the NMF-CF algorithm additionally uses the GPU to compute the NMF. The running time comparison is shown in Figure 10. It can be seen from the figure that the running time of each algorithm increases with the matrix size. The NMF-CF recommendation algorithm runs significantly faster than the User-CF and Item-CF algorithms. When the matrix size is 1600 * 3200, the running time of the NMF-CF algorithm is 33.3% lower than that of the Item-CF algorithm and 22.5% lower than that of the User-CF algorithm.
Overall, compared with traditional CF, the NMF-CF recommendation algorithm includes the decomposition of the scoring matrix V into W and H, which appears to be a time-consuming operation. In fact, this step saves NMF-CF a great deal of time when the correlation coefficients are calculated later. The size of matrix W is n × r, where r reflects the number of features or topics and is usually very small (it generally takes a value from 2 to 10), so W is small. Calculating the correlation coefficients to obtain the k-nearest neighbors of each user therefore takes much less time in the NMF-CF algorithm than in the User-CF or Item-CF algorithm. In addition to its higher accuracy, the NMF-based CF recommendation algorithm runs in parallel on GPUs, and its elapsed computation time is the shortest.
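The cost argument above can be made concrete with a short NumPy sketch that finds each user's k nearest neighbors from the rows of W by Pearson correlation. It runs on the CPU for illustration, whereas the experiments compute the correlation on the GPU; the function name `pearson_neighbors` and the toy matrix are assumptions made for this sketch.

```python
import numpy as np

def pearson_neighbors(features, k):
    """For each row (user), return the indices of its k most similar rows.

    features is an n x d matrix: d = r when the rows of W are used, and
    d = m when the raw score matrix V is used as in User-CF.
    """
    sim = np.corrcoef(features)        # n x n Pearson correlation between rows
    np.fill_diagonal(sim, -np.inf)     # a user is not its own neighbor
    return np.argsort(-sim, axis=1)[:, :k]

# Toy latent-feature matrix with n = 4 users and r = 4 features (made up).
W_toy = np.array([
    [1.0, 2.0, 3.0, 5.0],
    [3.0, 5.0, 7.0, 11.0],   # 2 * row 0 + 1, so perfectly correlated with row 0
    [5.0, 3.0, 2.0, 1.0],
    [2.0, 1.0, 4.0, 3.0],
])
neighbors = pearson_neighbors(W_toy, k=2)
```

Each pairwise correlation works on vectors of length r instead of length m. For an 800 * 3200 score matrix and r = 10, for example, the vectors are 320 times shorter, which is where the saving in the neighbor search comes from.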

Conclusion
In a heterogeneous CPU/GPU cluster, the nodes have large memory resources and GPU multicore resources, and the advantages of distributed storage between nodes and data sharing within nodes should be utilized. Heterogeneous parallel computing is an efficient and feasible parallel programming strategy. A GPU-accelerated NMF algorithm on the Spark platform has been designed in this paper to solve the problem of the low processing speed of NMF as the size of the matrix increases.
Through the performance evaluations, the experimental results have shown that the combination of Spark-based in-memory computing and GPU acceleration achieves higher execution efficiency. On the other hand, recommendation systems have been widely applied in many fields, but as the numbers of users and items increase, computation becomes slower and the accuracy of recommendation decreases. Although traditional collaborative filtering has been extremely successful in recommendation systems, as the data grow, the recommendation algorithms are confronted with various problems, such as scalability, cold start, and matrix sparseness. This paper implemented the NMF algorithm for collaborative filtering recommendation, which combines NMF with traditional collaborative filtering methods, decomposes the original score data into a base matrix and a projection matrix, and runs in parallel on the Spark platform accelerated by GPU. Experiments on matrices of different sizes show that the parallel NMF collaborative filtering recommendation algorithm not only improves the prediction and recommendation accuracy but also greatly improves the computational efficiency.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.