User Value Identification Based on Improved RFM Model and K-Means++ Algorithm for Complex Data Analysis

College of Information Science and Technology, Beijing University of Chemical Technology, Beijing 100029, China
School of Economic and Management, Beijing University of Chemical Technology, Beijing 100029, China
Industrial and Commercial Bank of China Limited, Beijing 100088, China
Regional Green Economy Development Research Center, School of Business, Wuyi University, Wuyishan 354300, China
Datang Carera (Beijing) Investment Co. Ltd., Beijing 100191, China


Introduction
Big data technology is a product of the development of information processing technology. In the current era of data explosion, enterprise operations produce massive amounts of data. Big data technology is now widely used in many enterprise operation scenarios and in Internet of Things (IoT) environments [1][2][3][4][5][6], providing support for a wide range of enterprise decisions. Compared with traditional marketing methods such as questionnaire surveys, the purchase behavior of users often reflects their psychological preferences. In today's business competition, customers are the main focus for a company that wants to maintain excellent performance. Acquiring new customers has been found to cost much more than retaining existing ones, so what companies care about most is how to sell more products to existing customers. The number of e-commerce platforms is increasing rapidly, operating costs keep rising, and marketing expenditure is expanding accordingly. Using the purchase records of platform users to understand the decisions they make in a real environment has therefore become an urgent problem for the efficient operation of enterprises. At the same time, community e-commerce is a new business model, and there is little research on the classification of community e-commerce customers. Against this background, this paper provides an effective method for an e-commerce platform to identify customer value and implement a precision marketing strategy.
The research object of this paper is the customers of the T-app community e-commerce platform, which mainly sells cooked food, pasta, and other goods, 134 kinds in total. The platform focuses on community traffic and targets the needs of community residents, and its customers' purchase behavior is characterized by a low amount per purchase and a high purchase frequency. The RFM model is highly representative in reflecting customer value and customer purchase preference and is widely used in the financial industry [7], retail [8], insurance and telecommunication [9], the education industry [10], and the e-commerce industry [11]. The RFM model is therefore well suited to profiling the research object of this paper.
However, in the original RFM model, the R index (the latest consumption time) is subject to considerable randomness. New customers who have just come into contact with the platform and loyal customers of the platform may show the same value of the R index, and the model cannot describe a customer's dependence on a single commodity. In order to help the platform describe the multidimensional attributes of customers and assess customer value more accurately, an improved RFM model with five indicators is proposed.
Therefore, based on the improved RFM model and the K-means++ algorithm, this paper proposes a method for classifying the users of the T-app community e-commerce platform. The improved RFM model is used to analyze the real purchase records of users on the platform, and the K-means++ clustering algorithm is used to classify the users. Finally, the classification results are explained, and a customer value analysis is given for each user segment based on its specific indicators. This study uses a quantitative analysis method to segment and cluster platform customers. Customer segments with clear value and purchase preferences help the platform allocate marketing resources effectively to specific customer groups and establish healthy long-term relationships with customers.

Relevant Works
2.1. RFM Model. The RFM model was first proposed by Hughes [12]. As a popular tool of customer value analysis, it has been widely used for measuring customer lifetime value [13] and in customer segmentation and behavior analysis [14]. In the following paragraphs, we provide a brief description of the RFM model in the above literature.
RFM is short for recency, frequency, and monetary, which refer to recency of the last purchase, purchase frequency, and monetary value of purchase, respectively. R (recency) represents the time interval between a customer's last purchase date and the end date of a statistical period. The shorter the interval, the bigger the value of R. F (frequency) indicates the number of purchases made by the customer during the statistical period. The larger the value of F is, the higher the customer loyalty and the stronger the intention to repurchase would be. M (monetary) represents the total amount the customer spends in purchases during the statistical period. Hughes attached equal importance to these three variables [12], while Stone believed that the importance of the three variables varies among industries due to their different characteristics, suggesting unequal weights of these variables [13]. The RFM model is widely used in customer value analysis, and researchers have extended it according to different aspects.
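The three definitions above can be made concrete with a minimal Python sketch. The records, customer IDs, and the function name `rfm` are hypothetical and only illustrate the definitions; recency is reported as the raw interval in days, before any scoring is applied.

```python
from datetime import date

# Hypothetical purchase records: (customer_id, purchase_date, amount).
records = [
    ("A", date(2018, 9, 3), 12.0),
    ("A", date(2018, 11, 20), 8.5),
    ("B", date(2018, 12, 28), 30.0),
]
period_end = date(2018, 12, 30)  # end date of the statistical period


def rfm(records, period_end):
    """Return {customer_id: (recency_days, frequency, monetary)}."""
    acc = {}
    for cid, day, amount in records:
        last, freq, total = acc.get(cid, (None, 0, 0.0))
        if last is None or day > last:
            last = day  # keep the most recent purchase date
        acc[cid] = (last, freq + 1, total + amount)
    # Recency: days between the last purchase and the end of the period.
    return {
        cid: ((period_end - last).days, freq, total)
        for cid, (last, freq, total) in acc.items()
    }


table = rfm(records, period_end)
```

Here customer "A" has a recency of 40 days, a frequency of 2, and a monetary value of 20.5.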
Cheng and Chen combined RFM analysis with rough set theory to establish rules for customer classification [13]. Chiang proposed an RFMDR model (based on an RFM/RFMD model), an extended version of RFM analysis, to identify valuable online shopping customers for the industry and to generate fuzzy association rules [15]. Kolarovszki et al. proposed a novel modeling method for postal services using multidimensional segmentation; this CRM design proves useful in postal service companies [16]. Song et al. proposed a statistics-based approach to evaluate potential users via time series, which makes it possible to segment the time intervals of RFM in a large-scale dataset [17]. To address the randomness of the recent consumption time index R in the RFM model and the collinearity between index F and index M, Bao et al. proposed an improved RFM model, used K-means to cluster the customer data, and demonstrated the effectiveness of the improved model [18]. In order to help the platform describe the multidimensional attributes of customers and assess customer value more accurately, an improved RFM model with five indicators is proposed. In this study, we use the RFM model as the basis for selecting the clustering variables, so as to establish the clustering criteria objectively.

K-Means++ Algorithm
Clustering is the process of dividing a set of physical or abstract objects into groups of similar objects. The K-means algorithm, one of the most popular clustering algorithms, was first used by MacQueen in 1967 [19], and it has been used extensively in various fields including data mining, statistical data analysis, and other business applications.
The literature survey reveals that one of the major applications of K-means is customer segmentation [20]. The K-means algorithm is widely used to identify valuable customers effectively and to develop pertinent marketing strategies [21]. In particular, Cheng and Chen used the RFM model and K-means to perform customer relationship management, and their experimental results demonstrate that the proposed model is an effective method for customer value analysis [13].
K-means is a fast clustering method, but the accuracy and running time of its results depend largely on the location of the initial clustering centers [22][23][24]. To address the sensitivity of K-means to its initial points, Arthur and Vassilvitskii proposed the K-means++ algorithm, which improves on the random selection of initial clustering centers in K-means by making the initial centers as far apart from each other as possible. Their results show that K-means++ can significantly reduce the final error of the classification results [25]. Because the K-means++ algorithm is accurate and efficient, this paper uses it to classify customers.
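The seeding idea of K-means++ can be illustrated with a short Python sketch. It is a minimal illustration of distance-weighted sampling of initial centers, not the implementation used in this study; the sample points and the function names `sq_dist` and `kmeanspp_init` are hypothetical.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_init(points, k, rng):
    """Pick k initial centers; assumes at least k distinct points."""
    centers = [rng.choice(points)]  # first center: uniform at random
    while len(centers) < k:
        # D(x)^2: squared distance from each point to its nearest chosen center.
        d2 = [min(sq_dist(p, c) for c in centers) for p in points]
        # The next center is drawn with probability proportional to D(x)^2,
        # so points far from the existing centers are favored.
        centers.append(rng.choices(points, weights=d2, k=1)[0])
    return centers


points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = kmeanspp_init(points, 2, random.Random(0))
```

An already chosen point has weight zero, so it cannot be drawn again; the standard K-means iterations would then start from `centers`.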

Improved RFM Model and K-Means++ Clustering Algorithm
In order to accurately identify the user value of the T-app community e-commerce platform, this study uses an improved RFM model to extract a user's features and uses the K-means++ clustering algorithm to achieve user classification. The indicators of the traditional RFM model portray customer characteristics from three perspectives: recency of the last purchase (R), frequency of the purchases (F), and monetary value of the purchases (M). However, the user groups and scenarios studied in this article are quite different from the previous literature: (1) the user group is relatively fixed, (2) the consumer goods are relatively singular, and (3) the characteristics of repeated purchases are obvious. Therefore, based on the existing literature, we extract the characteristics of the users studied and improve and model the traditional indicators, as follows.

Indicator Definition
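The calculation formula of the $R_1$ indicator is

$$R_1 = \frac{T_{\text{last time}} - T_{\text{first time}}}{F_1}, \tag{1}$$

where $T_{\text{last time}}$ indicates the time of the customer's last order transaction within the reference time period, $T_{\text{first time}}$ indicates the time of the customer's first order transaction within the reference time period, and $F_1$ is the number of the customer's transactions in the reference time period.
The calculation formula of the $M_1$ indicator is

$$M_1 = \sum_{i=1}^{n} M_i, \tag{2}$$

where $n$ represents the total number of a customer's purchases in the reference time period and $M_i$ represents the amount of a single purchase by the customer. The calculation formula of the $S$ indicator is

$$S = T_{\text{last time}} - T_{\text{hfirst time}}. \tag{3}$$

In (3), $T_{\text{hfirst time}}$ is the time of the first transaction in the customer's purchase history.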
In the improved RFM model, we adopt the average consumption time interval ($R_1$) in place of the recency of the last purchase (R), thus overcoming the large randomness of the R indicator in the traditional RFM model; for regular customers with a high transaction frequency, the average time between order transactions is more representative. The customer contribution time indicator ($S$), that is, the time interval from the first transaction in the customer's history to the last transaction in the reference period, reflects the customer's loyalty to the platform and continuous consumption ability, while the repurchase indicator ($P$) describes the degree of customer dependence on a single product. In order to normalize and standardize the heterogeneous indicators, the $R_1$ indicator is standardized by the reverse standardization method, namely, formula (4), and the other indexes are standardized by the forward standardization method, namely, formula (5).
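As a rough illustration of these indicators, the following Python sketch computes $R_1$, $F_1$, $M_1$, and $S$ for one customer. It assumes that $R_1$ is the span between the first and last order in the reference period divided by the number of orders, which matches the 52 days, 5 purchases, and 10.40 days reported for one customer in the case study; the individual order dates and amounts are hypothetical, and $P$ is omitted because its formula is not given here.

```python
from datetime import date


def improved_indicators(period_orders, first_order_ever):
    """period_orders: list of (date, amount) within the reference period.
    first_order_ever: date of the first order in the customer's history."""
    days = sorted(d for d, _ in period_orders)
    f1 = len(period_orders)                 # F1: number of transactions
    m1 = sum(a for _, a in period_orders)   # M1: total amount spent
    r1 = (days[-1] - days[0]).days / f1     # R1: assumed average interval
    s = (days[-1] - first_order_ever).days  # S: relationship duration in days
    return {"R1": r1, "F1": f1, "M1": m1, "S": s}


# Hypothetical orders chosen to reproduce 52 days, 5 purchases, 64 yuan.
orders = [
    (date(2018, 11, 2), 10.0),
    (date(2018, 11, 15), 14.0),
    (date(2018, 12, 1), 12.0),
    (date(2018, 12, 10), 16.0),
    (date(2018, 12, 24), 12.0),
]
result = improved_indicators(orders, first_order_ever=date(2018, 11, 2))
```

For these orders the sketch gives $F_1 = 5$, $M_1 = 64$, $S = 52$, and $R_1 = 52 / 5 = 10.4$.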

Determination of Indicator
In the above formulas, $x_{ij}$ represents the value of the $j$-th index of the $i$-th sample.
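Forward and reverse standardization are commonly implemented as min-max scaling. The Python sketch below assumes that form, which may differ in detail from formulas (4) and (5); the sample values and function names are hypothetical.

```python
def forward_standardize(values):
    """Min-max scaling: larger raw values map closer to 1.
    Assumes the values are not all equal (otherwise hi - lo is zero)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def reverse_standardize(values):
    """Reversed min-max scaling: smaller raw values map closer to 1."""
    lo, hi = min(values), max(values)
    return [(hi - v) / (hi - lo) for v in values]


r1_raw = [2.0, 10.4, 30.0]  # hypothetical average intervals in days
f1_raw = [24, 5, 2]         # hypothetical purchase counts
r1_std = reverse_standardize(r1_raw)
f1_std = forward_standardize(f1_raw)
```

With this convention a short average interval and a high purchase count both map to values near 1, so all five indicators point in the same direction before weighting.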

3.3.2. Step 1-2: Index Weight Calculation Using Entropy Method. In order to obtain more objective segmentation results, in this study, we use the entropy weight method to calculate the weight of each index.
The contour (silhouette) coefficient [26] considers the agglomeration and separation of clusters comprehensively. Good clustering results should have both tight cohesion within clusters, that is, small within-cluster distances, and a large degree of separation between clusters. In this article, we use the maximum average contour coefficient method to determine the optimal cluster number K. The calculation process is as follows.
Suppose the data to be classified is divided into K clusters. For each vector in a cluster, calculate its contour coefficient. For a point $i$, the contour coefficient of the $i$ vector is

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}. \tag{6}$$

In formula (6), $a(i)$ = average (the distance from the $i$ vector to all other points in the cluster to which it belongs), that is, the average dissimilarity between the $i$ vector and the other points in the same cluster; $b(i)$ = min (the average distance from the $i$ vector to all points in each other cluster), that is, the minimum over the other clusters of the average dissimilarity between the $i$ vector and the points of that cluster. Averaging the contour coefficients of all vectors gives the total contour coefficient of the clustering result. The larger the total contour coefficient, the better the clustering result.
The calculation result of the total contour coefficient is shown in Figure 1.
When K is 4, the total contour coefficient of the clustering is 0.4022, which is the largest in the range of (3,9). It can be concluded that the optimal cluster number $K_{\text{opt}}$ is 4 for this data set.
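The total contour coefficient can be computed directly from its definition. The Python sketch below uses Euclidean distance and scores a given labeling; in practice the score would be computed for the clustering obtained at each candidate K and the K with the largest score kept. The points, labels, and the function name `silhouette` are hypothetical, and the zero score for single-point clusters is a common convention rather than something stated in this paper.

```python
import math


def silhouette(points, labels):
    """Mean contour (silhouette) coefficient; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention for single-point clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself contributes zero to the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(points, [0, 0, 1, 1])  # groups nearby points together
poor = silhouette(points, [0, 1, 0, 1])  # groups distant points together
```

The labeling that groups nearby points scores close to 1, while the labeling that pairs distant points scores below 0.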
In formula (a) of Figure 2, $D(x')^2 / \sum_{x \in X} D(x)^2$, the value $D(x)$ represents the distance from the data point $x$ to the nearest cluster center already selected, and the next center $S_i = x' \in X$ is selected with this probability. In formula (b) of Figure 2, the average error is equal to $\sum_{i=1}^{n} \left[\min_{r=1,\dots,k} d(x_i, c_r)\right]^2$.
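The error in formula (b) can be evaluated with a few lines of Python. This is a minimal sketch assuming Euclidean distance; the points, centers, and the function name `clustering_error` are hypothetical.

```python
def clustering_error(points, centers):
    """Sum over all points of the squared distance to the nearest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
# Centers placed in the middle of each natural group.
good = clustering_error(points, [(0.0, 1.0), (10.0, 1.0)])
# Both centers placed in the same group.
poor = clustering_error(points, [(0.0, 1.0), (0.0, 2.0)])
```

With well-spread centers the error is 4.0, while placing both centers in the same group raises it to 202.0, which is the kind of poor initialization that distance-weighted seeding makes unlikely.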

Case Study
4.1. Numerical Experiment. The observation period is from September 1, 2018, to December 30, 2018, a span of about four months. There are 3558 customer purchase records and 580 customers. The community shopping platform mainly sells 134 kinds of commodities such as cooked food and pasta. The data are processed as follows.

Raw Data Cleaning and Index Calculation.
The initial data consists of 12 dimensions such as user ID, product ID, purchase quantity, and consumption date. Four of these dimensions (user ID, consumption amount, consumption time, and product ID) are selected, and the corresponding five indicators of each customer are calculated to form the initial data, as in Table 1.
In Table 2, the customer whose user ID is YH181102000001 has been with the platform for 52 days. A total of 5 purchases occurred during the reference time period, with a cumulative consumption of 64 yuan, and on average a purchase was made on the platform every 10.40 days. The purchases were relatively casual, and no product was repurchased (the value of the P indicator is 1).
The R 1 indicator adopts the reverse standardization method, namely, formula (4), and the other indicators adopt the forward standardization method, namely, formula (5) for standardization.

Empower Indicators.
The entropy method is used to obtain the weight of each index; the weights are shown in Table 2. The standardized value of each index is then multiplied by its corresponding weight to obtain the weighted data set. $R_1'$, $F_1'$, $M_1'$, $S'$, and $P'$ are the weighted indexes.
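The weighting step can be sketched in Python using the textbook form of the entropy weight method: an index whose values are spread unevenly across samples has low entropy and receives a high weight. The exact variant used in this study is not spelled out here, so the formulas in the code are an assumption, and the data matrix and function names are hypothetical.

```python
import math


def entropy_weights(matrix):
    """matrix[i][j]: standardized, non-negative value of index j for sample i.
    Assumes at least two samples and no all-zero column."""
    n, m = len(matrix), len(matrix[0])
    divergences = []
    for j in range(m):
        column = [row[j] for row in matrix]
        total = sum(column)
        # Share of each sample in index j; 0 * ln(0) is treated as 0.
        shares = [v / total for v in column]
        entropy = -sum(p * math.log(p) for p in shares if p > 0) / math.log(n)
        divergences.append(1.0 - entropy)
    norm = sum(divergences)
    return [d / norm for d in divergences]


def apply_weights(matrix, weights):
    """Multiply each standardized value by the weight of its index."""
    return [[v * w for v, w in zip(row, weights)] for row in matrix]


data = [
    [1.0, 0.5],
    [0.0, 0.5],
    [0.0, 0.5],
]
w = entropy_weights(data)
weighted = apply_weights(data, w)
```

In this toy matrix the second index is identical for all samples and carries no information, so nearly all of the weight goes to the first index.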

Clustering Using K-Means++ Method.
After using the K-means++ algorithm to cluster the weighted data set, we get four different customer groups. We export the standardized data of the four customer groups and calculate, for each group, the average value of the five indicators and the number of users, which gives Table 3. The data for each user group in Table 3 are averages of the standardized data before weighting.

Customer Value Ranking and Value Analysis.
Since the user data used for clustering are weighted, the total value of each customer type can be obtained by adding up the values of its weighted indicators ($R_1'$, $F_1'$, $M_1'$, $S'$, and $P'$). The calculated results are shown in Table 4.
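The ranking step can be sketched as follows. The sketch assumes that the value of a customer group is the sum, over the indicators, of the group's mean weighted indicator values; the rows, labels, and the function name `rank_clusters` are hypothetical.

```python
def rank_clusters(weighted_rows, labels):
    """Return [(cluster_label, total_value)] sorted from highest to lowest."""
    sums, counts = {}, {}
    for row, lab in zip(weighted_rows, labels):
        acc = sums.setdefault(lab, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[lab] = counts.get(lab, 0) + 1
    # Assumed value of a cluster: sum over indicators of the cluster mean.
    value = {lab: sum(acc) / counts[lab] for lab, acc in sums.items()}
    return sorted(value.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical weighted rows with two indicators and two clusters.
weighted_rows = [[0.25, 0.25], [0.25, 0.75], [0.0, 0.25]]
labels = [1, 1, 2]
ranking = rank_clusters(weighted_rows, labels)
```

Here cluster 1 has a total value of 0.75 and cluster 2 a total value of 0.25, so cluster 1 ranks first.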

Wireless Communications and Mobile Computing
It can be seen from the model that the higher the total value, the greater the profit contribution of this type of customer to the platform. Therefore, according to the value ranking and the different characteristics of the customers, we divide customers into important retention customers, development customers, loyal customers, and general customers. From the data, the characteristics of each type of customer are as follows.

Type 1: Important Retention Customers.
Customer group 1 has the greatest value to the platform and consists of customers the platform should make a priority of retaining. Its S indicator is large among all customer groups, indicating that this customer group has maintained a long-term consumer relationship with the platform; the $M_1$ indicator is much higher than that of the other customer groups, indicating that this customer group has spent a lot on the platform and is an important source of profit. At the same time, the value of the P indicator is also much higher than that of the other customer groups, indicating that this customer group is more likely to repurchase a single product. The $F_1$ indicator and the reverse-standardized $R_1$ indicator show that this type of customer has greater consumer stickiness on the platform and purchases more frequently. As a result, the platform's high-value customers can be characterized as follows: high consumption frequency, short intervals between purchases, a high degree of product specificity, and a fixed set of products that they purchase.

Figure 2: Algorithm flow chart.

Type 2: Development Customers.
[...] it can be concluded that these users have been in contact with the platform for an average length of time, and there is still room to tap their consumption value. The platform should devote itself to converting these development customers into important retention customers.

Type 3: Loyal Customers.
Customer group 3 maintains the longest consumption relationship with the platform among all customers, indicating that this type of customer contacted the platform earlier and maintained a long-term consumption record. Although the consumption amount is not large, they give the platform's cash flow a greater guarantee, being loyal customers of the platform.

Type 4: General Customers.
Customer group 2 does not have a long-term consumer relationship with the platform. Its consumption frequency and total consumption amount are very low, and so is its repurchase of a single product, indicating that these customers have not made considerable purchases on the platform and have not contributed to its profit growth; thus, they are the general customers of the platform.

Experimental Result Verification.
After analyzing the value of the customers, we return to the original order transaction data to verify the T-app customer value analysis result of the improved RFM model. The classification process of the traditional RFM model is similar to that of the model used in this article, that is, data standardization, calculation of weights, weighting, clustering, and value ranking. The F and M indicators use forward standardization and the R indicator uses reverse standardization. According to the value ranking, the customers are again divided into important retention customers, loyal customers, development customers, and general customers. The classification results of the original RFM model are compared with those of the improved RFM model in Tables 5 and 6.
The analysis shows that under the RFM model, 34 important retention customers are identified, whose indicators are all better than those of the remaining customers but who have made no purchase recently. Customer group 3 has a higher consumption frequency and a larger consumption amount, indicating that it has greater consumer stickiness and consists of loyal customers of the platform. All consumption indicators of customer group 4 need to be improved, and these customers are development customers. The consumption amount and consumption frequency of customer group 2 are both low, but its recent purchase time indicator performs well, which suggests that these customers may have been attracted by the platform's recent marketing strategy and have not yet brought considerable profits and cash flow to the platform. This group therefore consists of general customers.
The comparison shows that the original RFM model cannot distinguish new customers from old ones and can only determine a customer's loyalty from consumption frequency and consumption amount, which leaves the result partly to chance. Customers who purchase at high frequency in the short term may stop purchasing once the platform's promotional activities are over, and such customers should not be considered as having high long-term value. In contrast, the improved RFM model makes the differences among the indicators of different user groups directly visible. For example, users who spend heavily on the platform also repurchase goods frequently, which can guide the platform to carry out differentiated marketing, improve the customer experience, and ultimately increase platform profit. The customer value indicators and classification results of some customers under the RFM model and the improved RFM model are taken as examples to illustrate the credibility of the improved RFM model, as shown in Tables 7 and 8.
In Tables 7 and 8, we can see that the user with the ID YH171201000030 has the same classification result under the two models. Analysis with the RFM model shows that this user made purchases worth a total of 529 yuan during the observation period, with a frequency of 24 purchases, and that the most recent purchase (the R index) is relatively close to the end of the period, indicating that the customer has a high probability of purchasing again. At the same time, the RFM model cannot accurately identify whether the user depends heavily on a single product or what the average time interval between orders is, nor can it define loyalty attributes based on the duration of the relationship between the customer and the platform. For this type of important retention customer, the improved RFM model can provide a more detailed description based on T-app user data, which helps the platform conduct one-to-one precision marketing and is conducive to retaining such customers.
For the user with the ID "YH181207000001", under the RFM model we only know that he spent a total of 14 yuan over 2 purchases, but because his latest purchase happened to be close to the end of the observation period, he was classified as a development customer. Under the improved RFM model, it can be seen that the user has been in contact with the platform for only 3 days and has made only two low-value purchases; his potential commercial value needs to be developed further, so he is correctly classified as a general customer. Therefore, the improved RFM model can overcome the randomness of the R indicator in the original RFM model and can accurately locate such new customers.
The user with the ID "YH181018000002" made 8 purchases with a total amount of 138.5 yuan on the platform, and his last purchase was 72 days before the end of the reference period. Because his consumption figures are relatively high within the overall sample, he was classified as a loyal customer by the original RFM model. The improved RFM model likewise shows that his total consumption amount and consumption frequency are considerable. However, he bought a certain product 5 times in a single day and has not contacted the platform since, indicating that he may have been attracted by a promotion activity of the platform, so the improved RFM model correctly classifies him as a general customer, pending further observation.
The experimental results above show that the improved RFM model can provide a more accurate user description than the original RFM model under the same classification. Moreover, it overcomes the randomness of the R indicator in the original RFM model and remedies the original model's shortcoming of describing customer stickiness and loyalty only through consumption amount and frequency. The improved model can better complete the user value evaluation of T-app.

Conclusions
In order to accurately analyze the customer value of the T-app community e-commerce platform, a customer value analysis method based on an improved RFM model is proposed. The model comprises five indicators: the average order transaction time interval, the number of customer transactions in a certain period, the total consumption amount in a certain period, the customer relationship duration, and the repurchase indicator. Forward and reverse standardization methods are used to standardize the indicators, and the entropy method, an objective weighting method from information theory, is used to calculate the weights of the five indicators. The silhouette coefficient is used to determine the best K value, and the K-means++ clustering algorithm is used to cluster the weighted indicators; finally, customers are divided into groups of different value. This paper uses the T-app order transaction data from September 1, 2018, to December 30, 2018, to analyze customer value, and the results show that the improved RFM model can analyze the customer value of T-app more accurately.

Data Availability
All information is within the article.