Improvement of K-Means Algorithm and Its Application in Air Passenger Grouping

The k-means is one of the most popular clustering analysis algorithms and is widely used in various fields. Nevertheless, it still has some shortcomings; for example, it is extremely sensitive to the selection of the initial center points and to special points such as noise or outliers. Therefore, this paper proposes an initial center points selection optimization and a phased assignment optimization to improve the k-means algorithm. The experimental results on 15 real-world and 10 synthetic datasets show that the improved k-means outperforms its main competitor k-means++ and that, under the same setting conditions, namely using the default parameters, its clustering performance is better than that of affinity propagation, mean shift, and DBSCAN. The proposed algorithm was applied to analyze airline seat selection data for air passenger grouping. The clustering results, together with an absolute deviation rate analysis, realized customer grouping and identified a suitable audience group for the recommendation of seat selection services.


Introduction
Clustering divides a dataset into nonoverlapping subsets such that the objects within a cluster are as similar as possible and the objects between clusters are as dissimilar as possible [1]. There are numerous kinds of clustering algorithms, such as AP [2] and DPC [3][4][5][6], which show excellent clustering performance. However, the k-means, one of the most classic clustering algorithms, which aims to partition the given dataset into K subsets so as to minimize the within-cluster sum of squared distances, continues to be one of the most popular clustering algorithms [7]. Its efficiency and simplicity of implementation have made it successful in various fields, such as image processing [8,9], education [10], bioinformatics [11], medicine [12], partial multiview data [13], agricultural data [14], and fuzzy decision-making [15].
Optimizing the initial center points may be one of the most effective ways to improve the performance of the k-means algorithm. The study of Fränti and Sieranoja [16] reported that (a) the k-means clustering algorithm can be significantly improved by using a better initialization technique and by repeating (restarting) the algorithm; (b) when the data have overlapping clusters, k-means can improve the results of the initialization technique; (c) when the data have well-separated clusters, the performance of k-means depends completely on the goodness of the initialization; and (d) initialization using the simple furthest-point heuristic (Maxmin) reduces the clustering error of k-means from 15% to 6% on average. With the popularity of deep learning in various fields, optimizing the data representation is also a means to improve clustering performance, especially for high-dimensional data. The robust deep k-means (RDKM) algorithm [17] exploits the hierarchical information of multiple-level attributes by using a deep structure to perform k-means hierarchically. The k-means++ [18] provided a simple and effective initial center points optimization method called D²-sampling. It adds new center points one by one and assigns a different selection probability to each potential center point. Since then, especially after being embedded in scikit-learn as the default k-means initialization, it has almost become the first choice among partitioning-based clustering algorithms. However, because k-means++ selects the first center point uniformly at random and adds subsequent center points randomly according to the probability, some special data distributions can lead k-means++ to poor, even unreasonable, clustering results. For example, a dataset with five clusters is synthesized and some noise points surrounding them in a half circle are added. The clustering result of k-means++ is shown in Figure 1, where each color represents a cluster.
The desired clustering result is that the points in the upper left corner are divided into five clusters, but the actual result is that the points in the lower part (the green points) are clustered into a single cluster, which is wrong. In this paper, some methods are proposed to solve this problem.
Cluster analysis is one of the basic methods of data knowledge discovery. With the development of the airline business, ancillary services that satisfy passengers' personal requirements are becoming more and more important for airlines [19,20]. Moreover, owing to the impact of COVID-19, the airline market faced a dramatic regression (2019-2021), compelling airlines to seek revenue beyond flight tickets [21,22]. Therefore, establishing ancillary services is significantly important for airlines owing to their ability to increase revenue. In this paper, the improved k-means algorithm is applied to cluster analysis of an airline seat selection dataset, which aims to group airline passengers to serve the establishment of ancillary services.
Based on the above analysis and application requirements, this paper proposes an improved k-means algorithm, called k-means2o, based on initial center points selection optimization and phased assignment optimization, and realizes a clustering analysis of an airline seat selection dataset. The main contributions are summarized as follows: (1) Two optimization methods are proposed for the k-means algorithm: initial center points selection and phased assignment. The initial center points selection optimization inherits the incremental center point strategy of k-means++ [18], K-MC² [23], and AFK-MC² [24] but redefines the first center point selection strategy and the subsequent center point incremental strategy. The phased assignment optimization adopts Tukey's rule to divide the dataset into core and noncore sets to realize a two-stage assignment, and two assignment strategies are proposed for the core and noncore sets, respectively. (2) Four popular algorithms, k-means++ [18], affinity propagation [2], mean shift [25], and DBSCAN [26], are used to verify the effectiveness and performance improvement of k-means2o on 15 real-world and 10 synthetic datasets. Further, the impact of the core and noncore sets on the clustering result is analyzed. (3) The improved k-means algorithm is applied to an airline seat selection dataset, and the passenger groups more willing to pay for seat selection are identified. The absolute deviation rate adr is defined to analyze the significance of the passenger grouping. This provides valuable information for ancillary services.

Related Works
There are many possible ways to optimize the initial center points. The k-means++ [18] provided the D²-sampling method, which assigns a different selection probability to each potential center point. Bachem et al. [23] replaced the D²-sampling in k-means++ with MCMC sampling and obtained a nearly linear improved k-means algorithm, K-MC². However, this algorithm defines two data-dependent hypotheses α(X), β(X), which have an important impact on the clustering result and the algorithm complexity. Subsequently, Bachem et al. [24] fixed the hypothesis defect of the K-MC² algorithm by extending the D²-sampling of k-means++ with a regularization term. This new algorithm is called AFK-MC². Both K-MC² and AFK-MC² follow the first center point selection strategy of the k-means++ algorithm, namely sampling an initial center uniformly at random. They also share a similar center point selection method: a point farther from the currently selected center points has a greater probability of being chosen as the next center point. For more information on initial center point optimization methods, please consult the literature [27].
Phased assignment, generally speaking, divides the cluster label assignment into different stages, or assigns cluster labels to only part of the data, removing the remaining part as outliers, noise, etc. Zhou et al. [28] proposed a three-stage k-means algorithm to cluster data and detect outliers. In the first stage, the fuzzy c-means algorithm is applied to cluster the data. In the second stage, local outliers are identified and the cluster centers are recalculated. In the third stage, certain clusters are merged and global outliers are identified. Im et al. [29] proposed the NK-means algorithm, a two-stage k-means algorithm that emphasizes the removal of noise/outliers. In the first stage, a greedy algorithm is utilized to remove abnormal points. In the second stage, the center points are optimized on the constructed core set, and a cluster label is assigned to each point. In terms of preprocessing techniques, k-means++ has been utilized as an additional filtering step to remove z data points as outliers before applying the conventional k-means. The clustering process is then performed only on the remaining, outlier-free data; the outlier data are completely removed and not assigned to any cluster. The KMOR algorithm proposed by Gan and Ng [30] assigns outliers to an additional cluster. This algorithm redefines the clustering objective function and takes into account the SSE between outliers and center points; however, it introduces two new parameters to adjust the number of outliers. The k-means-sharp algorithm was proposed by Olukanmi et al. [31] to eliminate the outliers' influence on the cluster centroids. The detected outliers are excluded from the mean computation only but are still involved later in the clustering process. However, a data point is eliminated from the centroid computation completely, with all its attributes, so the algorithm cannot recognize an outlier's presence in each attribute independently. This is because the single value of the distance metric represents the entire vector rather than a single attribute to be removed; therefore, an empty cluster may occur when every data point contains at least one outlying attribute [32]. Phased assignment is not only used to optimize the k-means algorithm. For example, Yu et al. [33] adopted a two-stage assignment strategy based on boundary conditions to optimize the DPC clustering algorithm. For a dataset to be clustered, in many cases users do not care whether it contains outliers, because outliers themselves are difficult to define, but they definitely want cluster labels assigned to them. Wang et al. [34] proposed an improved integrated clustering learning strategy based on a three-stage affinity propagation algorithm with density peak optimization theory (DPKT-AP). In the first stage, the cluster center points are selected by density peak clustering. In the second stage, the k-means algorithm is used to cluster the data samples. In the third stage, DPKT-AP uses the AP algorithm to merge and cluster the spherical subgroups.

Proposed K-Means Algorithm
Suppose a given dataset X = {x_1, x_2, ..., x_n}, x_i ∈ R^m, and divide it into K mutually disjoint sets C = {C_1, ..., C_K}, so that ∪_{i=1}^{K} C_i = X and C_i ∩ C_j = ∅, ∀i, j, i ≠ j.

Initial Center Points Optimization.
Like the k-means++ algorithm, k-means2o adopts a strategy of adding center points one by one until the desired K points are reached. The difference is that the new algorithm redefines the selection of the first center point and of the subsequent center points. For this purpose, first define the distance function d(x, S) between a point x and a set S:

d(x, S) = min_{x_j ∈ S} d(x, x_j), (1)

where d(x, x_j) represents the distance between two points x and x_j; in this paper, the Euclidean distance is used. Let c_i, i = 1, ..., K, represent the center point of cluster C_i, i = 1, ..., K. The first center point c_1 is selected as follows:

c_1 = (1 / |S_core|) Σ_{x ∈ S_core} x, (2)

where |S_core| represents the number of elements in the core set S_core. Equation (2) shows that c_1 is the mean value of the core set S_core. Let C_k = {c_1, ..., c_k} represent the set containing the first k center points; then the (k+1)-th center point c_{k+1} is selected as follows:

c_{k+1} = argmax_{x ∈ S_core} d(x, C_k), (3)

and C_{k+1} = C_k ∪ {c_{k+1}}. Equation (3) shows that c_{k+1} is the point of the core set S_core farthest from the already selected center points. The whole process above is shown in Figure 2.
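As a concrete illustration, the selection strategy described above (mean of the core set as the first center, then farthest-point increments) can be sketched in Python. This is a minimal sketch, assuming S_core is given as a NumPy array of core points and distances are Euclidean; the helper name init_centers is ours, not from the paper.

```python
import numpy as np

def init_centers(S_core, K):
    """c1 = mean of S_core; c_{k+1} = point of S_core farthest from C_k."""
    centers = [S_core.mean(axis=0)]                  # first center: core mean
    for _ in range(K - 1):
        C = np.vstack(centers)
        # d(x, C_k): distance from each core point to its nearest chosen center
        d = np.min(np.linalg.norm(S_core[:, None, :] - C[None, :, :], axis=2),
                   axis=1)
        centers.append(S_core[np.argmax(d)])         # farthest core point
    return np.vstack(centers)

pts = np.array([[0., 0.], [0., 1.], [1., 0.], [10., 10.]])
print(init_centers(pts, 2))  # first row is the mean [2.75, 2.75]
```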

Phased Assignment.
The k-means2o completes the clustering in two stages. The first stage assigns cluster labels to the core set S_core, and the second stage assigns cluster labels to the noncore set S_noncore. Tukey's rule is adopted to divide the dataset X into the sets S_core and S_noncore; it is one of the most robust and widely used techniques for anomaly detection in univariate data [35].
In the first stage, k-means2o applies Tukey's rule to each attribute of the data and then integrates the judgment results over all dimensions to determine whether a sample point x belongs to the core set S_core.
First, calculate the first quartile Q_1^j and the third quartile Q_3^j on each attribute j:

Q_1^j, Q_3^j = first and third quartiles of {x^j : x ∈ X}, j = 1, ..., m. (4)

Then, calculate the upper and lower bounds B_upper^j, B_lower^j as follows:

B_upper^j = Q_3^j + r · IQR^j, B_lower^j = Q_1^j − r · IQR^j, (5)

where IQR^j = Q_3^j − Q_1^j and r is a scale factor. Finally, calculate the core set S_core and the noncore set S_noncore as follows:

S_core = {x ∈ X : B_lower^j ≤ x^j ≤ B_upper^j, ∀j = 1, ..., m}, S_noncore = X \ S_core. (6)

Equation (6) shows that each attribute of the data is evaluated individually, and then all m attributes are integrated to determine whether a point belongs to the core set S_core. As long as any attribute does not satisfy the inequality constraints, the point is judged to belong to S_noncore. According to equations (3) and (6), if the farthest-point rule of equation (3) were applied to the whole dataset X rather than to S_core, c_2 would almost always be a point of the noncore set S_noncore, and c_i, i > 2, would also fall in S_noncore with high probability; restricting the selection to S_core avoids choosing such suspect points as centers. The scale factor r in equation (5) is a predefined adjustable parameter. With sufficient prior knowledge of the dataset, it can be set from experience; otherwise, r = 1.5 is recommended. Although in the field of anomaly detection r = 1.5 is often regarded as the boundary value for outliers, in cluster analysis the points in S_noncore cannot be regarded as outliers and discarded; they still need to be assigned cluster labels. Whether a point lies in S_core or in S_noncore, the final clustering result must assign it a cluster label, which is one of the goals of cluster analysis. On the 15 real datasets in this paper, every sample has an exact class label, yet the S_noncore of almost all datasets is not empty. Constructing S_core helps to obtain better initial center points: S_core not only effectively assists the selection of the initial center points but also has a positive effect on the update of the center points.
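The per-attribute Tukey split described above can be sketched compactly in Python. This is a minimal sketch, assuming quartiles are computed with NumPy's default interpolation; the helper name tukey_split is ours.

```python
import numpy as np

def tukey_split(X, r=1.5):
    """Split X into core / noncore sets via per-attribute Tukey fences."""
    Q1 = np.percentile(X, 25, axis=0)
    Q3 = np.percentile(X, 75, axis=0)
    IQR = Q3 - Q1
    lower, upper = Q1 - r * IQR, Q3 + r * IQR     # fences per attribute
    # a point is core only if every attribute lies within its fences
    core_mask = np.all((X >= lower) & (X <= upper), axis=1)
    return X[core_mask], X[~core_mask], core_mask

X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.], [100., 100.]])
S_core, S_noncore, mask = tukey_split(X)
print(len(S_core), len(S_noncore))  # 4 1
```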
When S_core is obtained, the initial center points selection method described in Section 3.1 is used to select the initial center points set C_K from S_core, and then the traditional center point update method of k-means completes the clustering within S_core, yielding the optimal cluster center points set C_K and the clusters C_1, ..., C_K. The first stage of clustering then ends.
In the second stage, the points in S_noncore are assigned cluster labels. With the help of the optimal clusters C_1, ..., C_K obtained in the first stage, the cluster label of ∀x_i ∈ S_noncore is determined as follows:

label(x_i) = argmin_{k ∈ {1, ..., K}} d(x_i, C_k), (7)

where d(·, ·) is the distance function of equation (1) and C_k is the k-th cluster. The whole process above is shown in Figure 3.
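The second-stage rule can be sketched as follows: because the distance from a point to a cluster is the distance to the cluster's nearest member, each noncore point simply inherits the label of its nearest core point. This is a minimal sketch under that assumption; the helper name assign_noncore is ours.

```python
import numpy as np

def assign_noncore(S_core, core_labels, S_noncore):
    """Give each noncore point the label of its nearest core point."""
    labels = np.empty(len(S_noncore), dtype=int)
    for i, x in enumerate(S_noncore):
        nearest = np.argmin(np.linalg.norm(S_core - x, axis=1))
        labels[i] = core_labels[nearest]
    return labels

S_core = np.array([[0., 0.], [10., 10.]])
core_labels = np.array([0, 1])
print(assign_noncore(S_core, core_labels, np.array([[1., 1.], [9., 9.]])))
```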

Algorithm Flow and Complexity Analysis.

The k-means2o algorithm performs both the initial center points selection optimization and the phased assignment. Algorithm 1 shows the detailed process. Steps 1-15 correspond to the first stage: Step 1 determines S_core and S_noncore, and Steps 2-4 optimize the initial center points. Steps 16-19 correspond to the second stage.
According to the detailed steps in Algorithm 1, the complexity of the k-means2o algorithm is analyzed in terms of the data size n, the attribute number m, and the cluster number K. The number of iterations is denoted by t, with maximum value max_iter. Step 1 generates S_core and S_noncore in O(nm). Steps 2-5 select the initial center points in O(nK). Steps 6-13 are a traditional k-means clustering process; however, Step 8 uses a new label assignment strategy, so the complexity of these steps becomes O(n²t). In summary, the complexity of the k-means2o algorithm is O(n²t).
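Putting the stages together, the overall flow can be sketched end to end in Python. This is a minimal sketch under the assumptions stated in the text (Euclidean distance, Tukey split, mean-of-core first center, farthest-point increments, Lloyd updates on the core set, then nearest-core-point assignment for noncore points); it does not handle empty clusters, and the function name kmeans2o is ours.

```python
import numpy as np

def kmeans2o(X, K, r=1.5, max_iter=100):
    # Step 1: Tukey's rule per attribute
    Q1, Q3 = np.percentile(X, [25, 75], axis=0)
    lo, hi = Q1 - r * (Q3 - Q1), Q3 + r * (Q3 - Q1)
    core = np.all((X >= lo) & (X <= hi), axis=1)
    S_core, S_non = X[core], X[~core]
    # Steps 2-5: initial centers (core mean, then farthest core points)
    centers = [S_core.mean(axis=0)]
    for _ in range(K - 1):
        C = np.array(centers)
        d = np.min(np.linalg.norm(S_core[:, None] - C[None], axis=2), axis=1)
        centers.append(S_core[np.argmax(d)])
    C = np.array(centers)
    # Steps 6-14: Lloyd iterations on the core set until SSE stabilizes
    prev_sse = None
    for _ in range(max_iter):
        labels = np.argmin(np.linalg.norm(S_core[:, None] - C[None], axis=2),
                           axis=1)
        C = np.array([S_core[labels == k].mean(axis=0) for k in range(K)])
        sse = sum(((S_core[labels == k] - C[k]) ** 2).sum() for k in range(K))
        if prev_sse is not None and np.isclose(sse, prev_sse):
            break
        prev_sse = sse
    # Steps 16-18: noncore points inherit the label of their nearest core point
    if len(S_non):
        non_labels = np.array([labels[np.argmin(np.linalg.norm(S_core - x,
                                                               axis=1))]
                               for x in S_non])
    else:
        non_labels = np.array([], dtype=int)
    return C, labels, non_labels, core
```

Run on two well-separated blobs plus one distant point, the distant point is routed to S_noncore and still receives the label of the nearby blob in the second stage.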

Performance Analysis of the Proposed Algorithm
In this section, the improved k-means algorithm, k-means2o, is tested and its clustering performance verified against the well-known k-means++ [18], the most commonly used partition-based algorithm, which runs different initializations of the centroids to reduce sensitivity. Then, the performance of k-means2o is compared with affinity propagation (AP) [2], mean shift (MS) [25], and DBSCAN [26]. Although the latter algorithms obtain excellent clustering performance on some special datasets, they require presetting one or more important parameters, which is a very difficult task. The k-means2o is implemented in Python, and k-means++, AP, MS, and DBSCAN are called from scikit-learn [36].

Datasets and Evaluation Metrics.
A total of 15 real-world datasets used in the experiments were taken from UCI [37]. The data size n, attribute number m, and cluster number K are summarized in Table 1, and Table 2 shows 10 synthetic datasets from references [38,39], where the K1 dataset was synthesized for this paper; see Figure 1. All datasets are publicly available.
An appropriate and uniform evaluation index is both required and meaningful to compare the different clustering algorithms.
Therefore, the quality was measured via the accuracy (ACC), the adjusted Rand index (ARI) [40], the normalized mutual information (NMI) [41], and the Fowlkes-Mallows index (FMI) [42] between the produced clusters and the ground-truth categories. Larger evaluation index values indicate better clustering performance, and every index has upper bound 1, representing perfectly correct clustering:

ARI(U, V) = (RI − E[RI]) / (max(RI) − E[RI]),

NMI(U, V) = MI(U, V) / sqrt(H(U) · H(V)),

FMI(U, V) = TP / sqrt((TP + FP)(TP + FN)),

where U, V are the predicted and true label assignments, and ACC is the fraction of correctly labeled points under the best one-to-one mapping between predicted clusters and true classes.
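All four metrics can be computed with standard tooling. This is a sketch assuming integer labels 0..k−1: ARI, NMI, and FMI come directly from scikit-learn, while ACC needs the best cluster-to-class matching, obtained here with the Hungarian method; the helper name clustering_accuracy is ours.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             fowlkes_mallows_score)

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one mapping of predicted clusters to true classes."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[p, t] += 1                       # co-occurrence counts
    row, col = linear_sum_assignment(-cost)   # maximize matched points
    return cost[row, col].sum() / len(y_true)

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [1, 1, 1, 0, 0, 0]  # same partition, permuted labels
print(clustering_accuracy(y_true, y_pred))           # 1.0
print(adjusted_rand_score(y_true, y_pred))           # 1.0
print(normalized_mutual_info_score(y_true, y_pred))  # 1.0
print(fowlkes_mallows_score(y_true, y_pred))         # 1.0
```

Note that ACC requires the label permutation step; comparing raw labels directly would wrongly report 0 accuracy for a perfect but relabeled partition.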

Experimental Results and Discussion.

The experimental datasets were clustered using k-means++ and k-means2o. Their ACC, ARI, NMI, and FMI values are listed in Tables 3 and 4, where k-++ denotes k-means++ and k-2o denotes k-means2o. The best clustering performance values are shown in bold, and 1 means the clustering result is completely correct. The value 0.0000 in the tables represents a real metric value < 0.0001.
From Table 3, k-means++ and k-means2o simultaneously obtained the maximum FMI value for 8 of the 15 datasets. This suggests the two algorithms have similar performance on those datasets, and further performance comparison and analysis are required.

Input: dataset X, cluster number K, scale factor r
Output: clustering results C = {C_1, ..., C_K}, center points set C_K, sum of squared error SSE
(1) Using (6), divide dataset X into S_core, S_noncore
(2) Using (2), generate c_1
(3) for i = 2 to K do
(4)     Using (3), generate c_i
(5) end for
(6) for j = 1 to max_iter do
(7)     for ∀x ∈ S_core do
(8)         According to the principle of the nearest distance between x and C_K, classify x into the corresponding cluster
(9)     end for
(10)    if SSE does not change then
(11)        break
(12)    end if
(13) end for
(14) Update the center points set C_K and compute SSE
(15) Compute the optimal center points C_K
(16) for ∀x ∈ S_noncore do
(17)    According to the principle of the nearest distance between x and S_core, classify x into the corresponding cluster
(18) end for
(19) Compute SSE
(20) return clustering results C = {C_1, ..., C_K}, center points set C_K, sum of squared error SSE
ALGORITHM 1: k-means2o.

From Table 3, the most significant and direct conclusion is that k-means2o outperforms k-means++ on most datasets, and on the few datasets where it is inferior to k-means++ the performance of the two algorithms is still very close. Specifically, k-means2o achieved the maximum ARI value for 10 of the 15 datasets, and the same holds for NMI, whereas k-means++ achieved the best clustering performance on only 6 datasets in ARI, and likewise in NMI. For the banknote, iris, and wine datasets, k-means2o is inferior to k-means++ only by a small gap. The ACC evaluation reaches exactly the same conclusion as NMI and ARI, namely that the clustering performance of k-means2o is better than that of k-means++.
For the synthetic datasets in Table 2, the four evaluation metrics in Table 4 show that k-means++ and k-means2o have similar clustering performance. For datasets with a spherical cluster distribution, such as D31, R15, S1, and S3, the clustering results of both algorithms are close to the real cluster partition, while for datasets with a nonspherical distribution, such as spiral, flame, and circlesA3, their clustering performance drops sharply. When the sizes of the distribution areas of the spherical clusters differ significantly, the performance difference between k-means++ and k-means2o is revealed. For example, on the aggregation dataset, the two algorithms' clustering results are shown in Figure 4. The evaluation values of ARI, NMI, and FMI all show that k-means++ is better than k-means2o, but ACC gives the opposite conclusion. Figure 4(a) shows that k-means++ selects seven center points in only six real clusters, so two different clusters (the green points in the figure) are wrongly merged into one, whereas Figure 4(b) shows that k-means2o selects one center point in each of the seven real clusters.
Further, the performance of k-means2o is compared with AP, MS, and DBSCAN. The ARI and NMI of these algorithms are listed in Table 5, and the ACC and FMI in Table 6. Values larger than those of k-means2o are marked in bold. The three comparison algorithms all use default parameters. For better performance, the data are normalized here. From the perspective of ARI values, compared with AP, MS, and DBSCAN, k-means2o obtained better clustering performance on 12, 14, and 13 datasets, respectively. The NMI evaluation results are similar to ARI, except for the AP algorithm: AP's NMI and ARI results differ greatly, which may be tied to the erroneous number of clusters given by the AP algorithm. The ACC evaluation conclusion is consistent with ARI, but FMI and NMI reach opposite conclusions. For the MS algorithm, its FMI value is better than that of k-means2o on 9 of the 15 datasets, while for the AP algorithm, its FMI value is smaller than that of k-means2o on all datasets. Based on the four evaluation metrics, the k-means2o algorithm is superior to the comparison methods in at least three of the metrics on most datasets. Therefore, k-means2o has better overall clustering performance.
As for the abnormal conclusions given by a certain evaluation metric for a specific algorithm, for example, the NMI metric for the AP algorithm or the FMI metric for the MS algorithm, they may be caused by too many or too few clusters. Table 7 shows that the AP and MS algorithms give the wrong number of clusters on every dataset: the former far exceeds the true number of clusters, while the latter divides more than half of the datasets into a single cluster. Undeniably, the AP, MS, and DBSCAN algorithms provide a method to identify the number of clusters. If the parameters of the AP algorithm, the damping factor and the preference value, were carefully adjusted, it might achieve better clustering performance on these real-world datasets. In clustering algorithms that contain such parameters, careful parameter selection is often time-consuming and requires prior knowledge; therefore, these algorithms have poor universality. The performance of all five algorithms can be compared directly in Figure 5. In this radar chart, each axis represents a dataset, and its value is the ARI of the clustering result. Consistent with the previous analysis, k-means2o has the best performance: its corresponding red line in the radar chart reaches the maximum value on more polar axes, that is, lies farther away from the center point.

Comparative Analysis of Different Initialization Methods.
In this subsection, the effects of three different initialization methods on the performance of the k-means clustering algorithm are compared. These three methods are denoted Random, D²-sampling, and New, respectively; see the header of Table 8. Random means randomly initializing the center points. D²-sampling means assigning a selection probability to each noncenter point and randomly selecting the center points; in fact, the k-means algorithm based on D²-sampling is the famous k-means++ algorithm. New means the center point initialization optimization method proposed in this paper. The initial center points optimization plays an important role in the performance improvement of k-means2o. However, Table 8 shows that using only the initialization method proposed in this paper cannot improve the clustering performance. By the ARI evaluation, the best initialization method is D²-sampling, followed by Random, and the worst is New, the initialization method proposed in this paper. Except for tiny numerical differences on individual datasets, the NMI evaluation leads to similar conclusions. Combined with the performance improvement of k-means2o, it is the combination of initial center point optimization and phased assignment that improves k-means2o, not the center point optimization alone.

Table 7: Number of clusters obtained by each algorithm versus the true cluster number K.

Dataset      AP    MS   DBSCAN   True K
-cancer      43    12     2        2
Bupa         32    14     2        2
Ct           20     7     2        2
Iris          9     2     1        3
Parkinsons   21     5     2        2
Vowel        85     1     1       11
Waveform40  157     1     1        3
Wine         14     1     3        3
Banknote     45     1     1        2
Compound     15     3     1        6
Hayes-roth   16     1     4        3
Libras       30     1     6       15
Penbased    199     1     7       10
Waveform21  148     1     2        3
Wdbc         43    12     2        2

Impact Analysis of Core and Noncore Sets.
This paper uses Tukey's rule to realize the division of S_core and S_noncore; therefore, a scale factor r needs to be given. Tukey's rule comes from the field of anomaly detection, where the scale factor is generally set to 1.5 and points that do not satisfy the scale-factor conditions are called outliers. In most cases, these points are simply discarded. When this idea is carried into cluster analysis and used in the data preprocessing stage, the points detected as abnormal are discarded and never assigned a cluster label.
This practice carries a serious hidden risk. Table 9 shows the number of elements in S_core and S_noncore for the 15 real-world datasets when r = 1.5. Except for the compound dataset, whose S_noncore is empty, the S_noncore of the remaining 14 datasets are not empty. However, as is well known, all points in these datasets carry class labels; therefore, it is unreasonable to simply and crudely abandon these suspected outliers. For this reason, this paper proposes a two-stage assignment method, whose first stage assigns cluster labels to the points in S_core and whose second stage assigns labels to the points in S_noncore. For the compound dataset, the empty S_noncore indicates that Tukey's rule has no effect on this dataset and directly leads to the failure of the second-stage assignment. The k-means2o algorithm relies on a predefined scale factor r, so a sensitivity test of this parameter is necessary. We therefore took the iris, wine, breast_cancer, banknote, and bupa datasets as examples to investigate the effect of different r on ARI and NMI, as shown in Figure 6. It shows that the ARI and NMI curves of the five datasets do not fluctuate drastically, so the clustering performance of the k-means2o algorithm is relatively robust to the scale factor r. Nevertheless, r still has a slight impact on the clustering performance. For example, on the iris dataset, when r = 0.5, its ARI and NMI values reach their maximum; this clustering result is better than that of k-means++, see Table 3 (ARI = 0.7302 and NMI = 0.7581).
In the above analysis, k-means2o outperforms k-means++, AP, MS, and DBSCAN. Combined with the fact that almost all the S_noncore of the datasets in Table 9 are nonempty, these results show that the combined optimization of the initial center points and the core subset works and improves the k-means clustering performance.

The Application of K-Means2o
In this section, k-means2o is applied to cluster analysis of the airline seat selection dataset provided by Neusoft. According to the meaning of clustering, samples in the same cluster are as similar as possible, and samples in different clusters are as dissimilar as possible. If most samples in a cluster have a certain property, it can be inferred that the other samples in the same cluster are also very likely to have that property. If most passengers in a cluster are willing to accept a personalized recommendation service, such as paying for seat selection, the same service should be recommended to the other passengers in the cluster, and a clearer audience group will increase the success rate of the personalized recommendation service. For the airline seat selection dataset, the appropriate number of clusters must be determined first. The silhouette coefficient is a simple and effective method to determine the appropriate number of clusters for the k-means algorithm. The silhouette coefficient of the k-means2o algorithm on this dataset is shown in Figure 7. The figure shows that the curve tends to flatten from 16 clusters onward; therefore, the optimal number of clusters is selected as 16. Then, k-means2o is applied and divides the data into 16 clusters. The number of passengers in each cluster is shown in the column named size in Table 10. The 3rd, 4th, and 5th columns of Table 10 (payment, no-payment, payment ratio) show, respectively, the number of paying passengers, the number of nonpaying passengers, and the proportion of paying passengers in the airline seat selection. The absolute deviation rate adr in the last column is defined as follows:

adr_c = |r_c − r| / r, (8)

where r_c is the payment rate in cluster c and r is the payment rate in the whole dataset. The larger the adr value, the more significant the difference between the payment behavior of passengers in the cluster and that of the whole dataset.
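The absolute deviation rate is a one-line computation. A minimal sketch, using the C13 figures quoted later in the text (cluster payment rate 2.11% versus overall rate 6.29%) to illustrate the reverse deviation:

```python
def adr(rate_cluster, rate_overall):
    """Absolute deviation rate: |r_c - r| / r."""
    return abs(rate_cluster - rate_overall) / rate_overall

# C13: 2.11% cluster payment rate vs 6.29% overall payment rate
print(round(adr(0.0211, 0.0629), 4))  # 0.6645, i.e. 66.45%
```

Note that adr measures only the magnitude of the deviation; whether the cluster pays more or less than average must be read from the sign of r_c − r.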
The clustering results show that the numbers of passengers in the clusters are far from equal. The cluster with the most passengers is C0, with 2580, while the smallest is C13, with 379.
Further, the significant differences between clusters are explored. Figure 8 shows the kernel density estimation curves of three attributes: pax_fcny, pax_tax, and recent_gap_day. On the whole, the curves of the clusters do not completely coincide and show significant differences, which indicates that the data distribution of each cluster is different. This conclusion is consistent with the expectation of cluster analysis, namely that samples in different clusters are as dissimilar as possible. From the single-attribute point of view, the discrimination of the pax_fcny attribute is the most significant, with different mean points, peak points, and data spans, followed by the pax_tax attribute. The third is the recent_gap_day attribute: its mean and span are very similar across clusters, but the peak points still differ. The difference in peak points indicates differences in the concentration of the data distribution within the clusters; the larger the peak value, the more points are distributed near the mean value. Table 10 presents the k-means2o clustering results of the airline seat selection dataset from the perspectives of similarity within clusters and dissimilarity between clusters. The clustering results are a good reference basis for customer grouping: air passenger grouping enables decision-makers to find the audience of a personalized recommendation service, such as payment for airline seat selection, more accurately. The dataset provides the label of payment for airline seat selection. The adr value of each cluster is greater than 12%, significantly different from the payment rate of 6.29% over the entire dataset. The cluster with the largest adr value is C13, reaching 66.45%, and the one with the smallest adr value is C5, at 12.56%. These results show that passenger payment behavior within clusters is more agglomerated than in the entire dataset. Since the payment rate of C13 is only 2.11%, its deviation is in the reverse direction.
In other words, the adr = 66.45% indicates that passengers in C13 are extremely unwilling to pay for seat selection, and their willingness to pay is significantly lower than the overall level. In 9 of the 16 clusters, the ratio of paying for airline seat selection exceeds 5%. Under a precise recommendation or personalized marketing strategy, enterprises should pay more attention to the passengers in these nine clusters, as marketing to them is more likely to succeed. Compared with the passengers in other clusters, the ones in these clusters will be more willing to accept such recommendations, which enhances their stickiness. When designing a recommendation system, this clustering result can serve as good auxiliary prior information.
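The adr figures quoted above can be reproduced from the per-cluster payment rates. The following is a minimal sketch, assuming adr is defined as the absolute deviation of a cluster's payment rate from the overall payment rate, expressed relative to the overall rate; this definition is consistent with the C13 numbers given above (the cluster rate of 7.08% in the second call is a hypothetical value for illustration, not taken from the paper):

```python
def adr(cluster_rate: float, overall_rate: float) -> float:
    """Absolute deviation rate (%): relative deviation of a cluster's
    payment rate from the overall payment rate. The sign is dropped,
    so both above-average and below-average clusters get a positive adr."""
    return abs(cluster_rate - overall_rate) / overall_rate * 100.0

overall = 6.29  # payment rate of the entire dataset (%)

# C13 pays at only 2.11%, a "reverse" deviation below the overall level
print(round(adr(2.11, overall), 2))  # 66.45

# A hypothetical cluster paying slightly above the overall rate
print(round(adr(7.08, overall), 2))  # 12.56
```

Because adr is an absolute value, a large adr alone does not tell whether a cluster pays more or less than average; as the C13 case shows, the direction of the deviation must be checked separately before targeting a cluster for recommendation.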

Conclusion
In this paper, two optimization methods for k-means, initial center points selection and phased assignment, were proposed, yielding the improved k-means algorithm, k-means2o. In contrast to the previously introduced algorithms, k-means++, K-MC², and AFK-MC², the new initial center points selection optimization redefines the first center point selection strategy and the subsequent center point incremental strategy. The phased assignment optimization adopts Tukey's rule to divide the dataset into core and noncore sets, and two assignment strategies were proposed for the core and noncore sets, respectively. These two optimization methods complement each other to form a combinatorial optimization. The experimental results on 15 real-world and 10 synthetic datasets show that k-means2o outperforms its main competitor k-means++ and, under the same setting conditions, namely using the default parameters, its clustering performance is better than that of Affinity Propagation, Mean Shift, and DBSCAN. The improved algorithm, k-means2o, was applied to analyze the airline seat selection dataset. Combined with the data label of paying for seat selection, the clustering results realize customer grouping and find a suitable audience group for the recommendation of seat selection services.
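The core/noncore division via Tukey's rule can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: it applies the standard Tukey fences Q1 − 1.5·IQR and Q3 + 1.5·IQR per feature and keeps a sample in the core set only if it lies inside the fences on every feature (how the rule is combined across attributes is an assumption here):

```python
import numpy as np

def tukey_core_split(X: np.ndarray):
    """Split samples into a core set (inside Tukey's fences on every
    feature) and a noncore set (outside the fences on some feature)."""
    q1 = np.percentile(X, 25, axis=0)
    q3 = np.percentile(X, 75, axis=0)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    core_mask = np.all((X >= lower) & (X <= upper), axis=1)
    return X[core_mask], X[~core_mask]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),      # dense cluster
               rng.uniform(-10, 10, (10, 2))])  # scattered noise
core, noncore = tukey_core_split(X)
print(len(core), len(noncore))  # most noise points land in the noncore set
```

Running the phased assignment on the core set first keeps the cluster centers from being dragged toward noise and outliers, which are then assigned in the second phase.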
Through the analysis of the newly defined absolute deviation rate (adr) index, the appropriate groups for service recommendation are found, and the groups unsuitable for recommendation are distinguished. Therefore, airline enterprises can use limited resources to promote to the groups with high payment willingness and improve the success rate, while avoiding promoting seat selection services to the groups with low payment willingness, which not only wastes resources but also causes passengers' disgust.
After extensive experimental tests, the k-means2o algorithm, like other algorithms, cannot be adapted to all fields and situations, such as high-dimensional sparse data. If the data have a huge number of attributes or high dimensionality, the core set S_core easily contains too few samples, and in extreme cases it may hold fewer samples than the number of clusters. The Olivetti Faces image data with 112 × 92 = 10304 dimensions were tested, and it was found that |S_core| < 40, that is, the number of samples in the core set is less than the number of clusters; therefore, the clustering fails. Due to the division into core and noncore sets, the k-means2o algorithm is not suitable for data with a huge number of attributes or high dimensionality. We will continue to study this problem and hope to solve it in the future.
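The shrinking-core effect described above can be illustrated numerically: even for well-behaved Gaussian data, the probability that a sample stays inside Tukey's fences on all features simultaneously decays with the number of features, so |S_core| collapses for very high-dimensional data. A small simulation under the per-feature-fences assumption used throughout (the sample sizes and dimensions below are illustrative, not taken from the paper's experiments):

```python
import numpy as np

def core_size(X: np.ndarray) -> int:
    """Number of samples inside Tukey's fences on every feature."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (X >= q1 - 1.5 * iqr) & (X <= q3 + 1.5 * iqr)
    return int(np.all(inside, axis=1).sum())

rng = np.random.default_rng(42)
n = 400
sizes = {}
for d in (2, 50, 500, 5000):
    X = rng.standard_normal((n, d))
    sizes[d] = core_size(X)
    print(d, sizes[d])  # core set shrinks as d grows
```

For standard normal data, each feature keeps a sample inside its fences with probability around 0.99, so the chance of surviving all d fences is roughly 0.99^d, which is essentially zero by d = 5000; this matches the observed failure on the 10304-dimensional face images.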

Conflicts of Interest
The authors declare that they have no conflicts of interest.