An Application of a Three-Stage XGBoost-Based Model to Sales Forecasting of a Cross-Border E-Commerce Enterprise

Sales forecasting is increasingly vital for supply chain management in e-commerce, where a huge amount of transaction data is generated every minute. To enhance the logistics service experience of customers and optimize inventory management, e-commerce enterprises focus on improving the accuracy of sales prediction with machine learning algorithms. In this study, a C-A-XGBoost forecasting model based on XGBoost is proposed that takes both the sales features of commodities and the tendency of the data series into account. First, a C-XGBoost model is established to forecast each of the clusters produced by the two-step clustering algorithm, incorporating sales features into the model as influencing factors. Second, an A-XGBoost model is used to forecast the tendency, with the ARIMA model for the linear part and the XGBoost model for the nonlinear part. The final results are obtained by assigning weights to the forecasting results of the C-XGBoost and A-XGBoost models and summing them. In a comparison with the ARIMA, XGBoost, C-XGBoost, and A-XGBoost models using data from the Jollychic cross-border e-commerce platform, C-A-XGBoost is shown to outperform the other four models.


Introduction
To enhance the logistics service experience of customers in the e-commerce industry chain, supply chain collaboration [1] requires that commodities be stocked in advance in local warehouses in markets around the world, which can effectively reduce logistics time. However, for cross-border e-commerce enterprises, the production and sales areas of products are globalized, so preparations ranging from procurement and transportation to customs quality inspection take longer. Therefore, algorithms and technologies of big data analysis are widely applied to predict sales of e-commerce commodities; they provide the data basis for supply chain management and key technical support for the global supply chain schemes of cross-border e-commerce enterprises.
Besides the large quantity and diversity of transaction data [2], sales forecasts are affected by many other factors due to the complexity of the cross-border e-commerce market [3,4]. Therefore, accounting for these various factors in order to improve the precision and efficiency of forecasting remains a challenge for e-commerce enterprises.
Many studies have been undertaken on sales forecasting. The methods adopted in these studies can roughly be divided into time series models (TSMs) and machine learning algorithms (MLAs) [5,6].
TSMs range from exponential smoothing [7] to the ARIMA families [8], which have been used extensively to predict future trends by extrapolating from historical observations. Although TSMs have proven useful for sales forecasting, their forecasting ability is limited by their assumption of linear behavior [9], and they do not take external factors such as price changes and promotions into account [10]. Therefore, univariate forecasting methods are usually adopted as benchmark models in many studies [11,12].
Another important branch of forecasting is MLAs. The existing MLAs have been largely influenced by state-of-the-art forecasting techniques, which range from artificial neural network (ANN), convolutional neural network (CNN), radial basis function (RBF), long short-term memory network (LSTM), and extreme learning machine (ELM) to support vector regression (SVR), etc. [13].
On the one hand, some existing studies have compared MLAs with TSMs [14]. Ansuj et al. showed the superiority of ANN over the ARIMA method in sales forecasting [15]. Alon et al. compared ANN with traditional methods, including Winters exponential smoothing, the Box-Jenkins ARIMA model, and multivariate regression, indicating that ANNs perform favorably in relation to the more traditional statistical methods [16]. Di Pillo et al. assessed the application of SVM to sales forecasting under promotion impacts, comparing it with ARIMA, Holt-Winters, and exponential smoothing [17].
On the other hand, MLAs based on TSMs have also been applied in sales prediction. Wang et al. demonstrated the advantages of an integrated model combining ARIMA with ANN in modeling the linear and nonlinear parts of the data set [18]. In [19], an ARIMA forecasting model was established, and the residuals of the ARIMA model were trained and fitted by a BP neural network. A novel LSTM ensemble forecasting algorithm was presented by Choi and Lee [20] that effectively combines multiple forecast results from a set of individual LSTM networks. To better handle irregular sales patterns and take various factors into account, some algorithms have attempted to exploit more information in sales forecasting as an increasing amount of data becomes available in e-commerce. Zhao and Wang [21] provided a novel approach to learning effective features automatically from structured data using CNN. Bandara et al. attempted to incorporate sales demand patterns and cross-series information in a unified model by training an LSTM model [22]. ELM has also been widely applied in forecasting. Luo et al. [23] proposed a novel data-driven method to predict user behavior by using ELM with distribution optimization. In [24], ELM was enhanced under a deep learning framework to forecast wind speed.
Although there are various methods of forecasting, the choice of method is determined by the characteristics of different goods [25]. Kulkarni et al. [26] argued that product characteristics could have an impact on both searching and sales because the characteristics inherent to products were the main attributes that potential consumers were interested in. Therefore, to better reflect the characteristics of goods in sales forecasting, clustering techniques have been introduced into forecasting [27]. For example, in [28,29], both fuzzy neural networks and clustering methods were used to improve the results of neural networks. Lu and Wang [30] constructed an SVR to deal with the demand forecasting problem with the aid of hierarchical self-organizing maps and independent component analysis. Lu and Kao [31] put forward a sales forecasting method based on clustering using an extreme learning machine and a combination linkage method. Dai et al. [32] built a clustering-based sales forecasting scheme based on SVR. A clustering-based forecasting model combining clustering and machine learning methods was developed by Chen and Lu [33] for computer retailing sales forecasting.
According to the above literature review, a three-stage XGBoost-based forecasting model is constructed to focus on the two aspects (the sales features and tendency of a data series) mentioned above in this study.
Firstly, in order to forecast the sales features, various influencing factors of sales are introduced in this study by the two-step clustering algorithm [34], which is an improved algorithm based on BIRCH [35]. Then, a clustering-based C-XGBoost model is presented that models each of the resulting clusters with the XGBoost algorithm, which has proven to be an efficient predictor in many data analysis contests such as Kaggle and in many recent studies [36,37].
Secondly, to achieve higher accuracy in predicting the tendency of the data series, an A-XGBoost model is presented that integrates the strengths of the ARIMA and XGBoost models for the linear part and the nonlinear part of the data series, respectively. Finally, a C-A-XGBoost model is constructed as the final combination model by weighting the C-XGBoost and A-XGBoost models, which takes both the multiple factors affecting the sales of goods and the trend of the time series into account. The paper is organized into 5 sections, and the rest of it is organized as follows: In Section 2, the key models and algorithms employed in the study are briefly described, including feature selection, the two-step clustering algorithm, a method of parameter determination for the ARIMA, and the XGBoost. In Section 3, a three-stage XGBoost-based model is proposed to forecast both the sales features and the tendency of the time series. In Section 4, numerical examples are used to illustrate the validity of the proposed forecasting model. In Section 5, the conclusions are summarized along with a note regarding future research directions.

Feature Selection.
With the emergence of web technologies, there is an ever-increasing growth in the amount of big data in the e-commerce environment [38]. Variety is one of the critical attributes in big data as they are generated from a wide variety of sources and formats, including text, web, tweet, audio, video, click-stream, and log files [39]. In order to remove most irrelevant and redundant information from various data, many techniques of feature selection (removing variables that are irrelevant) and feature extraction (applying some transformations to the existing variables to obtain a new one) have been discussed to reduce the dimensionality of the data [40], including filter-based and wrapper feature selection. Wrapper feature selection employs a subroutine statistical resampling technique (such as cross-validation) in the actual learning algorithm to forecast the accuracy of feature subsets [41], which is a better choice for different algorithms modeling the different data series. Instead, filter-based feature selection is suitable for different algorithms, modeling the same data series [42].
In this study, wrapper feature selection in the forecasting and clustering algorithms is directly applied to removing unimportant attributes in multidimensional data based on standard deviation (SD), the coefficient of variation (CV), Pearson correlation coefficient (PCC), and feature importance scores (FIS), of which the details are as follows.
SD reflects the degree of dispersion of a data set and is calculated as $\sigma$, where $N$ and $\mu$ denote the number of samples and the mean value of the samples $x_i$, respectively:

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^2}. \tag{1}$$

CV is a statistic that measures the degree of variation of the observed values in the data and is calculated as $c_v$:

$$c_v = \frac{\sigma}{\mu}. \tag{2}$$

PCC is a statistic that reflects the degree of linear correlation between two variables and is calculated as $r$:

$$r = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{X_i-\bar{X}}{\sigma_X}\right)\left(\frac{Y_i-\bar{Y}}{\sigma_Y}\right), \tag{3}$$

where $(X_i-\bar{X})/\sigma_X$, $\bar{X}$, and $\sigma_X$ represent the standard score, mean value, and standard deviation of $X_i$, respectively, and likewise for $Y_i$.
FIS provides a score indicating how useful or valuable each feature is in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance [43]. The importance is calculated for a single decision tree as the amount by which each attribute split point improves the performance measure, weighted by the number of observations the node is responsible for. The performance measure may be the purity, such as the Gini index [44], used to select the split points, or another more specific error function. The feature importance is then averaged across all of the decision trees within the model [45].
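The three dispersion and correlation statistics above are simple enough to sketch directly. The following is a minimal illustration in plain Python, assuming the population form of the standard deviation (division by N); the function names and the sample numbers are hypothetical and are not taken from the study's data.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def std_dev(values):
    """Population standard deviation (divides by N, an assumption of this sketch)."""
    mu = mean(values)
    return math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))


def coeff_variation(values):
    """Coefficient of variation: standard deviation relative to the mean."""
    return std_dev(values) / mean(values)


def pearson(xs, ys):
    """Pearson correlation as the mean product of standard scores."""
    mx, my = mean(xs), mean(ys)
    sx, sy = std_dev(xs), std_dev(ys)
    products = (((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return sum(products) / len(xs)


# Made-up attribute values, used only to exercise the functions.
clicks = [2, 4, 4, 4, 5, 5, 7, 9]
sales = [1, 2, 2, 2, 3, 3, 4, 5]
print(std_dev(clicks), coeff_variation(clicks), pearson(clicks, sales))
```

An attribute with a very small SD or CV carries little information, and a pair of attributes with a PCC close to 1 or -1 is largely redundant; the cut-off values used for such decisions are not stated in the text and would have to be chosen by the analyst.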

Two-Step Clustering Algorithm.
Clustering aims at partitioning samples into several disjoint subsets so that samples in the same subset are highly similar to each other [46]. The most widely applied clustering algorithms can broadly be categorized as partition, hierarchical, density-based, grid-based, and model-based methods [47,48]. The selection of a clustering algorithm mainly depends on the scale and the type of the collected data. Clustering can be conducted using traditional algorithms when dealing with numeric or categorical data [49,50]. BIRCH, one of the hierarchical methods, introduced by Zhang et al. [35], is especially suitable for large data sets of continuous attributes [51]. For large data sets of mixed type, however, the two-step clustering algorithm in SPSS Modeler is recommended in this study. The two-step clustering algorithm is a modified method based on BIRCH that uses the log-likelihood distance as its measure, which can measure both the distance between continuous data and the distance between categorical data [34]. Similar to BIRCH, the two-step clustering algorithm first performs a preclustering step that scans the entire data set and stores the dense regions of data records in terms of summary statistics. A hierarchical clustering algorithm is then applied to cluster the dense regions. Apart from the ability to handle attributes of mixed type, the two-step clustering algorithm differs from BIRCH in automatically determining the appropriate number of clusters and in using a new strategy for assigning cluster membership to noisy data.
As one of the hierarchical algorithms, the two-step clustering algorithm is also more efficient in handling noise and outliers than partition algorithms. More importantly, it has a unique advantage over other algorithms in its automatic mechanism for determining the optimal number of clusters. Therefore, for the large and mixed transaction data sets of e-commerce, the two-step clustering algorithm is a reliable choice for clustering goods; its key technologies and processes are illustrated in Figure 1.

Preclustering.
The clustering feature (CF) tree growth in the BIRCH algorithm is used to read the data records in the data set one by one, and outliers are handled during this process. Then, subclusters $C_j$ are obtained from the data records in dense areas while the CF tree is generated.

Clustering.
Taking the subclusters $C_j$ as objects, the clusters $C_J$ are obtained by merging the subclusters one by one with agglomerative hierarchical clustering methods [52] until the optimal number of clusters is determined based on the minimum value of the Bayesian information criterion (BIC).

Cluster Membership Assignment.
The data records are assigned to the nearest clusters by calculating the log-likelihood distance between the data records and the subclusters of the clusters $C_J$.

Validation of the Results.
The performance of the clustering results is measured by the silhouette coefficient $S$, where $a$ is the mean distance between a sample and the other samples in its own cluster and $b$ is the mean distance between the sample and the samples in the nearest different cluster. The higher the value of $S$, the better the clustering result:

$$S = \frac{b-a}{\max(a,b)}. \tag{4}$$

2.3. Parameter Determination of ARIMA Model.
ARIMA models are obtained from a combination of autoregressive and moving average models [53]. The Box-Jenkins methodology in time series theory is applied to establish an ARIMA (p, d, q) model, and its calculation steps can be found in [54]. ARIMA has limitations in parameter determination because its parameters are usually judged from plots of the ACF and PACF, which is subjective and often leads to deviation. However, the function auto.arima() in the R package "forecast" [55] can automatically generate an optimal ARIMA model for each time series based on the smallest Akaike information criterion (AIC) and BIC [56], which compensates for this weakness. Therefore, a combined method of parameter determination is proposed to improve the fitting performance of the ARIMA, which combines the results of the ACF and PACF plots with those of the auto.arima() function. The procedures are illustrated in Figure 2 and described as follows:
Step 1. Test for stationarity and white noise with the augmented Dickey-Fuller (ADF) and Box-Pierce tests before modeling the ARIMA. If the series is found to be stationary and not white noise, the ARIMA is suitable for the time series.
Step 2. Determine a part of parameter combinations based on ACF and PACF plots, and determine another part of parameter combinations by the auto.arima ( ) function in R application.
Step 3. Model the ARIMA under different parameter combinations, and then calculate the values of AIC for different models.
Step 4. Determine the optimal parameters combination of the ARIMA with the minimum of AIC.
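Steps 3 and 4 amount to scoring every candidate order by AIC and keeping the minimum. The sketch below illustrates only that selection step in plain Python. It assumes that each candidate has already been fitted elsewhere (for example in R) and that its maximized log-likelihood is available; the log-likelihood values are invented to exercise the function and are not results from this study, and counting p + q + 1 parameters is an assumption of the sketch.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_order(candidates):
    """Return the (p, d, q) order with the smallest AIC and all AIC values.

    `candidates` maps an order (p, d, q) to the maximized log-likelihood
    of the corresponding fitted model.
    """
    scores = {}
    for (p, d, q), log_likelihood in candidates.items():
        n_params = p + q + 1  # AR and MA coefficients plus the noise variance
        scores[(p, d, q)] = aic(log_likelihood, n_params)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical log-likelihoods: some orders read off ACF/PACF plots and
# one order suggested by an automatic search.
candidates = {
    (2, 1, 2): -512.4,
    (2, 1, 3): -511.9,
    (2, 1, 4): -507.1,
    (0, 1, 1): -516.0,
}
best_order, aic_by_order = select_order(candidates)
print(best_order, aic_by_order[best_order])
```

Pooling the plot-based candidates with the automatically suggested one before taking the minimum is what distinguishes the combined procedure from relying on either source alone.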

XGBoost Algorithm.
XGBoost is short for "Extreme Gradient Boosting," an implementation of the gradient boosting framework proposed by Friedman [57]. As the basic theory of XGBoost has been covered in plenty of previous papers [58,59], this study describes the procedures of the algorithm [60] rather than the basic theory.

Feature Selection.
The specific steps of feature selection via the XGBoost are as follows: data cleaning, data feature extraction, and data feature selection based on the scores of feature importance.

Modeling Training.
The model is trained on the selected features with default parameters.

Parameter Optimization.
Parameter optimization is aimed at minimizing the errors between predicted values and actual values. There are three types of parameters in the algorithm, the descriptions of which are listed in Table 1.
The general steps of determining the hyperparameters of the XGBoost model are as follows:
Step 1. The number of estimators is first tuned with the learning rate and the other parameters fixed.
Step 2. Different combinations of max_depth and min_child_weight are tuned.
Step 3. max_delta_step and gamma are tuned to make the model more conservative, using the parameters determined in Steps 1 and 2.
Step 4. Different combinations of subsample and colsample_bytree are tuned to prevent overfitting.
Step 5. Regularization parameters are increased to make the model more conservative.
Step 6. The learning rate is reduced to prevent overfitting.

The Proposed Three-Stage Forecasting Model
In this research, a three-stage XGBoost-based forecasting model, named the C-A-XGBoost model, is proposed in consideration of both the sales features and the tendency of the data series.
In Stage 1, a novel C-XGBoost model is put forward based on clustering and XGBoost, which incorporates different clustering features into forecasting as influencing factors. The two-step clustering algorithm is first applied to partition commodities into different clusters based on features, and then each of the resulting clusters is modeled via XGBoost.
In Stage 2, an A-XGBoost model is presented that combines the ARIMA with XGBoost to predict the tendency of the time series, drawing on the linear fitting ability of ARIMA and the strong nonlinear mapping ability of XGBoost. ARIMA is used to predict the linear part, and the rolling prediction method is employed to establish an XGBoost model that revises the nonlinear part of the data series, namely, the residuals of the ARIMA.
In Stage 3, a combination model named C-A-XGBoost is constructed based on C-XGBoost and A-XGBoost. C-A-XGBoost is aimed at minimizing the error sum of squares by assigning weights to the results of C-XGBoost and A-XGBoost, in which the weights reflect the reliability and credibility of the sales features and the tendency of the data series. The procedures of the proposed three-stage model are shown in Figure 3, and the details are given as follows.

Stage 1. C-XGBoost Model.
The two-step clustering algorithm is applied to cluster a data series into several disjoint clusters. Then, each of the resulting clusters provides the input and output sets used to construct and optimize the corresponding C-XGBoost model. Finally, the testing samples are partitioned into the corresponding clusters by the trained two-step clustering model, and the prediction results are calculated with the corresponding trained C-XGBoost model.
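The routing of test samples in Stage 1 can be pictured as a lookup from cluster label to trained model. The sketch below shows only that dispatch logic in plain Python. The cluster assignment rule and the per-cluster predictors are toy stand-ins (hypothetical names, thresholds, and coefficients) for the trained two-step clustering model and the trained XGBoost models, which are not reproduced here.

```python
def forecast_by_cluster(samples, assign_cluster, models):
    """Send each sample to the model trained for its cluster."""
    predictions = []
    for features in samples:
        cluster_id = assign_cluster(features)
        predictions.append(models[cluster_id](features))
    return predictions


# Toy stand-in for the trained clustering model (hypothetical rule).
def toy_assign_cluster(features):
    return "low_price" if features["goods_price"] < 20 else "high_price"


# Toy stand-ins for the per-cluster predictors (hypothetical coefficients).
toy_models = {
    "low_price": lambda f: 0.05 * f["goods_click"],
    "high_price": lambda f: 0.01 * f["goods_click"],
}

test_samples = [
    {"goods_price": 10, "goods_click": 200},
    {"goods_price": 50, "goods_click": 300},
]
preds = forecast_by_cluster(test_samples, toy_assign_cluster, toy_models)
print(preds)
```

Keeping the assignment function and the model table separate mirrors the text: the clustering model is trained once on the training set, and each cluster's predictor is tuned independently.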

Stage 2. A-XGBoost Model.
The optimal ARIMA model, selected by the minimum AIC after the data series passes the tests of stationarity and white noise, is trained as described in Section 2.3. Then, the residual vector $e = (r_1, r_2, \ldots, r_n)^{\mathrm{T}}$ between the predicted values and the actual values is obtained from the trained ARIMA model. Next, the A-XGBoost is established by setting columns 1 to $k$ and column $(k+1)$ of the matrix $R$ as the input and output, respectively, as illustrated in the following equation:

$$R = \begin{pmatrix} r_1 & r_2 & \cdots & r_k & r_{k+1} \\ r_2 & r_3 & \cdots & r_{k+1} & r_{k+2} \\ \vdots & \vdots & & \vdots & \vdots \\ r_{n-k} & r_{n-k+1} & \cdots & r_{n-1} & r_n \end{pmatrix}. \tag{5}$$

The final results for the test set are calculated by summing the linear part predicted by the trained ARIMA and the residuals predicted by the established XGBoost.
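The rolling construction of the residual matrix can be sketched in a few lines. The example below is a minimal illustration in plain Python, assuming that each row holds k consecutive residuals as inputs followed by the next residual as the output; the function names and the numbers are hypothetical, and the XGBoost fit itself is left out.

```python
def lag_matrix(residuals, k):
    """Build rows of k + 1 consecutive residuals (sliding window of step 1)."""
    n = len(residuals)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < len(residuals)")
    return [list(residuals[i:i + k + 1]) for i in range(n - k)]


def split_inputs_outputs(rows):
    """Columns 1..k are the inputs; column k + 1 is the output."""
    inputs = [row[:-1] for row in rows]
    outputs = [row[-1] for row in rows]
    return inputs, outputs


def add_residual_correction(arima_forecast, residual_forecast):
    """Final forecast: linear part plus predicted residual."""
    return [a + r for a, r in zip(arima_forecast, residual_forecast)]


# Made-up residuals, used only to show the shapes involved.
rows = lag_matrix([1, 2, 3, 4, 5], k=2)
inputs, outputs = split_inputs_outputs(rows)
print(rows)
print(add_residual_correction([10.0, 11.0], [0.5, -0.25]))
```

With n residuals and window length k this yields n - k training rows, so a larger k gives the nonlinear model more history per sample but fewer samples to learn from.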

Stage 3. C-A-XGBoost Model.
In this stage, a combination strategy is explored to minimize the error sum of squares MSE in equation (6) by assigning weights $w_C$ and $w_A$ to C-XGBoost and A-XGBoost, respectively. The predicted results are calculated using equation (7), where $y_C(k)$, $y_A(k)$, and $Y_{CA}(k)$ denote the forecast values of the $k$-th sample via C-XGBoost, A-XGBoost, and C-A-XGBoost, respectively. In equation (6), $y(k)$ is the actual value of the $k$-th sample and $n$ is the number of samples:

$$\min \mathrm{MSE} = \frac{1}{n}\sum_{k=1}^{n}\left(Y_{CA}(k)-y(k)\right)^2, \tag{6}$$

$$Y_{CA}(k) = w_C\,y_C(k) + w_A\,y_A(k). \tag{7}$$

Least squares is employed to find the optimal weights ($w_C$ and $w_A$), and the calculation is simplified by transforming the equations into the following matrix operations.
In equation (8), the matrix $B$ consists of the predicted values of C-XGBoost and A-XGBoost:

$$B = \begin{pmatrix} y_C(1) & y_A(1) \\ y_C(2) & y_A(2) \\ \vdots & \vdots \\ y_C(n) & y_A(n) \end{pmatrix}. \tag{8}$$

In equation (9), the matrix $W$ consists of the weights. In equation (10), the matrix $Y$ consists of the actual values:

$$W = \begin{pmatrix} w_C \\ w_A \end{pmatrix}, \tag{9}$$

$$Y = \begin{pmatrix} y(1) & y(2) & \cdots & y(n) \end{pmatrix}^{\mathrm{T}}. \tag{10}$$

Equation (11) is obtained by writing equation (7) in matrix form, with the combined forecasts set equal to the actual values:

$$Y = BW. \tag{11}$$

Equation (12) is obtained by left-multiplying equation (11) by the transpose of the matrix $B$:

$$B^{\mathrm{T}}Y = B^{\mathrm{T}}BW. \tag{12}$$

According to equation (13), the optimal weights ($w_C$ and $w_A$) are calculated:

$$W = \left(B^{\mathrm{T}}B\right)^{-1}B^{\mathrm{T}}Y. \tag{13}$$

There are 10 continuous attributes and 6 categorical attributes in the clustering series, which are obtained by reconstructing the source data series. The attribute descriptions of the clustering series are given in Table 3.

Uniform Experimental Conditions.
To verify the performance of the proposed model according to performance evaluation indexes, some uniform experimental conditions are established as follows.

Uniform Data Set.
As shown in Table 4, the data series are partitioned into the training set, validation set, and test set so as to satisfy the requirements of different models. The data application is described as follows:
(1) The clustering series cover samples of 381 days.
(2) For the C-XGBoost model, training set 1, namely, the samples of the first 347 days in the clustering series, is used to establish the two-step clustering model. The resulting clusters are used to construct the XGBoost models. The test set, with the remaining samples of 34 days, is selected to validate the C-XGBoost model. In detail, the test set is first partitioned into the corresponding clusters by the established two-step clustering model and is then used to check the validity of the corresponding C-XGBoost models.

Uniform Evaluation Indexes.
Several performance measures have previously been applied to verify the viability and effectiveness of forecasting models. As listed in Table 5, common evaluation measures are chosen to identify the optimal forecasting model. The smaller they are, the more accurate the model is.
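The four indexes used later in the comparison (ME, MSE, RMSE, and MAE) can be sketched as follows. This is a minimal illustration in plain Python under the usual textbook definitions, assuming the error is taken as actual minus predicted; the exact expressions in Table 5 are not reproduced in the text, so the sign convention and the function names here are assumptions.

```python
import math


def forecast_errors(actual, predicted):
    """Signed errors, taken here as actual minus predicted (an assumption)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return [a - p for a, p in zip(actual, predicted)]


def mean_error(actual, predicted):
    """ME: mean of the signed errors."""
    errors = forecast_errors(actual, predicted)
    return sum(errors) / len(errors)


def mean_squared_error(actual, predicted):
    """MSE: mean of the squared errors."""
    errors = forecast_errors(actual, predicted)
    return sum(e * e for e in errors) / len(errors)


def root_mean_squared_error(actual, predicted):
    """RMSE: square root of the MSE."""
    return math.sqrt(mean_squared_error(actual, predicted))


def mean_absolute_error(actual, predicted):
    """MAE: mean of the absolute errors."""
    errors = forecast_errors(actual, predicted)
    return sum(abs(e) for e in errors) / len(errors)


# Made-up values, used only to exercise the functions.
actual = [3, 5, 2, 8]
predicted = [2, 6, 2, 5]
print(mean_error(actual, predicted), mean_absolute_error(actual, predicted))
```

Because positive and negative errors cancel in ME, it is usually read together with MAE or RMSE: a small ME with a large MAE indicates an unbiased but imprecise forecast.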

Uniform Parameters of the XGBoost Model.
The first priority for optimization is to tune max_depth and min_child_weight with the other parameters fixed, which is the most effective way of optimizing the XGBoost. The ranges of max_depth and min_child_weight are 6-10 and 1-6, respectively. Default values of the parameters are listed in Table 6.

C-XGBoost Model
(1) Step 1. Commodity clustering: The two-step clustering algorithm is first applied to training set 1. Standardization is applied to the continuous attributes; the noise percentage for outlier handling is 25%; the log-likelihood distance is the basis of distance measurement; BIC is set as the clustering criterion.
As illustrated in Figure 5, the ratio of cluster sizes is 2.64, and no cluster accounts for too large or too small a percentage of the samples. Therefore, the cluster quality is acceptable.
(2) Step 2. Construct the C-XGBoost models: Features are first selected for each cluster C12_j of the 12 clusters based on feature importance scores. After that, with the selected features of each cluster and the SKU sales in Table 3 set as the input and output variables, respectively, a C-XGBoost model is constructed for each cluster C12_j, denoted as C12_j_XGBoost. The cluster C12_3 is taken as an example to illustrate the process of modeling with XGBoost.
For C12_3, the features listed in Table 3 are first filtered, and the 7 selected features are displayed in Figure 6. It can be observed that F1 (goods click), F3 (cart click), F5 (goods price), F6 (sales unique visitor), and F7 (original shop price) are the dominant factors, whereas F2 (temperature mean) and F4 (favorites click) contribute less to the prediction.
With the 11 features of the cluster C12_3 in Step 1 and the corresponding SKU sales in Table 3 set as the input and output, respectively, C12_3_XGBoost is pretrained under the default parameters in Table 6. For the prebuilt C12_3_XGBoost model, the value of ME is 0.393 and the value of MAE is 0.896.
(3) Step 3. Parameter optimization: XGBoost is a supervised learning algorithm, so the key to optimization is to determine the appropriate input and output variables. In contrast, parameter optimization has less impact on the accuracy of the algorithm. Therefore, in this paper, only the primary parameters, max_depth and min_child_weight, are tuned to optimize the XGBoost [61]. The model can reach a balance point because increasing max_depth makes the model more complex and more likely to overfit, whereas increasing min_child_weight makes the model more conservative. The prebuilt C12_3_XGBoost model is optimized to minimize ME and MAE by tuning max_depth (from 6 to 10) and min_child_weight (from 1 to 6) with the other parameters fixed; the parameter ranges are determined according to many case studies with the XGBoost, such as [62]. The optimal parameter combination is the one with the minimum ME and MAE among the parameter combinations. Figure 7 shows how ME and MAE change as max_depth and min_child_weight change. Both ME and MAE are the smallest when max_depth is 9 and min_child_weight is 2; that is, the model is optimal at this combination.

Table footnotes. The six data series are sourced from the historical data of the Saudi Arabian market on the Jollychic cross-border e-commerce trading platform (https://www.jollychic.com/). (g) The data of holidays are captured from http://shijian.cc/114/jieri2017/. (h) The data of temperature are captured from https://www.wunderground.com/weather/eg/saudi-arabia. (i) SKU stands for stock keeping unit; each product has a unique SKU number.

Table 5: The description of evaluation indexes (columns: evaluation indexes, expression, description).
(4) Step 4. Results on the test set: The test set is partitioned into the corresponding clusters by the two-step clustering model trained in Step 1. After that, Steps 2-3 are repeated for the test set.
As shown in Table 7, the test set is partitioned into the clusters C12_3 and C12_4. Then, the corresponding models C12_3_XGBoost and C12_4_XGBoost are determined. C12_3_XGBoost has been trained and optimized as an example in Steps 2-3, and C12_4_XGBoost is also trained and optimized by repeating Steps 2-3. Finally, the prediction results are obtained with the optimized C12_3_XGBoost and C12_4_XGBoost.
As illustrated in Figure 8, ME and MAE for C12_4_XGBoost change with the values of max_depth and min_child_weight. The model performs best when max_depth is 10 and min_child_weight is 2 because both ME and MAE are the smallest. The forecasting results for the test set are calculated and summarized in Table 7.
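The search over max_depth (6 to 10) and min_child_weight (1 to 6) described in Step 3 is an exhaustive grid search over 30 combinations. The sketch below shows the search loop in plain Python. The `evaluate` callback is a hypothetical stand-in for training an XGBoost model and measuring ME and MAE on held-out data; here it is replaced by a synthetic bowl-shaped surface, so the optimum it finds says nothing about the real models. Ranking by MAE first and breaking ties by the absolute value of ME is also an assumption of this sketch.

```python
from itertools import product


def grid_search(evaluate, depths=range(6, 11), child_weights=range(1, 7)):
    """Try every (max_depth, min_child_weight) pair and keep the best one.

    `evaluate(max_depth, min_child_weight)` must return a (ME, MAE) pair.
    Combinations are ranked by MAE, with |ME| as the tie-breaker.
    """
    results = {}
    for depth, weight in product(depths, child_weights):
        me, mae = evaluate(depth, weight)
        results[(depth, weight)] = (mae, abs(me))
    best = min(results, key=results.get)
    return best, results


def toy_evaluate(max_depth, min_child_weight):
    """Synthetic error surface with its minimum at (8, 3); not real model output."""
    distance = (max_depth - 8) ** 2 + (min_child_weight - 3) ** 2
    return 0.1 + 0.005 * distance, 0.5 + 0.01 * distance


best_params, all_results = grid_search(toy_evaluate)
print(best_params, all_results[best_params])
```

When ME and MAE disagree about the best combination, some explicit ranking rule such as the one above is needed; in the figures discussed in the text both indexes reach their minimum at the same combination, so the question does not arise there.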

A-XGBoost Model
(1) Step 1. Test stationarity and white noise of training set 2: For training set 2, the p values of the ADF test and the Box-Pierce test are 0.01 and $3.331 \times 10^{-16}$, respectively, both of which are lower than 0.05. Therefore, the time series is stationary and is not white noise, indicating that training set 2 is suitable for the ARIMA.
(2) Step 2. Train the ARIMA model: According to Section 2.3, parameter combinations are first determined by the ACF and PACF plots and by the auto.arima() function in the R package "forecast." As shown in Figure 9(a), SKU sales fluctuate significantly in the first 50 days compared with the sales after 50 days; in Figure 9(b), the ACF plot shows pronounced tailing; in Figure 9(c), the PACF plot decreases and oscillates. Therefore, the first-order difference should be calculated.
As illustrated in Figure 10(a), SKU sales fluctuate around zero after the first-order difference. Figures 10(b) and 10(c) present the ACF and PACF plots after the first-order difference, both of which decrease and oscillate. This indicates that training set 2 conforms to the ARMA. As a result, the possible optimal models are ARIMA (2, 1, 2), ARIMA (2, 1, 3), and ARIMA (2, 1, 4) according to the ACF and PACF plots in Figure 10. Table 8 shows the AIC values of the ARIMA under different parameters, which are generated by the auto.arima() function. It can be concluded that ARIMA (0, 1, 1) is the best model found by auto.arima() because it has the lowest AIC.
To further determine the optimal model, the AIC and RMSE of the ARIMA models under different parameters are summarized in Table 9. The possible optimal models include the 3 candidate ARIMA models judged from Figure 10 and the best ARIMA generated by the auto.arima() function. By the minimum principle, ARIMA (2, 1, 4) is optimal because it has the best AIC and RMSE.
(3) Step 3. Calculate residuals of the optimal ARIMA: The prediction results from the 278th to the 381st day are obtained by using the trained ARIMA (2, 1, 4), denoted as ARIMA_forecast. Then, the residuals between the predicted values ARIMA_forecast and the actual values SKU_sales are calculated, denoted as ARIMA_residuals.

C-A-XGBoost Model.
The optimal combination weights are determined by minimizing the MSE in equation (6).
For the test set, the weights $w_C$ and $w_A$ are obtained from the matrix operation in equation (13).
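For two component models, the least-squares weights can be computed in closed form, because the normal equations reduce to a 2 x 2 linear system. The sketch below solves that system in plain Python. The function names and the forecast values are hypothetical; the example data are constructed so that the true weights are 0.7 and 0.3, and they are not results from this study.

```python
def combination_weights(y_c, y_a, y):
    """Solve (B^T B) W = B^T Y for W = (w_C, w_A), where B = [y_c, y_a]."""
    s_cc = sum(c * c for c in y_c)
    s_ca = sum(c * a for c, a in zip(y_c, y_a))
    s_aa = sum(a * a for a in y_a)
    t_c = sum(c * v for c, v in zip(y_c, y))
    t_a = sum(a * v for a, v in zip(y_a, y))
    det = s_cc * s_aa - s_ca * s_ca
    if abs(det) < 1e-12:
        raise ValueError("forecasts are collinear; weights are not unique")
    w_c = (s_aa * t_c - s_ca * t_a) / det
    w_a = (s_cc * t_a - s_ca * t_c) / det
    return w_c, w_a


def weighted_forecast(y_c, y_a, w_c, w_a):
    """Combined forecast: weighted sum of the two component forecasts."""
    return [w_c * c + w_a * a for c, a in zip(y_c, y_a)]


# Made-up component forecasts; the actual values equal 0.7 * y_c + 0.3 * y_a.
y_c = [1.0, 2.0, 3.0, 4.0]
y_a = [2.0, 1.0, 4.0, 3.0]
y = [1.3, 1.7, 3.3, 3.7]
w_c, w_a = combination_weights(y_c, y_a, y)
print(w_c, w_a, weighted_forecast(y_c, y_a, w_c, w_a))
```

The weights are not constrained to be nonnegative or to sum to one in this formulation, so a component that is systematically too low can receive a weight above one.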

Models for Comparison.
In this section, the following models are chosen for the comparison between the proposed models and other classical models:
ARIMA. As one of the common time series models, it is used to predict sales over the time sequence, following the same process as the ARIMA in Section 4.3.2.
XGBoost. The XGBoost model is constructed and optimized by setting the selected features and the corresponding SKU sales as input and output.
C-XGBoost. Taking the sales features of commodities into account, the XGBoost is used to forecast sales based on the clusters produced by the two-step clustering model. The procedures are the same as those in Section 4.3.1.
A-XGBoost. The A-XGBoost is applied to revise the residuals of the ARIMA. Namely, the ARIMA is first used to model the linear part of the time series, and then XGBoost is used to model the nonlinear part. The relevant processes are described in Section 4.3.2.
C-A-XGBoost. The model combines the advantages of C-XGBoost and A-XGBoost, and its procedures are described in Section 4.3.3.

Results of Different Models.
In this section, the test set is used to verify the superiority of the proposed C-A-XGBoost. Figure 11 shows the curve of the actual values SKU_sales and five fitted curves of predicted values from the 348th day to the 381st day, which are obtained by the ARIMA, XGBoost, C-XGBoost, A-XGBoost, and C-A-XGBoost.
It can be seen that C-A-XGBoost fits the original values best, as its fitted curve is the closest of the five to the curve of the actual values SKU_sales.
To further illustrate the superiority of the proposed C-A-XGBoost, the evaluation indexes mentioned in Section 4.2.2 are applied to distinguishing the best model of the sales forecast. Table 11 provides a comparative summary of the indexes for the five models in Section 4.4.
According to Table 11, it can be concluded that the superiority of the proposed C-A-XGBoost is distinct compared with the other models, as its evaluation indexes are minimized.
C-XGBoost is inferior to C-A-XGBoost but outperforms the other three models, underlining that C-XGBoost is superior to the single XGBoost.
A-XGBoost has a superior performance relative to ARIMA, proving that XGBoost is effective for residual modification of ARIMA.
According to the analysis above, the proposed C-A-XGBoost has the best forecasting performance for sales of commodities in the cross-border e-commerce enterprise.

Conclusions and Future Directions
In this research, a new XGBoost-based forecasting model named C-A-XGBoost is proposed, which takes the sales features and the tendency of the data series into account. The C-XGBoost is first presented, combining clustering and XGBoost with the aim of reflecting the sales features of commodities in forecasting. The two-step clustering algorithm is applied to partition the data series into different clusters based on selected features, which are used as the influencing factors for forecasting. After that, the corresponding C-XGBoost models are established for the different clusters using the XGBoost. The proposed A-XGBoost takes advantage of the ARIMA in predicting the tendency of the data series and overcomes the disadvantages of the ARIMA by applying the XGBoost to the nonlinear part of the data series. The optimal ARIMA is obtained by comparing AICs under different parameters, and the trained ARIMA model is then used to predict the linear part of the data series. For the nonlinear part of the data series, rolling prediction is conducted by the trained XGBoost, whose input and output are the residuals produced by the ARIMA. The final results of the A-XGBoost are calculated by adding the residuals predicted by the XGBoost to the corresponding values forecast by the ARIMA.
In conclusion, the C-A-XGBoost is developed by assigning appropriate weights to the forecasting results of the C-XGBoost and A-XGBoost so as to draw on their respective strengths. Consequently, a linear combination of the two models' forecasting results is calculated as the final predicted values.
To verify the effectiveness of the proposed C-A-XGBoost, the ARIMA, XGBoost, C-XGBoost, and A-XGBoost are employed for comparison. Meanwhile, four common evaluation indexes, ME, MSE, RMSE, and MAE, are used to check the forecasting performance of C-A-XGBoost. The experiment demonstrates that the C-A-XGBoost outperforms the other models, indicating that C-A-XGBoost provides theoretical support for the sales forecasts of the e-commerce company and can serve as a reference for selecting forecasting models. It is advisable for the e-commerce company to choose different forecasting models for different commodities instead of using a single model. Two potential extensions are put forward for future research. On the one hand, there may be no model for which all evaluation indicators are minimal, which makes it difficult to choose the optimal model. Therefore, a comprehensive evaluation index of forecasting performance will be constructed to overcome this difficulty. On the other hand, sales forecasting is actually used to optimize inventory management, so relevant factors should be considered, including inventory cost, order lead time, delivery time, and transportation time.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.