Sales forecasting is increasingly vital for supply chain management in e-commerce, where a huge amount of transaction data is generated every minute. In order to enhance the logistics service experience of customers and optimize inventory management, e-commerce enterprises focus on improving the accuracy of sales prediction with machine learning algorithms. In this study, a C-A-XGBoost forecasting model is proposed that takes both the sales features of commodities and the tendency of the data series into account, based on the XGBoost model. A C-XGBoost model is first established to forecast each of the clusters produced by the two-step clustering algorithm, incorporating sales features into the C-XGBoost model as influencing factors of forecasting. Secondly, an A-XGBoost model is used to forecast the tendency, with the ARIMA model for the linear part and the XGBoost model for the nonlinear part. The final results are obtained by assigning weights to the forecasting results of the C-XGBoost and A-XGBoost models and summing them. By comparison with the ARIMA, XGBoost, C-XGBoost, and A-XGBoost models using data from the Jollychic cross-border e-commerce platform, the C-A-XGBoost is shown to outperform the other four models.
In order to enhance the logistics service experience of customers in the e-commerce industry chain, supply chain collaboration [
Besides the large quantity and diversity of transaction data [
Plenty of studies have been undertaken on sales forecasting. The methods adopted in these studies can roughly be divided into time series models (TSMs) and machine learning algorithms (MLAs) [
TSMs range from the exponential smoothing [
Another important branch of forecasting has been MLAs. The existing MLAs have been largely influenced by state-of-the-art forecasting techniques, which range from artificial neural network (ANN), convolutional neural network (CNN), radial basis function (RBF), long short-term memory network (LSTM), and extreme learning machine (ELM) to support vector regression (SVR), etc. [
On the one hand, some existing studies have compared MLAs with TSMs [
On the other hand, MLAs based on TSMs have also been applied in sales prediction. Wang et al. proved the advantages of the integrated model combining ARIMA with ANN in modeling the linear and nonlinear parts of the data set [
Although there are various methods of forecasting, the choice of methods is determined by the characteristics of different goods [
Guided by the above literature review, a three-stage XGBoost-based forecasting model is constructed in this study to address the two aspects mentioned above: the sales features and the tendency of a data series.
Firstly, in order to capture the sales features, various influencing factors of sales are introduced in this study by the two-step clustering algorithm [
Secondly, to achieve higher predictive accuracy for the tendency of the data series, an A-XGBoost model is presented, integrating the strengths of the ARIMA and XGBoost models for the linear part and the nonlinear part of the data series, respectively. Therefore, a C-A-XGBoost model is constructed as the final combination model by weighting the C-XGBoost and A-XGBoost models, which takes both the multiple factors affecting the sales of goods and the trend of the time series into account.
The paper is organized into five sections, the rest of which is structured as follows: In Section
With the emergence of web technologies, there is an ever-increasing growth in the amount of big data in the e-commerce environment [
In this study, wrapper feature selection in the forecasting and clustering algorithms is directly applied to removing unimportant attributes in multidimensional data based on standard deviation (SD), the coefficient of variation (CV), Pearson correlation coefficient (PCC), and feature importance scores (FIS), of which the details are as follows.
SD reflects the degree of dispersion of a data set and is calculated as the square root of the average squared deviation of the observations from their mean.
CV is a statistic measuring the degree of variation of observed values in the data, calculated as the ratio of the standard deviation to the mean.
PCC is a statistic reflecting the degree of linear correlation between two variables, calculated as their covariance divided by the product of their standard deviations.
FIS provides a score indicating how useful or valuable each feature is in the construction of the boosted decision trees within the model. The more an attribute is used to make key decisions with decision trees, the higher its relative importance [
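The first three statistics follow directly from their definitions; the sketch below shows them in standard-library Python (our own illustrative code, not from the paper; FIS, by contrast, has no closed form and is read off the trained boosted-tree model):

```python
import math

def sd(xs):
    """Standard deviation: degree of dispersion of a data set around its mean."""
    mean = sum(xs) / len(xs)
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / len(xs))

def cv(xs):
    """Coefficient of variation: SD relative to the mean."""
    return sd(xs) / (sum(xs) / len(xs))

def pcc(xs, ys):
    """Pearson correlation coefficient between two equally long variables."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    # Denominator: product of the (population) SDs, scaled by n.
    return cov / (sd(xs) * sd(ys) * len(xs))
```

For example, `pcc([1, 2, 3], [2, 4, 6])` is exactly 1.0, since the second series is a positive linear function of the first.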
Clustering aims at partitioning samples into several disjoint subsets, making samples in the same subsets highly similar to each other [
The selection of clustering algorithms mainly depends on the scale and the type of collected data. Clustering can be conducted using traditional algorithms when dealing with numeric or categorical data [
As one of the hierarchical algorithms, the two-step clustering algorithm is also more efficient in handling noise and outliers than partition algorithms. More importantly, it has a unique advantage over other algorithms in its automatic mechanism for determining the optimal number of clusters. Therefore, with regard to the large and mixed transaction data sets of e-commerce, the two-step clustering algorithm is a reliable choice for clustering goods; its key technologies and processes are illustrated in Figure
The key technologies and processes of the two-step clustering algorithm.
The clustering feature (CF) tree growth of the BIRCH algorithm is used to read the data records in the data set one by one, and outliers are handled during this process. Then, subclusters
Take the subclusters
The data records are assigned to the nearest clusters by calculating the log-likelihood distance between the data records and the subclusters of the clusters
The performance of clustering results is measured by silhouette coefficient
ARIMA models obtained from a combination of autoregressive and moving average models [
Therefore, a combined method of parameter determination is proposed to improve the fitting performance of the ARIMA, which combines the results of the ACF and PACF plots with those of the auto.arima() function. The procedures are illustrated in Figure
The procedures for parameter determination of the ARIMA model.
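As a rough illustration of the first half of this procedure, the sample autocorrelations that are read off the ACF plot can be computed as below (a standard-library sketch of the textbook formula, our own code; the second half of the procedure is what R's auto.arima() automates):

```python
def acf(series, max_lag):
    """Sample autocorrelation r_k for lags 1..max_lag.

    Where these values (together with the PACF) cut off or tail away
    suggests candidate MA and AR orders for the ARIMA model.
    """
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    out = []
    for k in range(1, max_lag + 1):
        num = sum((series[t] - mean) * (series[t + k] - mean)
                  for t in range(n - k))
        out.append(num / denom)
    return out
```

Differencing the series once (`[x[t] - x[t - 1] for t in range(1, len(x))]`) before calling `acf` mirrors the first-order difference applied later in the paper.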
XGBoost is short for “Extreme Gradient Boosting” and builds on the gradient boosting framework proposed by Friedman [
The specific steps of feature selection via the XGBoost are as follows: data cleaning, data feature extraction, and data feature selection based on the scores of feature importance.
The model is trained based on the selected features with default parameters.
Parameter optimization is aimed at minimizing the errors between predicted values and actual values. There are three types of parameters in the algorithm, of which the descriptions are listed in Table
The description of parameters in the XGBoost model.
Type of parameters  Parameters  Description of parameters  Main purpose

Booster parameters  max_depth  Maximum depth of a tree  Increasing this value makes the model more complex and more likely to overfit
 min_child_weight  Minimum sum of instance weights in a child  The larger min_child_weight is, the more conservative the algorithm will be
 max_delta_step  Maximum delta step  It can help make the update step more conservative
 gamma  Minimum loss reduction  The larger gamma is, the more conservative the algorithm will be
 subsample  Subsample ratio of the training instances  It is used in the update to prevent overfitting
 colsample_bytree  Subsample ratio of columns for each tree  It is used in the update to prevent overfitting
 eta  Learning rate  Step size shrinkage is used in the update to prevent overfitting

Regularization parameters  alpha  L1 regularization term on weights  Increasing this value will make the model more conservative
 lambda  L2 regularization term on weights  Increasing this value will make the model more conservative

Learning task parameters  reg:linear  Learning objective  It is used to specify the learning task and the learning objective

Command line parameters  n_estimators  Number of estimators  It is used to specify the number of iterative calculations
The general steps for determining the hyperparameters of the XGBoost model are as follows:
In this research, a three-stage XGBoost-based forecasting model, named the C-A-XGBoost model, is proposed in consideration of both the sales features and the tendency of the data series.
In Stage 1, a novel C-XGBoost model is put forward based on clustering and XGBoost, which incorporates different clustering features into forecasting as influencing factors. The two-step clustering algorithm is first applied to partition commodities into different clusters based on their features, and then each of the resulting clusters is modeled via XGBoost.
In Stage 2, an A-XGBoost model is presented by combining the ARIMA with XGBoost to predict the tendency of the time series, which exploits the strong linear fitting ability of ARIMA and the strong nonlinear mapping ability of XGBoost. ARIMA is used to predict the linear part, and the rolling prediction method is employed to build an XGBoost model that revises the nonlinear part of the data series, namely, the residuals of the ARIMA.
In Stage 3, a combination model named C-A-XGBoost is constructed from the C-XGBoost and A-XGBoost. The C-A-XGBoost is aimed at minimizing the error sum of squares by assigning weights to the results of C-XGBoost and A-XGBoost, in which the weights reflect the reliability and credibility of the sales features and the tendency of the data series.
The procedures of the proposed threestage model are demonstrated in Figure
The procedures of the proposed threestage model.
The two-step clustering algorithm is applied to cluster a data series into several disjoint clusters. Then, each of the resulting clusters is used as the input and output sets to construct and optimize the corresponding C-XGBoost model. Finally, testing samples are partitioned into the corresponding cluster by the trained two-step clustering model, and the prediction results are calculated by the corresponding trained C-XGBoost model.
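The cluster-then-forecast logic of this stage can be outlined as follows. This is a structural sketch only (our own code): the per-cluster mean predictor stands in for the tuned XGBoost regressor of each cluster, and `assign_cluster` stands in for the trained two-step clustering model.

```python
from collections import defaultdict

def train_per_cluster(targets, labels):
    """Group training targets by cluster label and fit one model per cluster.

    The 'model' here is simply the cluster mean of the target, a stand-in
    for an XGBoost model trained on that cluster's samples.
    """
    grouped = defaultdict(list)
    for y, label in zip(targets, labels):
        grouped[label].append(y)
    return {label: sum(ys) / len(ys) for label, ys in grouped.items()}

def predict(models, assign_cluster, x):
    """Route a test sample to its cluster, then apply that cluster's model."""
    return models[assign_cluster(x)]

models = train_per_cluster([10, 12, 50, 54], ["A", "A", "B", "B"])
# Hypothetical assignment rule: small values go to cluster A, large to B.
pred = predict(models, lambda x: "A" if x < 30 else "B", 11)  # pred == 11.0
```

The point of the structure is that the test set never chooses its own model: the trained clustering model decides which per-cluster forecaster handles each sample.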
After the data series pass the tests of stationarity and white noise, the optimal ARIMA is trained and determined based on the minimum AIC; the processes are described in Section
The final results for the test set are calculated by summing the predicted linear part from the trained ARIMA and the predicted residuals from the established XGBoost.
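The summation in this step can be sketched as below. For self-containment this is our own stand-in code: a least-squares linear trend replaces the trained ARIMA, and a naive carry-forward of the last residual replaces the rolling XGBoost; only the combination structure (linear part + predicted residual) mirrors the paper.

```python
def fit_linear_trend(series):
    """Least-squares line y = a + b*t over t = 0..n-1.

    Stand-in for the trained ARIMA, which supplies the linear part
    of the data series in the A-XGBoost model.
    """
    n = len(series)
    tm, ym = (n - 1) / 2, sum(series) / n
    b = sum((t - tm) * (y - ym) for t, y in enumerate(series)) / sum(
        (t - tm) ** 2 for t in range(n))
    return ym - b * tm, b

def hybrid_forecast(series, steps):
    """A-XGBoost structure: final forecast = linear part + predicted residual."""
    a, b = fit_linear_trend(series)
    residuals = [y - (a + b * t) for t, y in enumerate(series)]
    # Stand-in residual model: carry the last residual forward, in place
    # of the rolling XGBoost trained on the ARIMA residuals.
    r_hat = residuals[-1]
    n = len(series)
    return [(a + b * (n + h)) + r_hat for h in range(steps)]
```

On a perfectly linear series the residual correction contributes nothing; its value appears when the linear model leaves systematic structure behind, which is exactly the case the paper's XGBoost step targets.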
In this stage, a combination strategy is explored to minimize the error sum of squares
The least squares method is employed to explore the optimal weights (
In equation (
In equation (
In equation (
Equation (
Equation (
According to equation (
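Under the constraint that the two weights sum to one, minimizing the error sum of squares has a closed-form solution. The sketch below is our own derivation of that standard least-squares result, with `f1` and `f2` denoting the C-XGBoost and A-XGBoost forecasts:

```python
def combination_weight(actual, f1, f2):
    """Weight w for f1 (and 1 - w for f2) minimizing the error sum of squares.

    Substituting w2 = 1 - w1 into sum (y - w*f1 - (1-w)*f2)^2 and setting
    the derivative with respect to w to zero gives the closed form below.
    """
    num = sum((y - b) * (a - b) for y, a, b in zip(actual, f1, f2))
    den = sum((a - b) ** 2 for a, b in zip(f1, f2))
    return num / den

def combine(f1, f2, w):
    """Final forecast: weighted sum of the two models' results."""
    return [w * a + (1 - w) * b for a, b in zip(f1, f2)]
```

For instance, if `f1` already equals the actual values, the weight comes out as exactly 1 and the combination collapses to `f1`.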
To illustrate the effectiveness of the developed C-A-XGBoost model, the following data series are used to verify the forecasting performance.
As listed in Table
The description of source data series.
Data series  Fields 

Customer behavior data^{a}  Data date; goods click; cart click; favorites click 
Goods information data^{b}  Goods id; SKU^{i} id; level; season; brand id 
Goods sales data^{c}  Data date; SKU sales; goods price; original shop price 
The relationship between goods id and SKU id^{d}  Goods id; SKU id 
Goods promote price^{e}  Data date; goods price; goods promotion price 
Marketing^{f}  Data date; marketing; plan 
Holidays^{g}  Data date; holiday 
Temperature^{h}  Data date; temperature mean 
^{a–f}The six data series are sourced from the historical data of the Saudi Arabian market on the Jollychic cross-border e-commerce trading platform (
There are 10 continuous attributes and 6 categorical attributes in the clustering series, which are obtained by reconstructing the source data series. The attribute descriptions of the clustering series are illustrated in Table
The description of clustering series.
Fields  Meaning of fields  Fields  Meaning of fields 

Data date  Date  Favorites click  Number of clicks on favorites 
Goods code  Goods code  Sales unique visitor  Number of unique visitors 
SKU code  SKU code  Goods season  Seasonal attributes of goods 
SKU sales  Sales of SKU  Marketing  Activity type code 
Goods price  Selling price  Plan  Activity rhythm code 
Original shop price  Tag price  Promotion  Promotion code 
Goods click  Number of clicks on goods  Holiday  The holiday of the day 
Cart click  Number of clicks on purchasing carts  Temperature mean  Mean of air temperatures (°F) 
To verify the performance of the proposed model according to performance evaluation indexes, some uniform experimental conditions are established as follows.
As shown in Table
The clustering series cover samples of 381 days.
For the C-XGBoost model, training set 1, namely, the samples of the first 347 days in the clustering series, is utilized to establish the two-step clustering models. The resulting samples of two-step clustering are used to construct the XGBoost models. The test set with the remaining samples of 34 days is selected to validate the C-XGBoost model. In detail, the test set is first partitioned into the corresponding clusters by the established two-step clustering model, and then the test set is applied to checking the validity of the corresponding C-XGBoost models.
For the A-XGBoost model, training set 2, with the samples of the 1st–277th days, is used to construct the ARIMA, and the validation set is used to calculate the residuals of the ARIMA forecast, which are in turn used to train the A-XGBoost model. Then, the test set is employed to verify the performance of the model.
The test set contains the final 34 samples, which are employed to fit the optimal combination weights for the C-XGBoost and A-XGBoost models.
The description of the training set, validation set, and test set.
Data set  Samples  Number of weeks  Start date  End date  The first day  The last day 

Training set 1  Clustering series  50  Mar.1, 2017 (WED)  Dec.2, 2017 (SAT)  1  347 
Training set 2  SKU code = 94033  50  Mar.1, 2017 (WED)  Dec.2, 2017 (SAT)  1  277 
Validation set  SKU code = 94033  10  Dec.3, 2017 (SUN)  Feb.10, 2018 (SAT)  278  347 
Test set  SKU code = 94033  5  Feb.11, 2018 (SUN)  Mar.16, 2018 (FRI)  348  381 
Several performance measures have previously been applied to verifying the viability and effectiveness of forecasting models. As illustrated in Table
The description of evaluation indexes.
Evaluation indexes  Expression  Description

ME  (1/n)Σe_i  The mean error
MSE  (1/n)Σe_i²  The mean squared error
RMSE  √((1/n)Σe_i²)  The root mean squared error
MAE  (1/n)Σ|e_i|  The mean absolute error

Here e_i denotes the forecast error of the ith sample.
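The four indexes can be written directly from their definitions (a standard-library sketch, our own code; the error convention e_i = y_i − ŷ_i is assumed):

```python
import math

def me(actual, pred):
    """Mean error: average signed deviation, i.e., forecast bias."""
    return sum(a - p for a, p in zip(actual, pred)) / len(actual)

def mse(actual, pred):
    """Mean squared error."""
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(mse(actual, pred))

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)
```

Note that ME can be near zero even for a poor model when positive and negative errors cancel, which is why the squared and absolute indexes are reported alongside it.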
The first priority in optimization is to tune max_depth and min_child_weight with the other parameters fixed, which is the most effective way to optimize the XGBoost. The ranges of max_depth and min_child_weight are 6–10 and 1–6, respectively. The default values of the parameters are listed in Table
Default parameters values of XGBoost.
Parameters  Number of estimators  max_depth  min_child_weight  max_delta_step  Objective  subsample  eta

Default value  100  6  1  0  reg:linear  1  0.3

Parameters  gamma  colsample_bytree  colsample_bylevel  alpha  lambda  scale_pos_weight

Default value  0.1  1  1  0  1  1
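This first tuning step amounts to a plain grid search over the two parameters. The sketch below shows the loop structure only (our own code): `evaluate_mae` is a hypothetical stand-in for training an XGBoost model with the given pair and scoring it on held-out data.

```python
from itertools import product

def grid_search(evaluate_mae):
    """Scan max_depth in 6..10 and min_child_weight in 1..6, keeping the
    pair with the lowest validation MAE, all other parameters fixed."""
    best_pair, best_mae = None, float("inf")
    for depth, mcw in product(range(6, 11), range(1, 7)):
        score = evaluate_mae(depth, mcw)
        if score < best_mae:
            best_pair, best_mae = (depth, mcw), score
    return best_pair, best_mae

# Toy objective with a known optimum at (9, 2), echoing the pair the paper
# reports for the C12_3_XGBoost model.
pair, score = grid_search(lambda d, m: abs(d - 9) + abs(m - 2))  # pair == (9, 2)
```

In practice the same loop is simply repeated for the later parameter groups (gamma, subsample, regularization terms) while the already-tuned values stay fixed.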
As shown in Figure
Model summary and cluster quality of the twostep clustering model. (a) The summary of the twostep clustering model. (b) The Silhouette coefficient of cohesion and separation for 12 clusters.
As illustrated in Figure
Cluster sizes of the clustering series by twostep clustering algorithm.
Take the cluster
For
Feature importance score of the C12_3_XGBoost model.
Setting the 11 features of the cluster
The prebuilt
Figure
ME and MAE of C12_3_XGBoost under different parameters. (a) Mean error of the training set. (b) Mean absolute error of the training set.
As shown in Table
The results of C-XGBoost for the test set.
Test set  Days  Cluster  Model  Depth and min_child_weight  Training set 1 ME  Training set 1 MAE  Test set ME  Test set MAE

348th–372nd  25  3  C12_3_XGBoost model  (9, 2)  0.351  0.636  4.385  4.400
373rd–381st  9  4  C12_4_XGBoost model  (10, 2)  0.339  0.591  1.778  2.000
348th–381st  34  —  —  —  —  —  3.647  3.765
As illustrated in Figure
ME and MAE of C12_4_XGBoost under different parameters. (a) Mean error of the training set. (b) Mean absolute error of the training set.
As shown in Figure
Plots of (a) SKU sales with days change, (b) ACF, and (c) PACF.
As illustrated in Figure
Plots of (a) SKU sales with days change, (b) ACF, and (c) PACF after the first-order difference.
As a result, the possible optimal models are ARIMA (2, 1, 2), ARIMA (2, 1, 3), and ARIMA (2, 1, 4) according to the plots of ACF and PACF in Figure
Table
AIC values of the resulting ARIMA by the auto.arima() function.
ARIMA (p, d, q)  AIC  ARIMA (p, d, q)  AIC 

ARIMA (2, 1, 2) with drift  2854.317  ARIMA (0, 1, 2) with drift  2852.403 
ARIMA (0, 1, 0) with drift  2983.036  ARIMA (1, 1, 2) with drift  2852.172 
ARIMA (1, 1, 0) with drift  2927.344 


ARIMA (0, 1, 1) with drift  2851.296  ARIMA (1, 1, 1)  2851.586 
ARIMA (0, 1, 0)  2981.024  ARIMA (0, 1, 2)  2851.460 
ARIMA (1, 1, 1) with drift  2852.543  ARIMA (1, 1, 2)  2851.120 
To further determine the optimal model, the AIC and RMSE of ARIMA models under different parameters are summarized in Table
AIC values and RMSE of ARIMA models under different parameters.
ARIMA model  ARIMA (p, d, q)  AIC  RMSE 

1  ARIMA (0, 1, 1)  2850.170  41.814 
2  ARIMA (2, 1, 2)  2852.980  41.572 
3  ARIMA (2, 1, 3)  2854.940  41.567 




The performance evaluation of A-XGBoost.
A-XGBoost  Validation set  Test set

Minimum error  −0.003  −8.151 
Maximum error  0.002  23.482 
Mean error  0.000  1.213 
Mean absolute error  0.001  4.566 
Standard deviation  0.001  6.262 
Linear correlation  1  −0.154 
Occurrences  70  34 
The optimal combination weights are determined by minimizing the MSE in equation (
For the test set, the weights
In this section, the following models are chosen for the comparison between the proposed models and other classical models:
In this section, the test set is used to verify the superiority of the proposed C-A-XGBoost.
Figure
Comparison of the SKU sales with the predicted values of five models in Section
It can be seen that C-A-XGBoost has the best fit to the original values, as its fitting curve is the closest of the five to the curve of the actual values
To further illustrate the superiority of the proposed C-A-XGBoost, the evaluation indexes mentioned in Section
The performance evaluation of ARIMA, XGBoost, A-XGBoost, C-XGBoost, and C-A-XGBoost.
Evaluation indexes  ARIMA  XGBoost  A-XGBoost  C-XGBoost  C-A-XGBoost

ME  −21.346  −3.588  1.213  3.647  0.288 
MSE  480.980  36.588  39.532  23.353  10.769 
RMSE  21.931  6.049  6.287  4.832  3.282 
MAE  21.346  5.059  4.566  3.765  2.515 
According to Table
C-XGBoost is inferior to C-A-XGBoost but outperforms the other three models, underlining that C-XGBoost is superior to the single XGBoost.
A-XGBoost performs better than ARIMA, proving that XGBoost is effective for the residual modification of ARIMA.
According to the analysis above, the proposed C-A-XGBoost has the best forecasting performance for the sales of commodities in the cross-border e-commerce enterprise.
In this research, a new XGBoost-based forecasting model named C-A-XGBoost is proposed, which takes both the sales features and the tendency of the data series into account.
The C-XGBoost is first presented, combining clustering and XGBoost and aiming to reflect the sales features of commodities in forecasting. The two-step clustering algorithm is applied to partition the data series into different clusters based on selected features, which are used as the influencing factors for forecasting. After that, the corresponding C-XGBoost models are established for the different clusters using XGBoost.
The proposed A-XGBoost takes advantage of the ARIMA in predicting the tendency of the data series and overcomes its disadvantages by applying XGBoost to the nonlinear part of the data series. The optimal ARIMA is obtained by comparing the AICs under different parameters, and the trained ARIMA model is then used to predict the linear part of the data series. For the nonlinear part, rolling prediction is conducted by the trained XGBoost, whose inputs and outputs are the residuals produced by the ARIMA. The final results of the A-XGBoost are calculated by adding the residuals predicted by the XGBoost to the corresponding forecast values of the ARIMA.
In conclusion, the C-A-XGBoost is developed by assigning appropriate weights to the forecasting results of the C-XGBoost and A-XGBoost so as to exploit their respective strengths. Consequently, a linear combination of the two models' forecasting results is calculated as the final predictive values.
To verify the effectiveness of the proposed C-A-XGBoost, the ARIMA, XGBoost, C-XGBoost, and A-XGBoost are employed for comparison. Meanwhile, four common evaluation indexes, including ME, MSE, RMSE, and MAE, are utilized to check the forecasting performance of the C-A-XGBoost. The experiments demonstrate that the C-A-XGBoost outperforms the other models, indicating that it provides theoretical support for the sales forecasts of the e-commerce company and can serve as a reference for selecting forecasting models. It is advisable for the e-commerce company to choose different forecasting models for different commodities instead of utilizing a single model.
Two potential extensions are put forward for future research. On the one hand, there may be no model in which all evaluation indicators are minimal, which makes it difficult to choose the optimal model; a comprehensive evaluation index of forecasting performance will therefore be constructed to overcome this difficulty. On the other hand, sales forecasting is ultimately used to optimize inventory management, so relevant factors should also be considered, including inventory cost, order lead time, delivery time, and transportation time.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that there are no conflicts of interest regarding the publication of this paper.
This research was supported by the National Key R&D Program of China through the China Development Research Foundation (CDRF) funded by the Ministry of Science and Technology (CDRFSQ2017YFGH002106).