E-commerce offers a wide variety of merchandise for sale and purchase, with frequent transactions and commodity flows. Reducing costs requires accurate prediction of customer needs and optimized allocation of goods. Existing solutions have significant errors and are unsuitable for addressing warehouse needs and allocation. As a result, businesses cannot respond to customer demands promptly; they need accurate and reliable demand forecasting. Therefore, this paper proposes spatial feature fusion and grouping strategies based on multimodal data and builds a neural network prediction model for e-commerce commodity demand. The designed model extracts order sequence features, consumer emotional features, and facial value features from the multimodal data of e-commerce products. Then, a grouping strategy based on a bidirectional long short-term memory network (BiLSTM) is proposed. This strategy fully learns the contextual semantics of time series data while reducing the influence of other features on each group's local features. Because the output features of multimodal data are highly spatially correlated, this paper employs a spatial dimension fusion strategy for feature fusion. This strategy effectively obtains the deep spatial relations among multimodal data by integrating the features of each column in each group across spatial dimensions. Finally, the proposed model's prediction effect is tested using an e-commerce dataset. The experimental results demonstrate the proposed algorithm's effectiveness and superiority.
The e-commerce platform offers a wide range of commodities, with frequent purchases, transactions, and commodity flows. The dynamic and complex business environment has posed significant challenges for business decision-making. As a result, inventory management has become more complex, and supply chain costs have risen steadily [
Fortunately, the rise of the mobile Internet, low-cost sensors, and low-cost storage has made large amounts of data easier to obtain. In addition to historical sales data, we can collect many other kinds of log data on e-commerce products over time. These include consumer reviews, consumer portraits, page views (PV), search page views (SPV), user views (UV), search user views (SUV), selling price (PAY), user location (UL), and total merchandise sales (GMV). Combined with cheap computing power, especially the dramatic increase in GPU performance, such data provides a broad space for applying neural networks. Based on the above observations, this paper proposes spatial feature fusion and grouping strategies based on multimodal data and builds a neural network prediction model for e-commerce commodity demand. First, we consider the multimodal data of e-commerce products (such as historical orders, consumer reviews, and consumer portraits) and extract different features from them: order sequence features, consumer emotional features, and facial value features. Then, we propose a grouping strategy based on a bidirectional long short-term memory network (BiLSTM). The network fully learns the contextual semantics of time series data while reducing the influence of other features on the local features of the group.
The main innovations of this article are as follows:

(1) This paper considers numerical data such as historical orders as well as text and image data such as consumer comments and portraits. As nonlinear data like text and images become more important in e-commerce prediction tasks, their analytical value increases. Semantic sentiment analysis of consumer comments gives an intuitive view of a customer's desire to buy a particular product. Calculating the appearance level of the consumer portrait lets us depict the consumer and understand the consumer's preferences, which is useful for improving the prediction model's performance.

(2) This paper proposes a novel grouping strategy that considers both the long-distance dependence and the short-distance dependence in sequence data. It addresses the problem that the recurrent neural network only pays attention to the long-distance dependence in sequence data, so that when a significant distance separates two sets of features, their connection is weak and less information is retained.

(3) This paper proposes a novel spatial dimension fusion strategy. It effectively obtains the deep spatial relations among multimodal data by integrating each column's features across spatial dimensions.

The prediction effect of this model is verified using a dataset created by an e-commerce platform. The experimental results demonstrate the effectiveness and superiority of our algorithm.
The rest of the paper is organized as follows. In Section
This section discusses related work from various aspects to understand the problem addressed in this paper.
Demand forecasting [

Traditional time series forecasting: the time series forecasting method is based on the continuity with which things develop. It uses historical data and statistical analysis to infer the future development trend. Models for time series forecasting

Traditional time series forecasting methods have proven to be simple and efficient when dealing with relatively simple linear data. They are widely used in all walks of life [

Combination forecasting: a single forecasting method struggles with some relatively complex and challenging forecasting tasks. Forecasting accuracy can be improved by combining a reasonable number of different methods in a principled manner. Scholars have studied combination forecasting methods extensively since J. M. Bates and C. W. J. Granger published "The Combination of Forecasts" in 1969. Huard et al. [

The combined forecasting model, in general, has clear advantages. For example, it can handle some relatively complex and difficult forecasting tasks, and it is generally better than a single method based on the traditional time series model. With complex unstructured multimodal data, however, the combined forecasting method is still difficult to use.

Deep learning-based prediction model: with the booming development of the mobile Internet and the arrival of big data, the business of the e-commerce platform has become more complex and its data volume huge. Nonlinear and unstructured data has become the most valuable data. Neither traditional time series forecasting methods nor combined forecasting methods can cope with the increasingly complex and challenging task of e-commerce demand forecasting, and both are at a natural disadvantage in mining and processing nonlinear and unstructured data. Fortunately, advances in computing power and the rise of deep learning [
According to the review, the above deep learning prediction models go beyond traditional linear numerical data such as order sales and can exploit unstructured multimodal data such as images and text semantics, which is becoming increasingly important in the e-commerce industry. As a result, using deep learning technology to build predictive models has become commonplace in the e-commerce industry.
Figure
Schematic diagram of the overall architecture of our algorithm.
Feature engineering is a critical step in the data preprocessing stage that ensures the best possible feature data for the prediction task. This article first performs feature construction, feature selection, feature extraction, and feature processing on historical order data.
To extract the basic features, first, select the basic characteristics that influence goods demand, such as the attributes of the commodity itself, sales volume, commodity market performance, and time. In general, ready-made nonattribute feature data required in the research can be extracted statistically when selecting basic features. Each goods ID has 20 basic features, which are counted and extracted in this paper.
The distinguishing factors obtained in this paper include all kinds of commodity attributes, price, market performance, commodity sales, and other characteristic data. We use the time sliding window method to deal with the demand and characteristics of commodities every week. One week (7 days) is taken as a window, in which the demands of each commodity in different areas are called labels. The working principle of the sliding window method is shown in Figure
A schematic diagram of the time sliding window.
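The weekly sliding window described above can be illustrated with a minimal sketch. It assumes that each sample pairs one window of daily demand (the features) with the total demand of the following window (the label), and that the window advances by a configurable step; the function name `sliding_windows` and the `step` parameter are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def sliding_windows(
    daily_demand: List[float], window: int = 7, step: int = 7
) -> List[Tuple[List[float], float]]:
    """Build (features, label) samples from a daily demand series.

    Assumption: features are the demands inside one window and the label
    is the summed demand of the next window of the same length.
    """
    samples = []
    last_start = len(daily_demand) - 2 * window
    for start in range(0, last_start + 1, step):
        features = daily_demand[start:start + window]
        label = sum(daily_demand[start + window:start + 2 * window])
        samples.append((features, label))
    return samples


# Three weeks of synthetic daily demand: 1, 2, ..., 21.
samples = sliding_windows(list(range(1, 22)))
```

With three weeks of data and a 7-day step, this yields two samples: week 1 predicting week 2, and week 2 predicting week 3.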
This paper applies a scaling method to continuous numerical features. Some features of historical e-commerce transaction data, such as the number of views and favorites, take large values, while others take relatively small values; such a wide range of feature values usually slows the algorithm's convergence. This paper therefore standardizes all features so that each has a mean of 0 and a variance of 1, which speeds up learning and, in turn, model training. The standardized formula is as follows:

$$x' = \frac{x - \mu}{\sigma},$$

where $x$ is the original feature value, and $\mu$ and $\sigma$ are the mean and standard deviation of that feature.
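As a minimal sketch of scaling a feature to mean 0 and variance 1, the function below applies z-score standardization to one feature column. The name `standardize` and the choice of the population variance are assumptions made for illustration; the paper does not state which variance estimator it uses.

```python
import math
from typing import List


def standardize(values: List[float]) -> List[float]:
    """Scale one feature column to mean 0 and variance 1 (z-score)."""
    n = len(values)
    mean = sum(values) / n
    # Population variance (divide by n); an assumption for this sketch.
    variance = sum((v - mean) ** 2 for v in values) / n
    std = math.sqrt(variance)
    if std == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0] * n
    return [(v - mean) / std for v in values]


# Page-view counts are large; after scaling they are comparable
# with small-valued features.
scaled_views = standardize([1200.0, 3400.0, 560.0, 9800.0])
```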
Attribute or category data are handled with one-hot encoding and distributed representation. Because such feature values are discrete rather than continuous, and there is no ordering between categories, one-hot coding is used. One-hot vectors are high-dimensional and sparse, so a distributed representation of the encoded values can then reduce the dimension and sparsity of the feature data.
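The one-hot step can be sketched as follows. The function name `one_hot` and the example category values are hypothetical; the sketch only illustrates how an unordered categorical feature becomes a set of binary columns.

```python
from typing import Dict, List, Tuple


def one_hot(values: List[str]) -> Tuple[List[str], List[List[int]]]:
    """Encode an unordered categorical feature as binary indicator vectors.

    Returns the sorted category list (one column per category) and the
    encoded rows, each containing exactly one 1.
    """
    categories = sorted(set(values))
    index: Dict[str, int] = {c: i for i, c in enumerate(categories)}
    encoded = []
    for v in values:
        row = [0] * len(categories)
        row[index[v]] = 1
        encoded.append(row)
    return categories, encoded


# Hypothetical product-category attribute for three goods.
categories, encoded = one_hot(["lotion", "essence", "lotion"])
```

Each row has as many columns as there are distinct categories, which is why a feature with many categories produces wide, sparse vectors.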
In this experiment, text sentiment analysis was used to analyze comments on skincare products that were crawled from an e-commerce platform as shown in Figure
A screenshot of comment data on an e-commerce platform.
Each review has a review star, the content of the review, the product reviewed, and the review time, as shown in Figure
Data sample of consumer reviews.
Positive emotion text | Negative emotion text |
---|---|
I have been using this essence water and essence lotion for more than 2 years, and it is best to use it in autumn and winter | Not suitable for me |
Buy again, good hydrating effect | The QR code cannot scan the product information, and the face does not feel hydrated when used, and it dries quickly, not as easy to use as other brands |
Since the segmentation of wheat field plantation row images is a binary classification task, the numbers of vectors in primary caps and digit caps are both set to 2. The number of capsules in digit caps is also set to 2. In addition, this paper uses the ReLU function as the activation function of the network and uses the sigmoid function for classification.
The existing Chinese word segmentation tool JIEBA is used in this paper to perform Chinese word segmentation tasks. It employs a standard probabilistic language model word segmentation method. It can perform various tasks, including word part-of-speech tagging and keyword extraction from text data. The removal of stop words and vectorization of the text will be easier with good word segmentation.
Figure
Statistical histogram of text data length.
The word2vec model maps words into a vector space with high efficiency. It primarily uses the context information of text data, employing a neural network to map all text data into a lower-dimensional, practical, and dense real-valued matrix. The skip-gram model as shown in Figure
Schematic diagram of skip-gram.
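To make the skip-gram idea concrete, the sketch below generates the (center word, context word) training pairs that a skip-gram model learns from: each word is used to predict the words within a fixed window around it. The function name `skipgram_pairs`, the window size, and the sample tokens are assumptions for illustration; the embedding network itself is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token in the sequence."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Tokens as they might come out of word segmentation of a review.
pairs = skipgram_pairs(["good", "hydrating", "effect"], window=1)
```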
In Figure
Finally, the
The facial value calculation is used to determine the skin type of the consumer. As far as is known, there are five skin types: normal, dry, oily, mixed, and sensitive. This paper builds a CNN model that evaluates facial appearance and outputs the skin quality classification result.
As shown in Figure
Schematic diagram of facial value calculation and skin quality evaluation model.
The most important feature of a recurrent neural network is that it makes predictions by combining current and previous feature information. When only the most recent part of the information is needed, the goal is to better preserve the information between nearby features in the sequence. When a greater distance separates two groups of features, the connection between them is weaker, and thus less information is retained. This not only lowers the final prediction accuracy but also increases the model's computational complexity. As a result, the GBL grouping sequence strategy was created. The following is the grouping strategy's calculation equation:
To extract feature sequences
The three groups of extracted feature sequences
Through the grouping strategy in the previous section, we have fully obtained the local context information of each group. However, to ensure the integrity of the contextual information of the features across the entire multimodal data, we use multiscale traditional convolution and multiscale cavity (dilated) convolution to fuse the different sets of features along the spatial dimension. To make the difference between spatial dimension fusion and traditional fusion more intuitive, a detailed explanation is given in Figure
(a) Traditional feature fusion strategy. (b) Spatial dimensional feature fusion strategy.
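The difference between traditional convolution and cavity (dilated) convolution can be illustrated with a one-dimensional sketch. This is not the paper's fusion network: the function `conv1d`, the kernel, and the input are hypothetical, and the operation is the unflipped sliding dot product used in deep learning, with no padding. With a dilation greater than 1, the same kernel skips positions and therefore covers a wider span of the input.

```python
from typing import List


def conv1d(signal: List[float], kernel: List[float], dilation: int = 1) -> List[float]:
    """Slide a kernel over a 1-D signal; dilation spaces out the kernel taps."""
    # Distance between the first and last input position touched by the kernel.
    span = (len(kernel) - 1) * dilation
    out = []
    for start in range(len(signal) - span):
        out.append(
            sum(k * signal[start + i * dilation] for i, k in enumerate(kernel))
        )
    return out


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0, 1.0]

standard = conv1d(signal, kernel, dilation=1)  # each output covers 3 neighbours
dilated = conv1d(signal, kernel, dilation=2)   # each output covers a span of 5
```

Running several kernel sizes and dilation rates side by side gives features at multiple scales, which is the general idea behind multiscale convolution.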
Figure
Figure
The dataset in this article consists of historical sales data, consumer review data, and consumer portrait data collected from a skincare product e-commerce platform. It contains historical information on 200 products over more than a year, more than 20,000 records in total. Through data cleaning and feature engineering, this paper constructs a training set and a test set that can be used for neural networks.
The main parameters are shown in Table
Hyperparameter setting.
Hyperparameter | Value |
---|---|
Optimizer | Adam |
Learning rate | 0.001 |
β1 | 0.9 |
β2 | 0.999 |
Epsilon | 1 |
Decay | 3 |
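For readers unfamiliar with how these hyperparameters enter the optimizer, the following is a minimal sketch of one Adam update for a single scalar parameter, using the learning rate 0.001 and the moment coefficients 0.9 and 0.999 from the table. The function name `adam_step` is hypothetical, `eps=1e-8` is an assumed value rather than one taken from the table, and learning-rate decay is omitted.

```python
import math
from typing import Tuple


def adam_step(
    param: float,
    grad: float,
    m: float,
    v: float,
    t: int,
    lr: float = 0.001,
    beta1: float = 0.9,
    beta2: float = 0.999,
    eps: float = 1e-8,  # assumed value, not taken from the table
) -> Tuple[float, float, float]:
    """Apply one Adam update and return (new_param, new_m, new_v).

    m and v are the running first and second moment estimates;
    t is the 1-based step counter used for bias correction.
    """
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)  # bias-corrected second moment
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# First step from param = 1.0 with gradient 2.0.
param, m, v = adam_step(1.0, 2.0, m=0.0, v=0.0, t=1)
```

On the first step the bias correction makes the update size approximately equal to the learning rate, regardless of the gradient's magnitude.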
For prediction problems, evaluation indicators are needed to verify the feasibility and accuracy of the prediction model. E-commerce commodity demand forecasting generally serves the purchasing and inventory replenishment of e-commerce companies, and for an equivalent error, mis-forecasting a commodity that sells in large quantities has a greater impact than mis-forecasting one that sells in small quantities. Therefore, the metrics selected in this paper consider both the error between the predicted value and the true value and the ratio of that error to the true value.
In the following, $y_i$ denotes the real quantity, $\hat{y}_i$ the predicted quantity, and $n$ the number of observations.

Mean square error (MSE): this indicator is the average of the squared differences between the real quantity and the predicted quantity. The calculation equation is as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$

Root mean square error (RMSE): this indicator is the square root of the ratio of the sum of squared differences between the real quantity and the predicted quantity to the number of observations. It measures the deviation between the predicted quantity and the real quantity. The calculation equation is as follows:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Mean absolute error (MAE): this metric is the average of the absolute errors. It directly reflects the size of the forecast error, i.e., the difference between the actual quantity and the forecast. The calculation equation is as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

Mean absolute percentage error (MAPE): this metric considers the difference between the predicted and actual value and, at the same time, the ratio between the prediction error and the true value. The calculation equation is as follows:

$$\mathrm{MAPE} = \frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
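The four indicators can be computed as in the sketch below, which follows their standard definitions. The function names and the sample values are illustrative only and do not come from the paper's dataset; MAPE assumes that no true value is zero.

```python
import math
from typing import List


def mse(y_true: List[float], y_pred: List[float]) -> float:
    """Mean of squared differences between real and predicted quantities."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: List[float], y_pred: List[float]) -> float:
    """Square root of the MSE, in the same unit as the demand."""
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true: List[float], y_pred: List[float]) -> float:
    """Mean of absolute differences between real and predicted quantities."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true: List[float], y_pred: List[float]) -> float:
    """Mean absolute error relative to the true value, in percent.

    Assumes every true value is nonzero.
    """
    ratios = [abs((t - p) / t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(ratios) / len(y_true)


# Illustrative weekly demand for two commodities.
y_true = [100.0, 200.0]
y_pred = [110.0, 190.0]
scores = {
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
}
```

In this example both forecasts miss by 10 units, so MSE, RMSE, and MAE treat them equally, while MAPE weights the miss on the smaller true value (10%) more heavily than the miss on the larger one (5%).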
Since the final output of the model is a probability distribution, in order to be able to obtain the predicted value of each tested product, this article uses a sampling method to output the predicted value. It selects
Comparison results of forecast results of commodity demand on the test dataset. (a) Test commodity 1; (b) test commodity 2; (c) test commodity 3; (d) test commodity 4; (e) test commodity 5; (f) test commodity 6.
It can be seen from Figure
Forecast error of commodity demand.
Good_ID | RMSE | MAPE (%) | MSE | MAE |
---|---|---|---|---|
192 | 2.3979 | 1.32 | 5.75 | 1.75 |
73 | 3.2787 | 1.50 | 10.75 | 2.25 |
106 | 3.4820 | 1.62 | 12.12 | 2.62 |
213 | 2.6457 | 1.41 | 7.00 | 2.00 |
50 | 2.3184 | 1.36 | 5.38 | 1.88 |
239 | 2.0311 | 1.27 | 5.25 | 1.75 |
It can be seen from Table
In this section, we conducted an ablation experiment on the multimodal data. We separated and combined the data of the three modalities to observe the influence of each part on the experimental results.
Errors in the forecast of demand for two test commodities.
Modal type | Commodity 1 (Good_ID: 192) RMSE | Commodity 1 (Good_ID: 192) MAPE (%) | Commodity 2 (Good_ID: 73) RMSE | Commodity 2 (Good_ID: 73) MAPE (%) |
---|---|---|---|---|
A | 11.2562 | 6.25 | 10.6542 | 5.75 |
B | 10.2542 | 5.22 | 11.1320 | 6.94 |
C | 8.2545 | 5.21 | 12.1212 | 4.69 |
A + B | 7.9856 | 3.25 | 7.0982 | 3.99 |
B + C | 4.1542 | 2.76 | 3.9956 | 3.36 |
A + C | 4.5698 | 2.27 | 3.4785 | 2.65 |
Ours |
As shown in Table
RMSE and MAPE of two test commodities demand forecasts.
Since the model in this paper adopts traditional feature fusion and spatial feature fusion strategies, to deeply analyze the impact of the above two strategies on the experimental results, the feature fusion ablation experiment is conducted. We assume that
Results of feature fusion ablation experiment.
Modal type | Commodity 1 (Good_ID: 192) RMSE | Commodity 1 (Good_ID: 192) MAPE (%) | Commodity 2 (Good_ID: 73) RMSE | Commodity 2 (Good_ID: 73) MAPE (%) |
---|---|---|---|---|
16.3585 | 5.96 | 13.4856 | 6.11 | |
12.2987 | 4.21 | 9.3654 | 5.94 | |
It can be seen from Table
To further verify the effectiveness and superiority of this model, this section compares it with other prediction methods on the same dataset. ARIMA and MLP-LSTM are selected as the comparison models.
From Tables
RMSE comparative experimental results of different methods.
Methods | Good_ID: 192 | Good_ID: 73 | Good_ID: 106 | Good_ID: 213 | Good_ID: 50 | Good_ID: 239 |
---|---|---|---|---|---|---|
ARIMA | 26.659 | 25.3666 | 30.458 | 19.2565 | 17.0025 | 22.3695 |
MLP-LSTM | 22.0145 | 17.3695 | 25.2365 | 21.5625 | 20.6953 | 15.6526 |
Ours |
MAPE comparative experimental results of different methods.
Methods | Good_ID: 192 (%) | Good_ID: 73 (%) | Good_ID: 106 (%) | Good_ID: 213 (%) | Good_ID: 50 (%) | Good_ID: 239 (%) |
---|---|---|---|---|---|---|
ARIMA | 8.25 | 6.96 | 7.99 | 5.74 | 6.14 | 9.25 |
MLP-LSTM | 3.65 | 6.59 | 9.25 | 2.96 | 3.48 | 5.11 |
Ours |
For e-commerce companies, accurate and reliable commodity demand forecasting is essential. This paper proposes spatial feature fusion and grouping strategies based on multimodal data and establishes a neural network prediction model for e-commerce commodity demand. First, the ablation experiment demonstrated the positive influence of multimodal data on the prediction task, indicating that consumer reviews and consumer portraits are important factors in demand forecasting. In addition, we found that the features of the three modalities are not independent but closely related; we call these relationships spatial relationships. The superiority of spatial feature fusion is demonstrated through ablation experiments. Finally, the e-commerce product dataset generated by the e-commerce platform is used to test the prediction effect of the proposed model. The experimental results demonstrate the effectiveness and superiority of the algorithm.
The data used to support the findings of this study are available from the corresponding author upon request.
The authors declare that they have no conflicts of interest.