An Approach for Demand Forecasting in Steel Industries Using Ensemble Learning

This paper aims to introduce a robust framework for forecasting demand, including data preprocessing, data transformation and standardization, feature selection, cross-validation


Introduction
Demand forecasting is the prediction of the future need for a product or service [1]. A company must follow a systematic procedure to obtain a clear picture of demand, so that it can read the pulse of its customers' needs and hold its position in the market. In recent years, the steel industry in Bangladesh has been a fast-growing industry in the local market.
The industries manage to manufacture a large amount of steel to supply both local and international markets, but producing a large amount of steel without proper forecasting causes various problems. Demand prediction supports many fundamental business decisions, including turnover, total revenue, income, capital consumption, risk assessment and mitigation plans, capacity planning, transportation and distribution plans, and more. Any misjudged estimate can lead to spoilage or scarcity of raw materials. It can also lead to overproduction or underproduction. All these cases erode the entire supply chain and total income, resulting in opportunity cost. Moreover, the entire industrial setup, such as the amount of raw material, labor, and space, depends on this demand. For these arrangements, time is also a crucial issue, as some processes have predefined deadlines that must be precisely synchronized. For a smart business strategy, the most important thing is to forecast demand precisely, but the industries do not have any intelligent method to measure demand accurately. They follow the time series of their sales data and often skip factors that significantly influence steel production, such as raw material supply, availability, and the number of workers at the factories.
Forecasting methods can be classified into three categories: (1) statistical methods, (2) artificial intelligence-based methods such as single machine learning (ML) methods, and (3) ensemble/hybrid methods. Most steel industries in Bangladesh use traditional statistical approaches. Statistical approaches, such as exponential smoothing [2], moving average [3], autoregressive moving average [4], and autoregressive integrated moving average [5], are most frequently used for time series prediction. The major drawback of these techniques is that the parameter values are fixed using statistical calculations. The estimation error increases when the fluctuations in the input data are high, and the techniques do not yield convincing results for complicated time series patterns [6]. Thus, the companies need an intelligent decision support system that considers several factors.
Several researchers report that, in most of the cases investigated, ML approaches have drawn much attention and can provide higher accuracy than traditional approaches [7]. Single artificial intelligence-based models, such as the support vector machine (SVM), extreme learning machine, heuristic techniques, and multilayer perceptron (MLP), are widely used in various industrial settings to predict demand because they demonstrate promising results in the areas of control, prediction, and pattern recognition [8][9][10]. Support vector regression (SVR) is popular for predicting future demand because of its outstanding generalization capability and its independence from the dimensionality of the input space [11]. It produces higher accuracy in agribusiness prediction [8] and supply chain demand forecasting [12]. Recently, MLP has been used for monthly water demand prediction [10], wind speed prediction [13], and water demand prediction [14]. To improve the prediction accuracy of MLP, different MLP architectures were used, and an optimization algorithm was used to tune its parameters [15]. The extreme learning machine (ELM) is another advanced model; it is a single-hidden-layer feed-forward neural network (SLFN) with high learning speed and fast convergence, making it efficient to train [16]. It is widely used in applications such as sales forecasting in fashion retailing [17] and sales prediction for the retail industry [18].
Since demand forecasting in steel industries is considerably challenging, it is difficult to solve this problem accurately using single ML models. No single model is ideally suited to every ML application. Each method and application domain has its own prerequisites, advantages, assumptions, and characteristics [19]. Generally, the performance of combined forecasting models is better than that of a single forecasting model [20]. The literature describes several strategies for enhancing the predictive performance of regression models, and one of these is the regression ensemble [8]. Regression ensemble theory is built on ML, and its roots are related to the concept of divide-and-conquer, which overcomes the limitations of ML models working in isolation [21]. An ensemble model is one in which numerous base models are constructed to address the same problem, with each model learning the dataset's feature attributes and making a prediction. The separate models' forecasts are then integrated to generate the final forecast. The simplest way to build an ensemble for a regression problem is to combine the base predictions by their mean or weighted average. Regression ensemble models construct a collection of models in order to improve the predictive power of the selected models for numerical target variables [22,23]. Ensemble methods are used in several studies, such as forecasting of energy consumption [24], agribusiness prediction [8], and wind power forecasting [25]. Although numerous frameworks have been established, there is always a need for improved forecasting accuracy and robustness, particularly in the steel industry.
This study proposes a new pipeline for demand forecasting in steel industries. To this end, it explores the capacity of predictive regression ensemble models by comparing the ensembles among themselves and against single reference models for forecasting demand.
The proposed pipeline includes data preprocessing, feature selection, hyperparameter tuning, cross-validation, and regression ensemble approaches to outperform the state-of-the-art results. Missing entries are filled with the mean value of the attribute rather than its median, since the imputed value then preserves the mean of the attribute distribution, which the median does not. The appropriate features are selected using feature selection algorithms (correlation-based, principal component analysis (PCA), and independent component analysis (ICA)) to avoid redundancy and model overfitting problems. Different single ML techniques, such as SVR, MLP, and ELM, are adopted as reference models. The ensemble bagging (RFR), boosting (GBR and XGBR), and stacking (STACK) models are used in our proposed framework to enhance demand forecasting robustness and efficiency.
The grid search technique with cross-validation is used to select the optimal hyperparameters for each ML model. Comprehensive experiments are conducted on different data preprocessing steps and combinations of ML techniques to minimize the RMSE and maximize the R² of the demand forecasting models. All experiments are carried out under the same experimental settings and with the same dataset. The remainder of the paper is arranged as follows: Section 2 describes a collection of related forecasting studies. Section 3 presents the suggested approach, dataset, feature selection methods, and assessment measures. The experimental findings are reported and interpreted in Section 4. Section 5 provides a conclusion as well as the scope for further development.

Related Works
Forecasting demand for industrial products is an urgent matter, since a large portion of a company's planning process is based on the amount of product to be produced. To meet increasing demand, precise demand forecasting is required. This section discusses work on demand forecasting in a variety of disciplines and describes several representative studies.
Ribeiro and dos Santos Coelho [8] proposed a system for agribusiness prediction using ensemble methods. Bagging, boosting, and stacking ensembles were used along with the single reference models SVR, MLP, and KNN. The experiment showed that ensemble methods performed better than single models. They obtained MAPEs of 0.9787 and 0.7394 for the two cases considered with the best ensemble models. They did not apply any metaheuristic algorithm for optimizing hyperparameters. Yu et al. [9] developed an ensembling and decomposition algorithm with EEML for crude oil price forecasting. The authors of Ref. [12] introduced a system that ensembles regression algorithms and time series algorithms to forecast supply chain demand. The system showed superior outcomes because it mitigated overestimation and underestimation. Cankurt [26] employed a variety of regression models, including M5P and M5-Rule model trees, bagging, boosting, randomization, stacking, and voting, to forecast tourism demand, obtaining an R of 0.986 and an RAE of 14.96. The bagging and boosting methods contributed significantly to improving the performance of regression tree models.
Yang et al. [27] developed a system for forecasting agricultural commodities using bagging and combination approaches with the heterogeneous autoregression (HAR) model. The HAR model with bagging and principal component combination shows outstanding performance for agricultural commodity forecasting. The authors of Ref. [28] introduced a system based on ensemble empirical mode decomposition (EEMD) to analyze global food price volatility. Tao et al. [29] proposed a method that combines ensemble empirical mode decomposition (EEMD), the extreme learning machine (ELM), and ARIMA for forecasting hog prices.
They obtained a best estimated accuracy of R = 0.848. Ribeiro et al. [30] designed nonlinear prediction models based on ensemble aggregation to improve the accuracy of electricity load forecasting. They used hourly load values from Italy in 2015 and from the Global Energy Forecasting Competition 2012 to validate their framework. Compared with the multilayer perceptron neural network (MPNN) and regression tree approaches, their wavelet ensemble forecasting framework provided better performance.
da Silva et al. [31] introduced a decomposition-ensemble learning strategy for multi-step-ahead very short-term forecasting, which involved aggregating many regression models. They employed a range of preprocessing strategies to account for the system's high degree of input correlation. Across all time horizons, the proposed models outperformed the CEEMD, STACK, and single models. The authors of Ref. [32] presented a rolling decomposition-ensemble model for gasoline forecasting that was both accurate and efficient. Their experimental results demonstrate that the model is both accurate and robust in projecting gasoline consumption levels and trends. A wind speed ensemble forecasting system (WSEFS) was developed by Liu et al. [33] to enhance point forecasting (PF) and interval forecasting (IF). They obtained MAPEs of 1.9322%, 2.1579%, and 2.2808% for the first, second, and third steps, respectively. The experimental results showed that the MOMA ensemble forecasting system is better than MOGWO and MODA. To estimate sediment movement in open channels, Ebtehaj and Bonakdari [35] developed an ELM-based approach. In all training and testing modes, the FFNN-ELM outperformed the previously used FFNN-BP and GP methods. For the testing mode, they found RMSE = 0.121 and MARE = 0.023.
Considering the existing literature summarized in Table 1, ensemble models contribute more to prediction accuracy than traditional models in each case. Although several frameworks have already been developed, there is still a need for improvement in the accuracy and robustness of demand forecasting, especially in the steel industry. To sum up, there is so far no complete pipeline that covers data preprocessing, feature selection, hyperparameter tuning, and, finally, the development of a regression ensemble method. This study uses bagging, boosting, and two-level stacking ensemble methods to analyze the time series of historical data from the steel industry and achieve more accurate demand forecasts. The steel industry follows the traditional time series trend to predict demand, which fluctuates strongly. To address this problem, this study combines multiple approaches instead of using a single traditional method to obtain precise results for the industry.

Materials and Methods
This section gives a concise description of the materials and methods used. The suggested framework is depicted in Figure 1.
The primary phases of the suggested framework are as follows: (i) collection of industrial environmental data as the primary inputs of the framework; (ii) preprocessing of the data, including filling the missing values, Yeo-Johnson transformation, and standardization; (iii) discarding the irrelevant and redundant features to avoid overfitting of the models; (iv) applying the grid search algorithm with cross-validation to tune the hyperparameters of each machine learning model; (v) development of a two-level stacking ensemble method, in which machine learning models with optimal hyperparameters are used as the base models; and (vi) evaluation of the proposed framework using performance metrics.
These blocks are explained in the following sections.

Data Collection.
The data were collected from a well-known steel company, Bangladesh Steel Re-Rolling Mills Ltd., in Chittagong, Bangladesh. During the industrial attachment, some raw data were procured from sources such as workers, production leaders, and human resources. Later, the data were compiled to build the dataset.
The dataset comprises 132 cases and six input features from January 2009 to December 2019 (11 years). The main task is to predict the demand for each month based on the other factors. The dataset holds the amount of raw material used in a month, availability, the number of workers, working days, and other attributes. The data were gathered from the company's monthly and annual industrial reports on its official website, such as financial reports and production reports, and cover other factors directly affecting production. Table 2 describes each feature and shows a statistical summary.

Data Preprocessing.
The data preprocessing stage comprises missing value imputation and power transformation of the data. The raw data contain missing values in various features, which must be filled before any ML technique is applied. Several imputation techniques can fill missing values. In our proposed method, mean-based imputation is used, in which a missing value is filled with the mean of the observed values of that feature.
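As a minimal sketch of the mean-based imputation described above, the following plain-Python function fills the missing entries of one feature column with the mean of its observed values. The function name `impute_mean` and the use of `None` or NaN to mark missing entries are assumptions made for illustration; they are not taken from the original implementation.

```python
import math


def impute_mean(column):
    """Fill missing entries (None or NaN) with the mean of the observed entries."""
    def is_missing(value):
        return value is None or math.isnan(value)

    observed = [v for v in column if not is_missing(v)]
    if not observed:
        raise ValueError("column has no observed values to average")
    mean = sum(observed) / len(observed)
    return [mean if is_missing(v) else v for v in column]


# Hypothetical monthly values of one feature with a gap in the second month.
filled = impute_mean([1.0, None, 3.0])
```

Because the filled value equals the mean of the observed entries, the column mean is unchanged by the imputation.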
After the imputation of missing or null values, the power transformation is performed. In regression analysis, transformations are crucial [36]. Power transformations are parametric, monotonic transformations used to make data more Gaussian-like.
This technique is useful in heteroscedasticity problems or other circumstances where data normality is required. Of the two most popular power transformations, Box-Cox and Yeo-Johnson, the Yeo-Johnson transformation is used here.

Figure 1: The proposed framework: preprocessing and feature selection, followed by Step 1 (all ensemble and reference models are trained by LOOCV, with hyperparameters tuned using the grid search algorithm during cross-validation), Step 2 (predictions 1 to M are obtained), Step 3 (the predictions from Step 2 are combined and used in layer 0 of the meta-learner; LASSO and SVR with the linear kernel are adopted as meta-learners and trained), and Step 4 (predictions for the test set are obtained from the meta-learner, and the performance measures R², MAE, MSE, RMSE, and MAPE are computed).

The Yeo-Johnson transformation is given by

$$
y^{*} =
\begin{cases}
\dfrac{(y+1)^{\lambda}-1}{\lambda}, & \lambda \neq 0,\ y \geq 0,\\[4pt]
\ln(y+1), & \lambda = 0,\ y \geq 0,\\[4pt]
-\dfrac{(-y+1)^{2-\lambda}-1}{2-\lambda}, & \lambda \neq 2,\ y < 0,\\[4pt]
-\ln(-y+1), & \lambda = 2,\ y < 0,
\end{cases}
$$

where y* is the transformed value, y is a list of n strictly positive numbers (so only the first two cases apply), and λ is a hyperparameter used to control the transformation. Here, the Scikit-learn implementation PowerTransformer(method="yeo-johnson", *, standardize=True) is used, which performs the Yeo-Johnson power transformation and implicitly standardizes the transformed output to zero mean and unit variance.
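The transformation and the subsequent standardization can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the Scikit-learn code itself: the helper names `yeo_johnson` and `standardize` are hypothetical, and λ is passed in by hand, whereas Scikit-learn's PowerTransformer estimates it from the data by maximum likelihood.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson transform of a single value y for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)
    return -(((-y + 1.0) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


def standardize(values):
    """Scale values to zero mean and unit (population) variance."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / std for v in values]


# Hypothetical feature values; lambda = 0.5 is an arbitrary illustrative choice.
raw = [1.0, 4.0, 9.0, 16.0]
transformed = standardize([yeo_johnson(v, 0.5) for v in raw])
```

With λ = 1 the transform leaves values unchanged, which is a convenient sanity check for any implementation.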

Feature Selection.
Feature selection or reduction removes irrelevant, redundant, or only partially important features that might mislead the model's predictions, since the accuracy of an ML model depends on the features on which it has been trained. Removing redundant features reduces the chance of overfitting and lessens the model's complexity. Several feature selection and reduction techniques exist. In our proposed method, PCA, ICA [36], and correlation-based feature selection algorithms were used to discard irrelevant features.
PCA is frequently employed for this purpose due to its adaptability and ease of implementation. PCA projects the data onto an orthogonal space such that the eigenvectors corresponding to the largest eigenvalues preserve the maximum data variance. PCA thus focuses on the covariance matrix and second-order statistics. ICA decomposes the observed data linearly into statistically independent components. The correlation-based method ranks features using a heuristic evaluation function that takes into account the correlation between the features and the target outcome. The design of both PCA and ICA follows the default Scikit-learn implementation except for the n_components parameter, which represents the number of features to be chosen by the respective algorithm; its value is obtained from hyperparameter tuning. For example, the design of PCA is (n_components, copy, whiten, svd_solver, tol, iterated_power) = ({4, 5, 6}, True, False, auto, 0.0, auto). Algorithms 1-3 summarize the procedures of the PCA, ICA, and correlation-based feature selection algorithms, respectively.

The correct combination of hyperparameter values is important for achieving the best model. Choosing the correct values for the optimal model is known as hyperparameter optimization or hyperparameter tuning [38]. Grid search and random search are both well-known techniques for tuning the hyperparameters of an estimator. This study used the grid search method based on cross-validation, which resulted in the most precise predictions [39]. This algorithm splits the range of each parameter to be optimized into a grid and evaluates all grid points to obtain the optimal parameters. Different parameter combinations were evaluated for each model, with the data divided into training and test sets using the cross-validation method [39]. Table 3 provides an overview of the hyperparameters tuned for each ML technique and their tuning ranges.
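The exhaustive evaluation of all grid points can be sketched as below. This is a minimal illustration, not the study's implementation: `grid_search` and `toy_cv_error` are hypothetical names, the parameter `C` and its values are invented for the example (only the candidate values {4, 5, 6} for `n_components` come from the text), and the error function is a stand-in for a cross-validated RMSE that a real pipeline would compute by fitting the model on each fold.

```python
from itertools import product


def grid_search(param_grid, cv_error):
    """Evaluate every combination in param_grid and return the one with the lowest error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = cv_error(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def toy_cv_error(params):
    """Stand-in for a cross-validated error; smallest at n_components=5, C=10."""
    return (params["n_components"] - 5) ** 2 + (params["C"] - 10.0) ** 2


grid = {"n_components": [4, 5, 6], "C": [0.1, 1.0, 10.0, 100.0]}
best_params, best_error = grid_search(grid, toy_cv_error)
```

The cost grows with the product of the grid sizes (here 3 × 4 = 12 evaluations), which is why the tuning ranges are usually kept small.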

Cross-Validation in Time Series.
Cross-validation is a widely used validation approach for tuning hyperparameters and assessing the effectiveness of machine learning techniques [40]. Different parameters must be specified for each case depending on the dataset. A grid search technique combined with cross-validation is effective at identifying the optimal hyperparameter combination for each model. As a consequence, forecasting errors on test samples may be decreased, allowing the determination of the set of hyperparameters that enhances predictive performance while minimizing model overfitting [41]. The leave-one-out cross-validation procedure is acceptable in this scenario when dealing with time series data [42]. Alternatively, this method can be considered a sequential block cross-validation procedure and a special case of K-fold cross-validation. Thus, the training set is constructed iteratively, with the training and validation sets advancing together, a process known as rolling cross-validation. This procedure is performed several times, with each iteration increasing the number of observations in the training set and decreasing the number remaining for validation. The associated training set comprises only observations that occurred before the observation in the test set.
The dataset is partitioned into training and test sets, with 70% of the data used for training and validating the models. The idea of the time series split is to divide the training set into two parts at each iteration, with the validation part always ahead of the training part in time. The model is initially trained on a small subset of the data in order to forecast the next data point. Following that, the forecasted data points are incorporated into the succeeding training set, and subsequent data points are forecasted. This process is repeated until the complete training set has been used. The training result is calculated by aggregating the performance estimates of the iterations.
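The rolling scheme described above can be sketched as an expanding-window split in which each validation point lies strictly after its training window. The generator name `expanding_window_splits` and the initial window of 12 months are assumptions for illustration; the source does not state the size of the first training subset.

```python
def expanding_window_splits(n_samples, min_train):
    """Yield (train_indices, validation_indices) with the training window growing by one each step."""
    if not 0 < min_train < n_samples:
        raise ValueError("min_train must be between 1 and n_samples - 1")
    for t in range(min_train, n_samples):
        # Train on observations 0..t-1, validate on observation t only.
        yield list(range(t)), [t]


# 90 training months as in the paper; the 12-month initial window is hypothetical.
splits = list(expanding_window_splits(n_samples=90, min_train=12))
first_train, first_val = splits[0]
```

Each validated point joins the training window of the next iteration, so no future observation is ever used to predict the past.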

Structure of Stacked Ensemble Modeling.
STACK modeling was conducted in two stages, level 0 and level 1, in which the predictions of the base learners (level 0) are combined by the meta-learner (level 1). Previous studies used support vector regression (SVR) and least absolute shrinkage and selection operator (LASSO) regression as the meta-learner [8,25]. The key advantage of adopting SVR as layer 1 in the STACK technique is its ability to identify predictor nonlinearities and subsequently exploit them to improve demand forecasts [8]. SVR with a linear kernel and the LASSO regression model were used as meta-learners in our experiment (level 1). The steps adopted in this work are summarized in Figure 1.

ALGORITHM 1: Steps for the implementation of principal component analysis (PCA).
Input: m-dimensional input data matrix $X \in \mathbb{R}^{m}$ with number of samples $N$, and variance threshold $T_{var}$
Output: reduced L-dimensional data matrix $Y \in \mathbb{R}^{L}$, $L < m$
Load $X \in \mathbb{R}^{m}$ and calculate the mean of each feature, $\mu_j = \frac{1}{N}\sum_{i=1}^{N} X_{ij}$ for $j = 1, 2, \ldots, m$;
Subtract the mean from each corresponding dimension, $X'_{ij} = X_{ij} - \mu_j$ for $j = 1, 2, \ldots, m$ and $i = 1, 2, \ldots, N$;
/* Make each signal uncorrelated to each other */
Calculate the covariance matrix of $X'$, $\Sigma_{m \times m} = \frac{1}{N-1} X'^{T} X'$;
Solve $\Sigma_{m \times m} = V^{-1} D V$, where $V$ is the matrix of eigenvectors and $D_{m \times m}$ is the diagonal matrix containing the eigenvalues on its diagonal;
Sort the eigenvectors in $V$ in descending order of eigenvalue, keep the first $L$ eigenvectors that have variance $\geq T_{var}$, and form a projection matrix $P_{m \times L}$;
Finally, project onto the PCA space, $Y = P^{T} X'$;

ALGORITHM 3: Steps for the implementation of correlation-based feature selection (Corr).
Input: m-dimensional input data matrix $X \in \mathbb{R}^{m}$ with number of samples $N$, and expected outcome $O$
…
Sort the correlations $r_{pO}$ in descending order and choose the first $L$ features for $Y \in \mathbb{R}^{L}$;

In the performance measures, $\bar{y} = \frac{1}{T}\sum_{t=1}^{T} y_t$ denotes the mean of the observed values; in this paper, the training set $t = 1, \ldots, 90$ and the test set $t = 91, \ldots, 132$ are adopted. Along with the performance measures mentioned above, several statistical tests [43,44] are performed in this study to support the superiority of the proposed approach. The Friedman test is used to examine whether the absolute percentage errors (APE) of the models differ statistically significantly. Once statistical significance has been established, nonparametric post hoc tests, such as the Wilcoxon signed-rank test (lower tail), can be employed to assess whether the APE of one model is lower than that of another [44,45]. The null hypothesis of the Wilcoxon test states that there is no difference in APE between models 1 and 2, whereas the alternative hypothesis states that model 1 has a lower APE than model 2.
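The two-level stacking structure and the performance measures named in this section can be sketched together as follows. This is a simplified illustration under explicit assumptions, not the study's implementation: ridge regressions with different penalties stand in for the paper's base learners, and a ridge meta-learner stands in for LASSO and linear-kernel SVR, so that the sketch depends only on NumPy. All function names are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, alpha):
    """Closed-form ridge regression; the intercept column is not penalised."""
    A = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(A.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(A.T @ A + penalty, A.T @ y)


def predict(weights, X):
    return np.column_stack([np.ones(len(X)), X]) @ weights


def stack_predict(X_train, y_train, X_test, base_alphas=(0.01, 1.0, 100.0), meta_alpha=0.1):
    n = len(X_train)
    # Level 0: leave-one-out predictions of every base learner on the training set.
    level0_train = np.zeros((n, len(base_alphas)))
    for i in range(n):
        keep = np.arange(n) != i
        for j, alpha in enumerate(base_alphas):
            w = fit_ridge(X_train[keep], y_train[keep], alpha)
            level0_train[i, j] = predict(w, X_train[i:i + 1])[0]
    # Refit each base learner on the full training set to predict the test set.
    level0_test = np.column_stack(
        [predict(fit_ridge(X_train, y_train, a), X_test) for a in base_alphas]
    )
    # Level 1: the meta-learner is trained on the level-0 predictions.
    meta_w = fit_ridge(level0_train, y_train, meta_alpha)
    return predict(meta_w, level0_test)


def rmse(y, y_hat):
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))


def mae(y, y_hat):
    return float(np.mean(np.abs(y - y_hat)))


def mape(y, y_hat):
    return float(np.mean(np.abs((y - y_hat) / y)) * 100.0)


def r2(y, y_hat):
    return float(1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - np.mean(y)) ** 2))
```

Training the meta-learner on held-out level-0 predictions, rather than on in-sample fits, keeps it from simply rewarding the base learner that overfits the training set most.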

Experimental Results and Discussion
The exploratory analysis of the steel industry data used in this study is presented in Section 4.2. The performance of the adopted models and the statistical tests on the test set errors are described in Section 4.3. Tables S1 and S2 report the performance measures of the 56 generated models.

Experimental Setup.
A single computer (Asus X556U with an Intel® Core(TM) i5-7200U central processing unit running at 2.50 GHz, 8.0 GB of random access memory, and an Nvidia GeForce 940MX graphics card) running the Windows 10 operating system was used to produce the results reported in Section 4. The machine learning approaches and ensemble methods were implemented in the Python 3.6 programming language with the Spyder environment, which is included in Anaconda.

Exploratory Analysis.
Correlation analysis is a statistical approach used to determine the relationship between two numerical variables. From an ML viewpoint, it indicates how the features relate to the outcome. However, it is challenging to identify how features are interconnected. Data visualization can help determine how individual features might correlate with the outcome. Pearson's correlation coefficient is used to quantify the relationship between two variables. It lies in the range of −1 to +1, where 0 means that there is no linear correlation at all, +1 indicates a perfect positive correlation, and −1 indicates a perfect negative correlation. The correlation matrix of the explanatory variables, computed after the Yeo-Johnson transformation has been applied to the training set, is shown in Figure 2. The color scale of the association is shown on the right-hand side of Figure 2. A light color indicates a correlation close to 0, whereas an intense color indicates a correlation close to +1 or −1. The indicators (F1, F2, and F3) and the response variable (Demand) are highly positively correlated. Thus, an increase or decrease in the value of one tends to be accompanied by an increase or decrease in those that are highly correlated with it. However, indicator F5 is negatively correlated with the outcome (Demand), indicating that if the number of holidays in a month increases, demand decreases, and vice versa.
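As an illustration of how Pearson's coefficient can be used to relate features to the outcome, the sketch below computes the coefficient in plain Python and ranks features by the strength of their correlation with the target. The names `pearson` and `rank_features` and the toy data are hypothetical, and ranking by absolute value is an assumption: the source states only that correlations are sorted in descending order.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)


def rank_features(features, target, top_k):
    """Return the top_k feature names by |correlation with target|, plus all scores."""
    scores = {name: pearson(column, target) for name, column in features.items()}
    ranked = sorted(scores, key=lambda name: abs(scores[name]), reverse=True)
    return ranked[:top_k], scores


# Toy columns only; the feature labels mimic the paper's naming but the values are invented.
toy_features = {"F1": [1, 2, 3, 4, 5], "F5": [5, 4, 3, 2, 2], "F4": [2, 1, 2, 1, 2]}
toy_demand = [10, 20, 30, 40, 50]
selected, scores = rank_features(toy_features, toy_demand, top_k=2)
```

Ranking by absolute value keeps strongly negative predictors in play; a signed ranking would place them last.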

Evaluation of Proposed Models.
In this study, the proposed models are trained using the set of optimal hyperparameters, found by grid search, that achieves the maximum predictive performance of each model. The steel production data from January 2009 to December 2019, covering 132 months, are used as the training and testing sets. Table 3 presents an overview of the hyperparameters tuned for each ML model, their explanation, and their tuning ranges. Table 4 reports the results used to select the best-performing preprocessing, number of selected features, and ML models, where R² with its standard deviation is stated for comparison. Table 5 summarizes each model's highest R² obtained with the suggested pipeline, along with the optimal preprocessing and feature selection algorithms and the number of selected features. In addition, Table 5 lists the best hyperparameters found by the grid search. The analysis of Table 4 reveals that the models produce superior outcomes when suitable preprocessing is used. The different architectures of the MLP model are shown in Table 6. Table 7 summarizes the performance metrics used to evaluate each model, which include R², MAE, RMSE, and MAPE. Each model achieves its best results when missing value imputation, the Yeo-Johnson transformation, and data normalization are combined with either correlation-based or PCA-based feature selection (Tables 4 and 5). For SVR, the estimated accuracy of R² = 0.931 is obtained with preprocessed data and correlation-based feature selection.
Comprehensive experiments were performed on the same dataset to find the best architecture for the MLP model. Eight separate MLP models (Table 6) were implemented and evaluated, with 1-7 hidden layers, where the number of neurons in each layer served as a hyperparameter. The experimental results in Figure 3 indicate that the optimal architecture is the MLP layout with M = 4 hidden layers (H1, H2, H3, and H4) and N1 = 12, N2 = 12, N3 = 12, and N4 = 8 neurons. In addition, with few samples, as in the steel dataset, adding further hidden layers limits the MLP model's capability (Figure 3). Because of the limited data, a deeper MLP model could overfit and suffer from vanishing gradient problems. Table 3 lists the optimal hyperparameters of the best MLP model. The models used the ReLU activation function and the Adam solver. The best model was trained for 200 epochs with a constant learning rate, batch size, and regularization parameter of 0.01, 32, and 0.1, respectively. To reduce overfitting, a dropout layer was used, randomly dropping 60% of the neurons. The highest accuracy (R²) of the MLP model is 0.961, obtained with data preprocessing and PCA-based feature selection. Similarly, the ELM model obtained its best result with eight neurons in the hidden layer. Table 3 lists the optimal hyperparameters of the best ELM model. The model used ReLU as the transformation function of the hidden layer neurons, and the optimal regularization parameter was 0.001. The best estimated accuracy (R²) of the ELM model, obtained with preprocessed data and correlation-based feature selection, is 0.942.
Feature selection methods (correlation-based, PCA, and ICA) are used to improve the overall performance of each model. PCA reduces a higher-dimensional space to a lower-dimensional space by selecting the orthogonal projections with the highest variance.
ICA theory implies that data are only partly independent if their variances across features are larger than their covariance. The number of components used has a significant impact on PCA performance. Because ICA-based feature selection searches for new, mutually independent components, the correlation with the desired output may be lost in the process. Since both PCA and ICA create new components in an unsupervised manner, improved performance on the steel dataset cannot be guaranteed. Correlation-based feature selection, on the other hand, takes into account the relationship between the features and the outcome in order to discover the most closely related features. As shown in Table 4, the majority of models perform better when four features, F1, F2, F3, and F6, are used. These four features were chosen using the correlation-based feature selection technique.
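The PCA reduction discussed above, which keeps the orthogonal projections with the highest variance, can be sketched with NumPy as follows. This is an illustrative eigendecomposition of the covariance matrix, not the Scikit-learn implementation used in the study; the function name `pca_reduce` is hypothetical, and the number of components is passed directly rather than derived from a variance threshold.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project X (samples x features) onto its n_components highest-variance directions."""
    X_centered = X - X.mean(axis=0)
    covariance = X_centered.T @ X_centered / (len(X) - 1)
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)
    order = np.argsort(eigenvalues)[::-1]
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    projection = eigenvectors[:, :n_components]
    explained_ratio = eigenvalues[:n_components] / eigenvalues.sum()
    return X_centered @ projection, explained_ratio


# Synthetic stand-in for a 132 x 6 feature matrix; the values are random, not the steel data.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(132, 6)) @ rng.normal(size=(6, 6))
Y_demo, explained = pca_reduce(X_demo, n_components=4)
```

The resulting components are mutually uncorrelated by construction, but nothing in the procedure looks at the target, which is the limitation noted in the text.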
Further improvement in demand forecasting was obtained using regression ensemble models. Bagging (RFR), boosting (GBR and XGBR), and stacking (STACK) regression ensembles were adopted to improve the performance of demand forecasting. Table 5 presents the performance evaluation of the adopted models. The results are sorted by R² in ascending order for the test set. The best models show the lowest RMSE and highest R² on the test set. RFR is an ensemble learner built on unpruned decision trees, and it reduces the effects of overfitting by combining multiple trees. Table 5 shows the optimal hyperparameters for the RFR model. The best estimated accuracy (R²) of the RFR model is 0.966, obtained with preprocessed data and PCA-based feature selection. The RFR model performs better than SVR, MLP, and ELM in terms of RMSE, that is, it has lower RMSE values. GBR and XGBR are also used to increase the accuracy of the forecasts. Extreme gradient boosting is a specific variant of the gradient boosting strategy that finds the tree model using a more exact approximation than the conventional gradient boosting method. The best estimated accuracy (R²) of the GBR model is 0.969, obtained with preprocessed data and correlation-based feature selection. XGBR can further reduce the loss through its extreme gradient boosting capability. The highest accuracy (R²) of XGBR is 0.974, and its lowest RMSE is 0.151. The RMSE of XGBR is significantly lower than those of the reference models, RFR, and GBR. The best result of XGBR is obtained when the minimum child weight is less than 4 and the subsample ratio used to construct a tree is 0.7.
Finally, the stacking ensemble method is used to integrate multiple base models in order to reduce prediction errors as far as possible. According to the results from the test set, level 0 of the STACK 1 method is   7, based on the findings of the test phase, the approaches based on ensemble learning produced results that were consistent with the objective of minimizing error. Figure 4 illustrates the violin plot of the APE distribution of each model used to produce predictions for the test set. The mean APE is shown by the white dot in the center of each violin. Compared with the other models, the ensemble-based techniques substantially lower the APE. In this way, we can show that a model (for the test set) with lower metric values in Table 7 has a more stable APE and less volatility than a model with higher metric values. The Friedman test established that the APEs of the adopted models differ in the test set (χ²₇ = 72.1875, p-value < 0.05). This implies that there exist models with observed APE values that are equal to or less than those of the others. Given this statistically significant difference, Table 8 reports the results of the Wilcoxon signed-rank test (lower tail), which measures the APE reduction of the assessed models in the test set.
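The Friedman statistic reported above can be illustrated with a minimal sketch that ranks the models' APEs within each test observation and compares the rank sums. The sketch assumes the standard formula without tie correction; the three models and their predictions are made-up toy values, and the function names are hypothetical.

```python
def ape(actual, predicted):
    """Absolute percentage error of each observation, in percent."""
    return [abs((a - p) / a) * 100.0 for a, p in zip(actual, predicted)]

def average_ranks(values):
    """Rank values from 1 (smallest); tied values share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0  # mean of positions i..j, 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks

def friedman_statistic(errors_by_model):
    """Friedman chi-square (no tie correction) and its degrees of freedom."""
    models = list(errors_by_model)
    k = len(models)                       # number of models
    n = len(errors_by_model[models[0]])   # number of test observations
    rank_sums = [0.0] * k
    for row in range(n):
        ranks = average_ranks([errors_by_model[m][row] for m in models])
        for col, r in enumerate(ranks):
            rank_sums[col] += r
    chi2 = 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)
    return chi2, k - 1

# Toy data: three hypothetical models evaluated on four test observations.
actual = [100.0, 200.0, 400.0, 500.0]
errors = {
    "A": ape(actual, [90.0, 180.0, 360.0, 450.0]),
    "B": ape(actual, [95.0, 190.0, 380.0, 475.0]),
    "C": ape(actual, [99.0, 198.0, 396.0, 495.0]),
}
chi2, dof = friedman_statistic(errors)
print(chi2, dof)
```

The statistic is compared with a chi-square distribution with k - 1 degrees of freedom, which is why eight models give the subscript 7 in the reported value. In practice, a library routine such as SciPy's `scipy.stats.friedmanchisquare` would normally be used, since it also returns the p-value.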
At the 5% level of significance, the APE of the STACK 1 model is lower than the APEs of the RFR, MLP, ELM, and SVR models, as shown in Table 8. Compared with the remaining models, the errors of the STACK 1 model are statistically equivalent at the 5% level. Similarly, Table 8 reveals that the APE of the STACK 2 model is lower than the APEs of the RFR, MLP, ELM, and SVR models at the 5% level of significance, while its errors are statistically equivalent to those of the remaining models. This highlights the advantages of the proposed stacking ensemble models. Ensemble-based models, on average, have a lower APE than ELM and SVR. As a result, the ability of this approach to learn the data is reflected in smaller estimation errors and lower variance for the ensemble methods than for the others, which supports the validity of this methodology.

and feature selection procedures are critical. The proposed preprocessing scheme improves the quality of the raw dataset, with filling in missing values and data standardization as the main concerns. The Yeo-Johnson transformation is applied to the features and response variables. While PCA and ICA focus solely on interfeature redundancy, correlation-based feature selection might improve interfeature correlation. Hyperparameters are tuned with a grid search algorithm to find the optimal hyperparameter set for each ML technique.
The best-performing models are combined to form level 0. SVR with a linear kernel and LASSO regression are adopted as meta-learners in level 1.
The Friedman and Wilcoxon signed-rank tests (lower tail) are used to validate the differences in the models' APEs. Based on the findings, two models may be used to forecast demand one month ahead: STACK 1 (ELM + GBR + XGBR-SVR) and STACK 2 (ELM + GBR + XGBR-LASSO). The test set results demonstrate that ensemble approaches, notably the STACK models, outperform single models in forecasting demand in the steel industry.
Future work will (i) develop other ensemble techniques and integrate other ML regression techniques into the ensemble; (ii) include other influential variables such as occasions and political factors; (iii) collect more data, as only 132 months of production data are used in this study; and (iv) extend the approach to other industrial fields to evaluate its generality and flexibility in predicting several types of demand [36,46,47].
step, the dataset is collected. The inputs are industry-related features obtained from the steel industry, and the output is the demand quantity. Next, Yeo-Johnson preprocessing is adopted.
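The Yeo-Johnson transformation followed by standardization can be sketched as follows. The piecewise formula is the standard definition of the transformation; the value of the parameter `lam` is fixed by hand here purely for illustration (in practice it is usually estimated from the data, for example by maximum likelihood), and the demand values are made up.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value y with parameter lam."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                      # lam == 0
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                        # lam == 2
    return -(((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam))

def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

# Toy monthly demand values (hypothetical) and a hand-picked lam.
demand = [120.0, 150.0, 90.0, 300.0, 60.0]
lam = 0.5  # assumption for illustration; normally estimated from the data
transformed = [yeo_johnson(v, lam) for v in demand]
z = standardize(transformed)
print(z)
```

With `lam = 1` the transformation leaves the data unchanged, which is a convenient sanity check; values of `lam` below 1 compress large observations and reduce right skew.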

Figure 1: Proposed architecture of automatic demand forecasting.

3.4. Hyperparameters Determination. Hyperparameters are the values that directly control the learning process of ML techniques; they are set by the user before the training phase starts.
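A grid search with leave-one-out cross-validation can be sketched as below. The one-feature ridge regressor and its penalty `alpha` are hypothetical stand-ins chosen to keep the example short; they are not the models tuned in this study, and the data are noise-free toy values.

```python
def fit_ridge_1d(xs, ys, alpha):
    """Fit y = a + b*x with an L2 penalty alpha on the slope b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / (sxx + alpha)
    a = my - b * mx
    return a, b

def loocv_mse(xs, ys, alpha):
    """Mean squared error under leave-one-out cross-validation."""
    total = 0.0
    for i in range(len(xs)):
        # Hold out observation i and train on the rest.
        a, b = fit_ridge_1d(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:], alpha)
        total += (ys[i] - (a + b * xs[i])) ** 2
    return total / len(xs)

def grid_search(xs, ys, grid):
    """Evaluate every candidate value and return the best one with all scores."""
    scores = {alpha: loocv_mse(xs, ys, alpha) for alpha in grid}
    best = min(scores, key=scores.get)
    return best, scores

# Noise-free toy data y = 2x + 1, so no shrinkage (alpha = 0) is optimal here.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
best_alpha, scores = grid_search(xs, ys, grid=[0.0, 0.1, 1.0, 10.0])
print(best_alpha, scores)
```

The same loop structure applies to any model: the grid enumerates candidate hyperparameter sets, and the cross-validated error decides between them without touching the test set.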

Figure 2: Correlation matrix for all influence features corresponding to the demand.

Figure 3: Performance of several MLP architectures, used to pick the optimal one with the maximum accuracy (R²); the best corresponding models are presented in Table 6.

Figure 4: Violin plot to represent the APE of the models.

Table 1: Summary of the most recent works on demand forecasting in various fields, with their input factors and performance.
number of samples N, and variance threshold T_var
Output: reduced L-dimensional data matrix Y ∈ R^L, L < m
Select a nonquadratic nonlinear function g;
Initialize W as X = WH, where W ← ratio of sources during mixing, H ← matrix containing the different components, and X ← mixed
Derive the new dataset by taking Y = W^T X, where Y ∈ R^L;
ALGORITHM 2: Steps for the implementation of independent component analysis (ICA).
Estimating a model's accuracy is crucial in designing ML models, as it defines how well the model predicts. It is used to determine the goodness of fit between models and data and to compare various models for model selection. If y_1, y_2, ..., y_T are the T actual values and y

Table 3: Different machine learning techniques with the hyperparameters to be tuned by the grid search algorithm during cross-validation.
set is the set of optimal hyperparameters for each base regression model, M is the number of base models T.
Output: final forecast demand level Ŷ_f and performance indices.
Step 1: learn first-level base regression models;
/* Loop to train and evaluate the first-level individual regressors */
for t ← 1 to T do
    Divide the dataset D into D_train and D_test; /* 70% data for training and validation, 30% for test set */
    /* Leave-One-Out Cross-Validation */
    for i ← 1 to K (K ← size of D_train) do
′ = {h_1, h_2, ..., h_t}, h_t ← output of the i-th model, l ← number of base models;
Step 3: learn second-level regressor model;
/* Loop to train and evaluate the final-level meta-regressor model */
for j ← 1 to K (K ← size of D_train) do
    Predict the demand level for H_meta with D′_test are used for the prediction and performance measure (P_Hmeta) using H_meta
return P_Hmeta;
ALGORITHM 4: Demand forecasting using Stacking Ensemble techniques using cross-validation.
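The two-level structure of the stacking procedure can be sketched in a few lines. This is only an illustration under strong simplifying assumptions: the two base learners (a one-feature linear regressor and a nearest-neighbour regressor) and the single-weight linear blend used as meta-learner are hypothetical stand-ins for the ELM, GBR, and XGBR base models and the SVR or LASSO meta-learners, and the data are toy values.

```python
def fit_linear(xs, ys):
    """Level-0 stand-in 1: ordinary least squares on one feature."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return lambda x: a + b * x

def fit_nearest(xs, ys):
    """Level-0 stand-in 2: 1-nearest-neighbour regressor."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def out_of_fold(fit, xs, ys):
    """Leave-one-out predictions of a base learner on the training set."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds

def fit_blend(h1, h2, ys):
    """Level-1 stand-in: least-squares weight w for w*h1 + (1 - w)*h2."""
    num = sum((y - b) * (a - b) for a, b, y in zip(h1, h2, ys))
    den = sum((a - b) ** 2 for a, b in zip(h1, h2))
    return num / den if den else 0.5

def stack(train_x, train_y, test_x):
    # Level 0: out-of-fold predictions form the meta-learner's training data.
    h1 = out_of_fold(fit_linear, train_x, train_y)
    h2 = out_of_fold(fit_nearest, train_x, train_y)
    # Level 1: learn how to combine the base predictions.
    w = fit_blend(h1, h2, train_y)
    # Refit the base learners on all training data, then predict the test set.
    m1, m2 = fit_linear(train_x, train_y), fit_nearest(train_x, train_y)
    return [w * m1(x) + (1.0 - w) * m2(x) for x in test_x], w

# Toy series y = 3x + 2 with a 70/30 chronological split.
xs = [float(i) for i in range(1, 11)]
ys = [3.0 * x + 2.0 for x in xs]
predictions, w = stack(xs[:7], ys[:7], xs[7:])
print(predictions, w)
```

Training the meta-learner on out-of-fold predictions, rather than on predictions for data the base learners have already seen, is what keeps the second level from simply rewarding the base model that overfits most. On this noise-free toy series the blend puts essentially all of its weight on the linear learner.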

Table 4: Summary of all extensive experiments to select the best-performing preprocessing and feature selection methods with the number of features and regression models.
Model 35, on the other hand, is selected for the STACK 1 technique because its complexity is lower than that of the other configurations and it has the lowest MAPE. Similarly, the models numbered 33, 35, 50, and 56 in Table A2 exhibit the same level of performance (R²). For the STACK 2 technique, model 35 is also picked because its complexity is lower than that of the other configurations and it has the lowest MAPE of all the models tested. The best-
* N/A = none. Note: the best approaches are shown in bold type.

Table 7: Comparison of the stacking ensemble models with the best-performing ML models.