Prediction Interval Construction for Byproduct Gas Flow Forecasting Using Optimized Twin Extreme Learning Machine

Prediction of byproduct gas flow is of great significance to gas system scheduling in iron and steel plants. To quantify the associated prediction uncertainty, a two-step approach based on optimized twin extreme learning machine (ELM) is proposed to construct prediction intervals (PIs). In the first step, the connectionweights of the twin ELMare pretrained using a pair of symmetricweighted objective functions. In the second step, outputweights of the twin ELMare further optimized by particle swarmoptimization (PSO). The objective function is designed to comprehensively evaluate PIs based on their coverage probability, width, and deviation. The capability of the proposed method is validated using four benchmark datasets and two real-world byproduct gas datasets. The results demonstrate that the proposed approach constructs higher quality prediction intervals than the other three conventional methods.


Introduction
In the iron and steel industry, the utilization of byproduct gas is of great significance to reduce the cost of fuel consumption and greenhouse gas emissions [1,2].The byproduct gas is generated from the iron and steel making progress and supplied to many plants (e.g., hot-rolling and power plant) via gasholders to be used as a fuel.However, since gas holders are limited in capacity, temporary excess or shortage of byproduct gas usually occurs.To solve this problem, the byproduct gas users need to be scheduled in advanced according to prediction information.Therefore, accurate forecasting for byproduct gas flow generation and consumption has become a meaningful tool to ensure the reliability and economy of energy system control and scheduling [3,4], and this forecasting problem has attracted a large amount of interest in both industry and academia [5][6][7][8][9][10].
In recent years, artificial intelligence methods such as neural networks (NNs) [6][7][8] and support vector regression (SVR) [9,10] have been adopted for handling byproduct gas forecasting.The outstanding advantage of artificial intelligence is that it can address nonlinear problems.
However, most of these models focus on point forecasting, which provides only deterministic forecasts with no information about the prediction probability.In practice, there is a large amount of uncertainty originating from temperature fluctuations and measurement errors, for example.The forecasting uncertainty will affect the decisionmaking process and increase the risk of scheduling, so it is imperative to quantify the uncertainty of prediction.The prediction interval (PI) is a well-known tool for quantifying the uncertainty of prediction.The PI provides not only a range within which the target values are highly likely to lie but also an indication of their accuracy [11,12].
Although there have been only a few studies on byproduct gas flow interval forecasting, the construction of PIs in other application areas [13][14][15][16][17] has been studied for many years.Traditional approaches, such as the Bayesian method [14], mean-variance estimation (MVE) method [15], and bootstrap method [16], have been proposed for constructing PIs.The central idea of the Bayesian method is that the model weights are considered random variables, and the probability density distribution of the weights is estimated by the recursive Bayesian method.Despite the strength of the supporting theories, the Bayesian method lacks proper adaptability to complex noise (e.g., heteroscedastic noise) because the model weights are assumed to be in accordance with an isotropic normal distribution [17].The MVE method first generates a point prediction; then, a model is established to estimate the variance.The underlying assumption of this method is that the point prediction is equal to the true mean of the targets, but this condition is difficult to meet [18].The bootstrap method is a widely used method for the construction of PIs.An ensemble of models is established to produce a less biased estimation of the point and variance prediction.The main disadvantage of this method is the high computational cost [18].In recent years, a direct interval forecasting method called lower and upper bound estimation (LUBE) was proposed [19].The main idea behind the method is to construct lower and upper bounds of PIs directly by optimizing the coefficients of the NN according to the interval quality evaluation indices.Because it offers good performance and does not require strict data distribution assumptions, the LUBE method has been widely used in many real-world problems, such as bus travel time prediction [20], electrical load forecasting [21,22], and Landslide displacement prediction [23].However, conventional NNs employed in the LUBE method suffer from the problems of overtraining and a high computational burden.Alternatively, the extreme learning machine (ELM) [24] is a new kind of feed-forward neural network with a single hidden layer.The output weights are the only coefficients that need to be trained.Thus, ELM exhibits a faster learning rate and a better generalization capability than conventional NNs [25].
However, when using NNs to perform interval forecasting, the initial values of the connection weights are usually generated randomly [19][20][21][22].The effects of connection weight initialization on the final constructed PIs are usually ignored.In contrast to conventional NNs, ELM trains only the output weights instead of training all of the connection weights.However, these output weights still constitute a relatively large search space.Furthermore, the cost function of LUBE is the coverage width-based criterion (CWC), which cannot comprehensively describe the performance of the constructed PIs.Considering the shortcomings of the LUBE method, we propose a new two-step method based on the optimal twin ELM for constructing the PIs.Specifically, in the first stage, the twin ELM is pretrained using a pair of symmetric weighted objective functions to construct the raw PIs.Then, the values of the input weights and bias are fixed, and only the output weights are further adjusted by PSO.Furthermore, a new cost function called coverage width and deviationbased criterion (CWDC) is proposed.In the CWDC, in addition to the coverage probability and the width of the PIs, the deviation of the PIs is also considered, which gives a more comprehensive description of PI performance.The main contributions of this paper are as follows: (i) A twin ELM is adopted to construct PIs and the output weights of the twin ELM are pretrained using a pair of symmetric weighted objective functions.The pretraining method offers reasonable initial values and, as a characteristic of the ELM, only the output weights need to be tuned, which benefits the subsequent optimization process.
(ii) A modified cost function called CWDC that considers the deviation of the PIs is proposed.CWDC provides a more comprehensive description of the PI performance.
(iii) Experiments based on four benchmark datasets and two real-word byproduct gas flow datasets are performed to illustrate the capability of the proposed approach.
The remainder of this paper is organized as follows.Section 2 introduces the related work, namely, generic ELM, formulation, and performance indices of PI.Section 3 presents the method proposed for PI construction.Performance evaluation on benchmark datasets is presented in Section 4. In Section 5, the proposed method is applied to byproduct gas forecasting problems.Finally, Section 6 concludes the paper.

Preliminaries
2.1.Generic ELM.ELM is a type of single-layer neural network proposed by Huang et al. [24].It is well known for its fast training speed.Given a set  = {(x  ,   ),  = 1, 2, . . ., }, where x  is an input vector, and   is the corresponding target, we assume that ELM has ℎ hidden nodes and that the activation function is ().The mathematical formula of ELM can be expressed as where   is the weight vector that connects the th hidden neurons and the output neuron.Here,   is the connection weight between the corresponding input neuron and the th hidden neuron, and   is the bias of the th hidden neuron.Additionally,   and   are generated randomly, and they remain unchanged during the training progress.Therefore, formula (1) can be simplified as where H = [ According to the Moore-Penrose inverse theorem,  can be calculated in the sense of a least-square estimation as follows: 2.2.Formulation of PIs.Given a set  = {(x  ,   ),  = 1, 2, . . ., }, x  is an input vector, and   is the corresponding prediction target.The PI with nominal confidence (PINC) 100(1 − )% for the target   can be expressed as where   (x  ) and   (x  ) are the lower and upper bounds of the target   , respectively.Thus, the probability that   lies within   (x  ) is expected to be 100(1 − )%, as expressed by the following equation: where  is the number of data samples and According to the concept of a PI, the value of the PICP is expected to be greater than or equal to the predetermined confidence level; otherwise, the PIs are invalid.
A relatively high PICP can be easily achieved if the width of the PIs is sufficiently large, but wider PIs are less informative in practice.The PINAW, which can quantitatively describe the width of the PIs, is defined as where  max and  min , respectively, represent the maximum and minimum values of the targets.
To construct PIs to provide satisfactory performance, a higher PICP and lower PINAW are expected.However, the two indices conflict with each other.To find a compromise between them, a combined measure called coverage widthbased criterion (CWC) has been proposed [19].The CWC is defined as where and  are two hyperparameters of the CWC.Usually,  is set to the PINC, and  determines the amount of punishment when the PICP is lower than PINC.

The Proposed Method
3.1.Framework Overview.Our proposed method is derived from the LUBE method.As ELM is a new type of NN with simple structure and fast training speed, we adopt a twin ELM to construct the lower and upper bounds of PIs.The overall structure of the proposed method is shown in Figure 1.A two-step approach is proposed to determine the connection weights of the twin ELM.The flowchart of the proposed method is described in Figure 2.
In the first step, the connection weights of the twin ELM are pretrained using a pair of symmetric weighted objective functions to construct raw prediction intervals.By using a symmetric pair of weights, two ELM models (ELM up and ELM low ) can be obtained to generate the lower and upper bounds of the raw PIs.Then, as a characteristic of ELM, the values of the input weights are fixed, and only the output weights are further adjusted by the later optimal process.
Then, the second step is to improve the quality of the PIs by further adjusting the output weights based on PSO.

Objective Function.
In LUBE method, CWC can provide a good compromise between PICP and PINAW.It cannot describe the deviation of PIs, however.In order to evaluate PIs comprehensively, a modified cost function called CWDC that considers the deviation of the PIs is proposed.The CWDC is defined as where  is a parameter to adjust the punishment of the deviation.In this paper, it is set to 20.PI normalized average deviation (PINAD) is to quantitatively describe the deviation degree from the real values to the PIs and is defined as where  The training process is to minimize the objective function.However, when the PIs are poor-quality with zero-width (PINAW = 0%) and extremely low coverage probability (PICP = 0%), CWC will equal the minimum value 0. This will lead to a misleading evaluation of PIs.As for CWDC, by taking the deviation of PIs into consideration, the problem can be solved.

Weighted Pretraining of the Twin ELM.
In the proposed method, the pretraining is designed to get reasonable initial values for the output weights before optimization.In the LUBE method, the initial values of the connection weights are usually generated randomly and the training process is generally slow because there is a large search space.In this paper, the output weights of the twin ELM are pretrained using a pair of symmetric weighted objective functions.At the beginning of the following optimization process, initial population of PSO is generated around the pretrained output weights as reasonable initial values.
The details of the pretraining process are elaborated as follows.For a given set  = {(x  ,   ),  = 1, 2, . . ., }, x  is an input vector, and   is the corresponding prediction target, according to (2) and ( 4), and point forecasting ŷ of an ELM can be formulated as where The residual   of the th sample can be formulated as The twin ELM consists of two ELMs, which have the same input weights and bias as shown in Figure 1.For the ELM that generates the upper bound, a greater weight is given to the data that has a positive error.Otherwise, the ELM that generates the lower bound gives greater weight to the data that has a negative error.The weight functions are defined as follows: In the above two equations,  ∈ (0, 1) represents the confidence level.To reduce ELM overfitting, the norms of the weights are introduced to the loss function.As a result, the mathematical model of the twin ELM can be described as min min where Here,  is a positive weight factor for adjusting the proportion of empirical risk and structural risk.
The Lagrangian of ( 19) is defined as where  is the Lagrangian multiplier.In (21), the subscript up is omitted for brevity.According to the Karush-Kuhn-Tucker (KKT) theorem, the optimality conditions are given by According to (22), we can calculate  by the following equation: The method for solving (20) is the same as that for (19) and is therefore omitted for brevity.Then, the raw PIs constructed by the twin ELM can be expressed as where

Weight Adjustments by PSO.
The optimization stage is performed to further adjust the connection weights of the pretrained twin ELM.The goal of the optimization is to achieve the PIs of the best quality through optimizing the connection weights with respect to the objective function (12).Because the objective function of the proposed method is complex and nondifferentiable, gradient descent based techniques are not suitable for this problem.PSO is an efficient and gradient-free algorithm based on the principle of bionic metaheuristic [26] and was applied to solve a large number of optimization problems in recent years [27][28][29].
Accordingly, PSO is used to determine the final values of the twin ELM output weights.
Suppose that there are a group of  particles in a dimensional search space.The position of the th particle is x  = ( 1 ,  2 , . . .,   ), and its velocity is   .Each particle stores the optimal location of its flight k  = (V 1 , V 2 , . . ., V  ).The optimal location of all of the particles in the particle swarm is the globally optimal position   .Each particle adjusts its state according to (26): where  represents the number of iterations and  is the inertia weight that affects the flying speed of the particle.
Here,  1 and  2 are the acceleration coefficients, and  1 and  1 are random variables that are uniformly distributed on the domain [0, 1].With the continuous updating of the position and velocity, the particle swarm can find the optimal solution in the space.The inertia weight has an effect on the velocity of the particle.A larger value is appropriate for the global search, and a smaller value is appropriate for the local search.In the standard PSO algorithm, the inertia weight factor is a fixed value.To balance the search ability of the algorithm, we adopt the exponential decline algorithm to adjust the inertia weight.The inertia weight in the th interaction can be calculated based on the following equation: In (27),  is an attenuation coefficient.The details of how PSO tunes the output weights of the twin ELM are as follows: (1) Initialize the parameters of the modified PSO.Specify the maximum number of iterations, the population size, and the parameters of the flying progress, including the inertia weight  and the acceleration coefficients  1 and  2 .
(2) Initialize the first population.Note that each particle in the population represents a set of output weights, including  up and  low .The output weights are generated around the pretrained output weights by adding small random values, which are uniformly distributed between −0.1 and 0.1 to preserve the diversity.
(3) Calculate the fitness value of the target.For each particle in the population, the fitness value is calculated using (12).(5) Update the particle position and velocity according to (26) and ( 27).
(6) Repeat steps (3)-( 5) until the iteration number reaches the preset threshold value.Then, return the smallest value of the cost function CWDC and the best positions  up and  low .
(7) Evaluate the PIs that are generated by the optimal the twin ELM based on the test data.

Benchmark Test
4.1.Datasets.The effectiveness of the proposed method is tested on four benchmarks.The details of the four benchmarks are described as follows.
Case 1 is a 5D mathematical function that is usually used for evaluating regression models, and it is defined as where   ∈ [0, 1],  = 1, . . ., 5. A normally distributed noise (/ ratio = 4) is added to the outputs.Case 2 is a synthetic dataset used to model the data with heteroscedastic noise.The dataset is generated by the following function: where  ∈ [−10, 10].A normally distributed zero-mean noise that has an input-dependent variance () = 0.45 sin(0.6− 2) + 0.55 is added to the outputs.Case 3 investigates the relationship between the personal characteristics, dietary factors, and plasma levels of betacarotene [30].The datasets contain 315 observations on 12 input variables and 1 output variable.
Case 4 is taken from the UCI repository [31], which is collected from a combined cycle power plant.The input variables contain the temperature, ambient pressure, relative humidity, and exhaust vacuum.The target is to predict the next hourly electrical energy output.

Experiment Methodology. The comparison results of various prediction methods are discussed in this section.
To evaluate the performance of the pretraining method, the proposed method and the ELM without pretraining are first compared.Then the proposed method using CWDC as the objective function to is compared with the ELM using CWC (ELM-CWC method).The effectiveness of proposed method is also demonstrated by comparing it with the bootstrap method and the Bayesian method.The PI evaluation indices PICP, PINAW, and PINAD for the four cases are shown in tables.Considering the inherent randomness of PSO and ELM, each experiment is repeated 10 times for all the methods.The results of the median values of the indices are recorded.
The parameters of the PSO are assigned as follows:  1 = 1.5,  1 = 2.0, and  = 0.99.The population size is set to 100, and the maximum number of iterations is assigned to 400.In this case, the PINC is set to 90%, and  equals 0.1.For CWDC, the parameters are set as follows:  = 90%,  = 20, and  = 50.
For each dataset, the training set includes 200 samples and additional 100 observations is used for testing the models.To determine the optimal number of hidden nodes of ELM, fold cross validation is adopted.The training data are divided into  equal parts.In our study,  is set to 5. The model is trained on -1 parts, and the remaining part is used to validate the trained model.As a result, the optimal numbers of hidden nodes for Cases 1 and 2 are selected to be 12 and 15, respectively.The optimal numbers of hidden nodes selected for Cases 3 and 4 are 20 and 18, respectively, based on the experimental results.

Comparison of Different Initialization Methods.
In the proposed approach, pretraining method is designed to provide a reasonable initial value for the following optimization process.To evaluate the effectiveness of the pretraining strategy, randomly initialized ELM (RI-ELM method) is employed as a comparative approach.In RI-ELM, all the weights of ELM are given randomly and the output weights are directly tuned by PSO.To take a better comparison, CWDC is employed as the cost function of both methods.
The iterative processes of PSO training of the two methods in the four cases are shown in Figure 3.As can be seen in Figure 3, the proposed method converges faster than RI-ELM and it indicates that pretraining method accelerates the convergence speed of PSO.Test results of the four cases are shown in Table 1.It is demonstrated that the proposed method can generate PIs with higher performance compared to RI-ELM.The reason is that proposed method gives more reasonable output values of ELM in the initialization stage.

Comparison of ELM Based Methods Using CWC and CWDC.
In order to test the effectiveness of the proposed criterion CWDC, the ELM using CWC (ELM-CWC) is compared with the proposed method.For the superior performance of pretraining based ELM, it is used to construct PIs.The results of ELM-CWC and the proposed method are summarized in Table 2.As shown in Table 2, the proposed method achieves lower PINAWs and smaller PINADs with valid PICPs for all four cases.Because the deviation of the PIs is considered in the objective function, PIs with higher performance is obtained.
PINAD is a new index that is considered in CWDC and is proposed to show the deviation PIs.PINAD and PICP have some correlation.A higher PICP implies that more prediction points lie in the constructed PIs, which may lead to a smaller PINAD.However, the PICP can only represent the proportion of the prediction points lying in the constructed PIs, while the PINAD can show the deviation degree of the data that do not lie in the constructed PIs.Thus, the PINAD is a necessary supplement to the PICP.

Comparison Experiment with Conventional Methods.
PIs obtained by the proposed method of the four case studies are illustrated in Figure 4.This figure visually demonstrates that the constructed PIs cover the real targets in a high percentage.
As shown in Table 3, for Cases 1 and 2 the proposed method obtains the narrowest PINAWs and smallest PINADs with satisfactory PICPs (≥90%).The median PINAWs of the proposed method are 62.14% for Case 1 and 29.64% for Case 2, which are the narrowest for both cases.Additionally, the median PINADs of the proposed method are 0.61% for Case 1 and 0.27% for Case 2, which are smaller than the results of other methods.As reported in Table 3, the median PINAWs and PINADs of the proposed method for Cases 3 and 4 are lower than the results of the bootstrap method.It should also be noted that the proposed method does not provide the highest PICP and the lowest PINAD in these two cases.Relative to the Bayesian method, the PICPs of the proposed method are 3.33% and  2% lower in Cases 3 and 4, respectively, and the PINADs are 0.14% and 0.03% higher in cases 3 and 4, respectively.However, the proposed method generates much narrower PIs, and the PINAWs are 7.72% and 4.02% narrower than the Bayesian method in Cases 3 and 4, respectively.That finding implies that the PIs constructed by our method are more informative than those of the Bayesian method with satisfactory PICP (≥90%).

Prediction of the BFG Flow
In this section, the proposed method is used to solve a realworld problem; in this case, the method is used to handle the uncertainty of blast furnace gas (BFG) generation and consumption prediction in the iron and steel industry.

Datasets.
BFG is an important type of byproduct gas from the steel production process; it is used for coking, sintering, hot-rolling, and other production processes.The historical datasets of two typical BFG generation and consumption processes are chosen as case studies to illustrate the advantages of the proposed method in practical applications.The first dataset is BFG generation.The second dataset covers BFG consumption in a hot-rolling process.Both datasets come from the energy management center of a steel plant in China, Jiangsu Province.The sample interval is 5 min, and the total number of sampling points in each dataset is 288.The dataset is divided into two parts: the training set (188 points) and the test set (100 points).One-step-ahead forecasting using the proposed method is implemented and compared with the other three methods.

Determination of the Model Structure and Parameters.
Inputs of an NN are usually determined based on the characteristics of the system.BFG generation and consumption prediction can be seen as a time series prediction problem.The autocorrelation function (ACF) and partial autocorrelation function (PACF) are adopted as the criteria to determine the time lags of the input.The ACF and PACF of the BFG generation are displayed in Figures 5 and 6.The ACF has a decaying tendency, while the PACF shows a cutoff pattern after lag 3.As a result, the model order of this time series is 3. Similarly, the suitable order for the BFG consumption in the hot-rolling process is chosen to be 4.The number of hidden nodes of the ELM is selected based on the 5-fold validation method.According to the experiments, the number of hidden nodes for the blast furnace and hot-rolling process are chosen to be 12 and 18, respectively.
For these two cases, the parameters of the PSO are assigned as follows:  1 = 0.5,  2 = 1.5, and  = 0.99.The population size is set to 100, and the maximum number of iterations is assigned to be 400.To further validate the capacity of the proposed method, PINC is set to three different values, namely, 90%, 95%, and 99%, and thus,  = 0.1, 0.5, and 0.01, respectively.For the CWDC, the parameters are set as follows:  = 90%, 95%, and 99% and  = 20 and  = 50.4 and 5, respectively.According to Tables 4 and 5, the following conclusions can be drawn.
(1) Relative to the Bayesian method, the proposed method provides satisfactory PICPs at all confidence levels in both cases.For example, at the confidence level with PINC = 90%, the PICPs of the PIs constructed by the proposed method are 91% for the blast furnace and 91% for the hot-rolling process.are 87% for the blast furnace and 89% for the hotrolling process, which are lower than the PINC.This result occurs because the Bayesian method makes assumptions about the error distributions, but the assumptions cannot always be fitted to the facts.The performance of the proposed method is clearly better than that of the Bayesian method.(2) In both case studies, the proposed method provides better performance than the bootstrap method.Taking the case of the blast furnace as an example, the proposed method constructs PIs with PINAWs that are 2.80%, 3.08%, and 4.58% lower than those of the bootstrap method at the three confidence levels.This observation shows that the proposed method generates more informative PIs than the bootstrap method.
(3) The PINAWs of the PIs constructed by the proposed method are narrower than those of the LUBE method, and the PINADs of the PIs constructed by the proposed method are also much lower than those of the LUBE method.For example, the proposed method constructs PIs with PINADs that are 2.80%, 3.08%, and 4.58% lower than those of the LUBE method in the case of the hot-rolling process at the three confidence levels.The reason is that the CWDC proposed in this study considers the deviation of the PIs.
According to the results of the analysis, all four methods can be used to construct PIs for byproduct gas generation and consumption forecasting.The PIs generated by the four methods in most cases have valid PICPs and reasonable PINAWs and PINADs.However, it is clear that the proposed approach constructs higher quality PIs than the other three conventional methods.Thus the proposed method is an effective way to improve the quality of PIs for byproduct gas flow forecasting applications.

Conclusions
To quantify the uncertainties of byproduct gas forecasting, a two-step approach based on a twin ELM is proposed to construct high quality PIs.The twin ELM is first pretrained using a pair of symmetric weighted objective functions.Then, the output weights are further adjusted by PSO.During the procedure, a novel objective function, which accounts for the PI coverage probability, width, and deviation, is constructed to obtain optimal PIs.The effectiveness of the proposed method has been successfully verified through tests and comparisons with well-established methods using both benchmark datasets and real-world byproduct gas datasets.Demonstrated results show that not only valid PICPs and narrow PINAWs are obtained, but also small PINADs are achieved.The results have illustrated that the proposed method constructs high quality PIs for byproduct gas forecasting application.The constructed PIs can be applied to gas system scheduling for making reliable decisions.
Although we have only verified the effectiveness of the proposed method in iron and steel plants, it could be easily extended to other fields such as wind speed and electric load forecasting.Furthermore, as with other applications of ELM, the performance of the proposed method might be improved though the combination of the proposed method with structure selection strategy, for example, pruning strategy [32].Some other novel evolutionary algorithms could be applied to enhance the quality of constructed PIs.

Figure 1 :
Figure 1: The overall structure of the proposed method.

Figure 2 :
Figure 2: Flowchart of the proposed method.

( 4 )
Update the optimal values.The local best position  best  and the global best position  best  are updated through comparing the fitness values.

Table 1 :
Comparisons of proposed method and RI-ELM.

Table 2 :
Comparisons of proposed method and ELM-CWC.

Table 3 :
Test results in the four benchmark datasets.

Table 4 :
Comparisons of different methods for BFG generation.
The PICPs of the PIs constructed by the Bayesian method Mathematical Problems in Engineering

Table 5 :
Comparisons of different methods for BFG consumption in the hot-rolling process.