Prediction of Bending Force in the Hot Strip Rolling Process Using Multilayer Extreme Learning Machine



Introduction
In the hot strip rolling (HSR) process, the product quality of the strip is mainly determined by dimensional accuracy, mechanical properties, and surface properties. The dimensional accuracy of the strip is characterized by two important indicators: strip thickness and surface profile [1]. The surface profile of the strip is defined as the difference in thickness between the center and a point 40 mm from the edge of the strip; in other words, it is the difference in thickness across the width of the strip [2]. Many factors affect the surface profile of the strip, and they are mainly related to the roller, the strip, and the rolling conditions in the HSR process [3]. Hydraulic roll bending control is one of the main methods for hot strip profile control: applying a certain bending force to the work roll improves the strip profile. The schematic diagram of the hydraulic roll bending technique is shown in Figure 1. The hydraulic roll gap control device can be used as the actuator of the leveling operation. Setting an appropriate bending force, which affects both the walking behavior of the strip head and tail ends and the strip profile, is an urgent problem to be solved. The prediction accuracy of the roll bending force therefore directly affects the strip profile control accuracy, and a high prediction accuracy benefits the closed-loop feedback control of the bending force. In the actual production process, the bending force is calculated according to temperature, thickness, width, rolling force, material, roll thermal expansion, roll wear, and target profile [4].
Generally, setting the initial value of the bending force is very complicated and is a multivariable optimization problem. At the same time, the adjustment of the bending force must be calculated according to the strip profile at the exit. Some rolling parameters related to the bending force model are nonlinear, strongly coupled, and subject to large detection errors. Therefore, the mathematical model established by traditional theory suffers from slow response and low control accuracy in production practice. These problems seriously restrict further improvement of strip profile control accuracy. As rolling speeds continue to rise, the rolling process control system must be improved to adapt to changes in the rolling process and to stricter product requirements. Therefore, high-precision prediction methods for the rolling process based on industrial big data have attracted attention [5].
The artificial neural network (ANN) is the earliest and most widely used data-driven method. ANN has been widely used in the field of metallurgy and materials because of its capacity for parallel information processing, function approximation, self-learning, self-adaptation, and fault tolerance [6-8]. In the HSR process, ANN has been proposed to predict the roll force [9-12], the strip profile and flatness [2, 13-15], and mechanical properties [16-18]. With the continuous development of ANN prediction in HSR, further applications have emerged. Wang et al. [4] were the first to establish an ANN model to predict the bending force of the strip. To improve the performance of the model, they also optimized it with a genetic algorithm (GA) and obtained a GA-ANN model with good prediction performance. Although the ANN method improves prediction performance to a certain extent, some problems remain, such as the slow learning speed of the training algorithm and the difficulty of adjusting numerous parameters. Therefore, new machine learning algorithms have been applied to various prediction problems in the rolling field. Huang et al. [19] proposed a new learning algorithm for single-hidden-layer feedforward neural networks (SLFNs), called the extreme learning machine (ELM), which randomly chooses the input weights and determines the output weights of the SLFN by a least-squares solution. Compared with traditional gradient-based approaches, the ELM has a significantly faster learning speed and better generalization performance. ELM has been widely used in metallurgy and metal processing, for example, for the mechanical properties of hot rolling products [20], rolling force prediction in HSR [21], silicon content prediction [22], alumina concentration detection [23], gas utilization ratio prediction in blast furnaces [24], and tool fault diagnosis in numerical control machines [25].
In this paper, ELM is first applied to predict the bending force of the strip in HSR. Furthermore, three improved ELM-based models are developed to realize the high precision and high stability prediction of bending force. Section 2 introduces the theory and optimization strategies of the ELM model. Section 3 shows the experimental data, prediction results, and discussion. Finally, the conclusion is drawn in Section 4.

ELM.
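The basic theory of the ELM model states that, for $N$ arbitrary distinct samples $(x_j, t_j)$, where $x_j = [x_{j1}, x_{j2}, \ldots, x_{jn}]^T \in R^n$ and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T \in R^m$ are the model input and the target data, standard SLFNs with $N$ hidden layer nodes and an activation function $g(x)$ are mathematically described as follows:

$$\sum_{i=1}^{N} \beta_i \, g(w_i \cdot x_j + b_i) = o_j, \quad j = 1, 2, \ldots, N,$$

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the weight vector connecting the ith hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the weight vector connecting the ith hidden node and the output nodes, $o_j = [o_{j1}, o_{j2}, \ldots, o_{jm}]^T$ is the model output, and $b_i$ is the bias of the ith hidden node. $w_i \cdot x_j$ denotes the inner product of $w_i$ and $x_j$. The $w_i$ and $b_i$ parameters can be randomly assigned if the activation function is infinitely differentiable, so only the $\beta$ parameters need to be optimized when minimizing the mean squared error between the model output and the target data. That the standard SLFNs with $N$ hidden nodes and activation function $g(x)$ can approximate these $N$ samples with zero error means that $\sum_{j=1}^{N} \| o_j - t_j \| = 0$. Therefore, in the ELM approach, training an SLFN is equivalent to simply finding the least-squares solution of the linear system, which can be written compactly as follows:

$$H \beta = T,$$

where the hidden layer output matrix $H$, the output weight matrix $\beta$, and the target data matrix $T$ are

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_N \cdot x_1 + b_N) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_N \cdot x_N + b_N) \end{bmatrix}, \quad \beta = \begin{bmatrix} \beta_1^T \\ \vdots \\ \beta_N^T \end{bmatrix}, \quad T = \begin{bmatrix} t_1^T \\ \vdots \\ t_N^T \end{bmatrix}.$$

Algebraically, the linear system for $\beta$ is solved via the Moore-Penrose generalized inverse $H^{-1}$, that is, $\beta = H^{-1} T$.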
The principle that distinguishes ELM from the traditional neural network methodology is that the parameters of the feedforward network (input weights and hidden layer biases) do not need to be tuned. Tamura and Tateishi [26] and Huang et al. [27] showed that SLFNs with randomly chosen input weights efficiently learn distinct training examples with minimum error. After the input weights and the hidden layer biases are randomly chosen, the SLFN can be considered simply as a linear system. The output weights, which link the hidden layer to the output layer of this linear system, can then be determined analytically through a simple generalized inverse operation on the hidden layer output matrix. This simplified approach makes the ELM model many times faster than ANN [27]. Figure 2 shows the basic schematic topological structure of an ELM network.
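The training procedure described above (random hidden layer, then a linear least-squares solve for the output weights) can be sketched in a few lines. This is a minimal illustration with a single output, not the authors' implementation: the function names `train_elm` and `predict_elm`, the weight range $[-1, 1]$, and the seed are assumptions, and a small ridge term with a Gaussian-elimination solve stands in for the Moore-Penrose generalized inverse so that the sketch needs only the standard library.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def hidden_row(x, weights, biases):
    # One row of H: g(w_i . x + b_i) for every hidden node i.
    return [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            for w, b in zip(weights, biases)]


def solve_linear(a, rhs):
    # Gaussian elimination with partial pivoting for a square system.
    n = len(a)
    m = [a[i][:] + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][c] * sol[c] for c in range(i + 1, n))
        sol[i] = (m[i][n] - tail) / m[i][i]
    return sol


def train_elm(xs, ts, n_hidden, ridge=1e-6, seed=0):
    # Assumption: weights and biases drawn uniformly from [-1, 1].
    rng = random.Random(seed)
    n_in = len(xs[0])
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)]
               for _ in range(n_hidden)]
    biases = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
    h = [hidden_row(x, weights, biases) for x in xs]
    # Stand-in for beta = pinv(H) T: solve (H^T H + ridge * I) beta = H^T t.
    gram = [[sum(row[i] * row[j] for row in h) + (ridge if i == j else 0.0)
             for j in range(n_hidden)] for i in range(n_hidden)]
    proj = [sum(row[i] * t for row, t in zip(h, ts)) for i in range(n_hidden)]
    beta = solve_linear(gram, proj)
    return weights, biases, beta


def predict_elm(model, x):
    weights, biases, beta = model
    return sum(b * h for b, h in zip(beta, hidden_row(x, weights, biases)))
```

Only `beta` is fitted; the hidden layer parameters are never updated after they are drawn, which is the source of the speed advantage discussed above.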

ELM Optimization by GA (GAELM).
Because the weights and biases are randomly assigned in the ELM algorithm, training may not bring the network to its optimal state. Therefore, in this study, we introduce the GA to optimize the weights and biases of the ELM network. GA is a parallel random search optimization method based on the natural genetic mechanism and natural selection theory of the biological world, and it is well suited to complex nonlinear optimization problems [28]. Starting from a random initial set of solutions, the GA applies selection, crossover, and mutation operators and iteratively generates new solutions. After a certain number of generations, a solution close to the global optimum is obtained. The selection operator selects individuals with strong vitality in the group. The roulette wheel selection method is adopted; in its formula, $F_i$ is the fitness value of the ith individual and $N$ is the population size. In the crossover operation, a pair of chromosomes (individuals) exchange information with each other in some way. The crossover of node $A_k$ of the kth chromosome and node $A_l$ of the lth chromosome at position $j$ is controlled by $\eta$, a random number between 0 and 1. The crossover operation is shown in Figure 3.
The mutation operation is achieved by flipping randomly selected bits (see Figure 4), and the mutation probability $p_m$ is usually small. An individual in the population is selected by evaluating its fitness, and it can remain in the new generation if a certain fitness threshold is reached. Individuals with higher fitness are more likely to reproduce. In the mutation formula, $A_{max}$ is the upper bound of $A_{ij}$; $A_{min}$ is the lower bound of $A_{ij}$; $G$ is the current iteration number; $G_{max}$ is the maximum number of iterations; and $r_1$ and $r_2$ are random numbers between 0 and 1. The overall process for optimizing ELM with GA is shown in Figure 5.
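The three GA operators described above can be sketched as follows. The exact formulas are not reproduced in this text, so the forms below are assumptions: fitness-proportional roulette selection (larger $F_i$ is taken to be better), arithmetic crossover weighted by $\eta$, and a non-uniform mutation whose step shrinks as $G$ approaches $G_{max}$. These are common choices consistent with the symbols defined in the text; the function names are hypothetical.

```python
import random


def roulette_probabilities(fitness):
    # Assumed form: p_i = F_i / sum_j F_j.
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(fitness, rng):
    # Spin the wheel once and return the index of the chosen individual.
    spin = rng.random()
    cumulative = 0.0
    for index, p in enumerate(roulette_probabilities(fitness)):
        cumulative += p
        if spin < cumulative:
            return index
    return len(fitness) - 1  # guard against rounding at the top of the wheel


def arithmetic_crossover(a_k, a_l, j, eta):
    # Assumed form: blend the genes at position j with weight eta in [0, 1].
    child_k, child_l = a_k[:], a_l[:]
    child_k[j] = a_k[j] * (1.0 - eta) + a_l[j] * eta
    child_l[j] = a_l[j] * (1.0 - eta) + a_k[j] * eta
    return child_k, child_l


def nonuniform_mutation(a_ij, a_min, a_max, g, g_max, r1, r2):
    # Assumed form: the step r2 * (1 - G / G_max)^2 decays over generations,
    # and r1 decides whether the gene moves toward the upper or lower bound.
    step = r2 * (1.0 - g / g_max) ** 2
    if r1 > 0.5:
        return a_ij + (a_max - a_ij) * step
    return a_ij - (a_ij - a_min) * step
```

With these forms, crossover preserves the sum of the two genes at position $j$, and a mutated gene always stays inside $[A_{min}, A_{max}]$.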

GA with Simulated Annealing (SAGA).
In the early stage of the traditional GA, the differences between individuals are large. When the classic roulette method is used, the number of new individuals is proportional to the fitness of the original individuals, so the offspring of a few fit individuals can easily flood the whole population, which causes premature convergence. In the later stage, the fitness values tend to be similar, and superior individuals have no obvious advantage in producing new individuals, which stalls the evolution of the whole population [29]. Therefore, it is necessary to stretch the fitness properly. The simulated annealing (SA) algorithm proposed by Metropolis can realize this stretching of the fitness function [30]. In this study, SA was used to optimize the selection process of GA. The core of the SA algorithm is the Metropolis criterion, that is, the probability with which SA accepts a new solution. For an optimization problem that minimizes the objective function, the probability that SA accepts the new solution is as follows:

$$P = \begin{cases} 1, & f(x') < f(x), \\ \exp\left(-\dfrac{f(x') - f(x)}{T}\right), & f(x') \ge f(x), \end{cases}$$

where $x$ represents the current solution; $x'$ represents the new solution; $f(x)$ represents the objective function value of the current solution; $f(x')$ represents the objective function value of the new solution; and $T$ represents the temperature.
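As a small illustration of the Metropolis criterion for a minimization problem, the acceptance rule can be sketched as below. The function names are hypothetical, and how the criterion is wired into the GA selection step (for example, the cooling schedule for $T$) is not specified here and is left out.

```python
import math
import random


def acceptance_probability(f_current, f_new, temperature):
    # A better (smaller) objective value is always accepted.
    if f_new < f_current:
        return 1.0
    # A worse value is accepted with probability exp(-(f_new - f_current) / T).
    return math.exp(-(f_new - f_current) / temperature)


def metropolis_accept(f_current, f_new, temperature, rng):
    # Draw a uniform number in [0, 1) and compare it with the probability.
    return rng.random() < acceptance_probability(f_current, f_new, temperature)
```

At a high temperature, worse solutions are accepted often, which keeps the search diverse; as $T$ falls, the rule approaches a purely greedy choice.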

An Improved Multilayer ELM.
As early as 1997, Tamura and Tateishi showed that, as the number of network layers increases, the prediction accuracy improves and the number of hidden layer nodes required decreases greatly [26]. Qu et al. [31] proposed a two-layer ELM (TELM) neural network. The TELM is composed of an input layer, two hidden layers, and an output layer, and the neurons of adjacent layers are fully connected. The TELM retains some of the advantages of the ELM, such as strong generalization ability, fast operation, and a very small chance of overfitting. The principle of the TELM is as follows. First, the input weights $W$ and biases $B$ of the first hidden layer are randomly generated; the first hidden layer input parameter matrix is defined as $W_{IE} = [B \; W]$, and the augmented matrix of the input matrix is defined as $X_E = [1 \; X]^T$. Second, the output matrix of the first hidden layer is calculated. Third, the output weight matrix between the first hidden layer and the final output layer is calculated as in the traditional ELM. Fourth, the expected output matrix of the second hidden layer is calculated. Fifth, the augmented matrix $H_E$ of the first hidden layer output matrix is constructed, and the second hidden layer input parameter matrix is calculated from it. Sixth, the output matrix of the second hidden layer is calculated. Seventh, the output weight matrix between the second hidden layer and the final output layer is calculated, which gives the final output of the network. In these calculations, the superscript $-1$ represents the Moore-Penrose generalized inverse operation, and $g^{-1}$ represents the inverse function of $g$. The basic schematic topological structure of a TELM network is shown in Figure 6.
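The TELM calculation sequence described above can be written out as a chain of equations. The following is a sketch under a row-per-sample convention (each row of $X$ and $T$ is one sample, so the matrices are the transposes of the column form used in the text), chosen here so that all matrix dimensions agree. This convention and the symbols $H$, $H_1$, $H_2$, $W_{HE}$, and $\beta_{new}$ are assumptions for illustration, following the usual presentation of TELM, and may differ from the paper's exact notation. As in the text, the superscript $-1$ denotes the Moore-Penrose generalized inverse.

```latex
\begin{aligned}
H           &= g(X_E W_{IE}),          &\quad X_E &= [\mathbf{1} \;\; X] \\
\beta       &= H^{-1} T                                                  \\
H_1         &= T \beta^{-1}                                              \\
W_{HE}      &= H_E^{-1}\, g^{-1}(H_1), &\quad H_E &= [\mathbf{1} \;\; H] \\
H_2         &= g(H_E W_{HE})                                             \\
\beta_{new} &= H_2^{-1} T                                                \\
f(X)        &= H_2 \beta_{new}
\end{aligned}
```

Here $H_1$ is the output the second hidden layer would need in order to reproduce $T$ with the first-stage weights $\beta$, and $W_{HE}$ is chosen so that $g(H_E W_{HE})$ approximates it. For the sigmoid, $g^{-1}(h) = \ln\left(h / (1 - h)\right)$, which is defined only for $0 < h < 1$, so the entries of $H_1$ must lie in that range before the inverse is applied.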
To obtain a more stable prediction output and better generalization performance, we calculate the output weight matrix $\beta$ of the TELM by distinguishing three cases: the number of training samples is greater than, less than, or equal to the number of hidden layer nodes. In the formulas for these cases, $\lambda$ is a uniformly distributed random number between 0 and 1 [32]. This method is named improved TELM (ITELM). Building on this work, we further optimized the ITELM network with the SAGA algorithm; the resulting model is called SGITELM.

Experimental Results and Analysis
Data and Processing.
In this paper, the final stand rolling data of a 1580 mm HSR process in a steel factory are collected for the experiments. The input variables of the proposed bending force prediction model are entrance temperature (°C), entrance thickness (mm), exit thickness (mm), strip width (mm), rolling force (kN), rolling speed (m/s), roll shifting (mm), yield strength (MPa), and target profile (μm). The output variable of the model is the bending force (kN). A total of 1300 steel data samples are employed in the experiments, and the dataset is divided into two subsets: a training set (70%) and a testing set (30%). The training dataset is used to determine the model structure and select the training parameters. The testing dataset is used as an unseen validation dataset to verify the model performance. The fractal dimension visualization diagram of the collected dataset is shown in Figure 8. Clearly, the input data vary considerably across dimensions. To keep the large difference in scale between input and output data from increasing the prediction error, and to make it convenient to update the weights and biases during modeling, the data must be scaled proportionally to a small interval. Normalization is therefore required before the data enter the model [33]. In the normalization formula, $y_i'$, $y_i$, $y_{max}$, and $y_{min}$ are the normalized data, the original data, the maximal data, and the minimal data, respectively.
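A minimal sketch of the normalization step follows. The target interval is not stated in this text, so scaling each variable to $[0, 1]$ with $y_i' = (y_i - y_{min}) / (y_{max} - y_{min})$ is an assumption, as are the function names; a constant column is mapped to zeros to avoid division by zero.

```python
def min_max_normalize(values):
    # Assumed form: y' = (y - y_min) / (y_max - y_min), giving values in [0, 1].
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant variable: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(rows):
    # Normalize each input variable (column) independently.
    columns = [min_max_normalize(list(col)) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]
```

For example, `min_max_normalize([10.0, 20.0, 30.0])` returns `[0.0, 0.5, 1.0]`. Normalizing per column puts variables with very different ranges, such as rolling force in kN and rolling speed in m/s, on a comparable scale.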

Evaluation Criteria.
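The model performance is assessed by calculating the mean absolute error (MAE), root-mean-squared error (RMSE), and mean absolute percentage error (MAPE) on the testing dataset. The formulas of the three evaluation criteria are as follows:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - x_i \right|,$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - x_i \right)^2},$$

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{y_i - x_i}{y_i} \right| \times 100\%,$$

where $n$ denotes the number of samples and $y_i$ and $x_i$ are the measured value and the predicted value of the ith sample, respectively.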
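The three criteria are straightforward to compute; the sketch below uses the standard definitions of MAE, RMSE, and MAPE with measured values $y_i$ and predicted values $x_i$. The function names are illustrative, and MAPE assumes that no measured value is zero.

```python
import math


def mae(measured, predicted):
    return sum(abs(y - x) for y, x in zip(measured, predicted)) / len(measured)


def rmse(measured, predicted):
    squared = sum((y - x) ** 2 for y, x in zip(measured, predicted))
    return math.sqrt(squared / len(measured))


def mape(measured, predicted):
    # Returned as a percentage; every measured value must be nonzero.
    ratios = sum(abs((y - x) / y) for y, x in zip(measured, predicted))
    return 100.0 * ratios / len(measured)
```

As a worked example, for measured bending forces of 100 kN and 200 kN and predictions of 110 kN and 190 kN, both absolute errors are 10 kN, so MAE and RMSE are both 10 kN, while MAPE is $(10\% + 5\%) / 2 = 7.5\%$.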

Determining the Best Configuration for ELM.
ELM can fit complex nonlinear functions well, mainly because an activation function is used in the hidden layer. The activation function allows the model to learn complex, arbitrary mappings that represent the nonlinear relationship between input and output. This paper tests the impact of the commonly used activation functions "Radbas," "sin," "sigmoid," "Hardlim," and "Tribas" on the performance of the prediction model. The formulas for these activation functions are as follows:

$$\text{Radbas:} \quad g(x) = e^{-x^2},$$

$$\text{sin:} \quad g(x) = \sin(x),$$

$$\text{sigmoid:} \quad g(x) = \frac{1}{1 + e^{-x}},$$

$$\text{Hardlim:} \quad g(x) = \begin{cases} 1, & x \ge 0, \\ 0, & \text{otherwise}, \end{cases}$$

$$\text{Tribas:} \quad g(x) = \begin{cases} 1 - |x|, & -1 \le x \le 1, \\ 0, & \text{otherwise}. \end{cases}$$

Besides the activation function, the number of hidden layer nodes also plays a crucial role in the accuracy and generalization ability of the model. Too many hidden layer nodes inevitably leave some node units invalid or redundant, which greatly harms the generalization ability of the model. Too few nodes reduce the accuracy of the model. The number of nodes must therefore be controlled to limit redundant nodes while ensuring the prediction accuracy of the model. Table 2 lists the RMSE of the ELM algorithm for each combination of activation function and number of hidden layer nodes. It shows that, for the same number of hidden layer nodes, the RMSE corresponding to the "sigmoid" activation function is always the smallest. Table 2 also shows that, when "sigmoid" is used as the activation function, the RMSE first decreases and then increases as the number of hidden layer nodes grows, which is consistent with the description above. Although the RMSE values of the activation functions "Radbas," "Hardlim," and "Tribas" are still decreasing as the number of nodes increases, they remain larger than the RMSE obtained with "sigmoid" and 90 hidden layer nodes.
For these three activation functions, a relatively small RMSE might be reached when the number of hidden layer nodes is large enough, but more hidden layer nodes mean higher model complexity. Therefore, "sigmoid" is finally selected as the activation function, and the number of hidden layer nodes is set to 90.

ELM Optimized by SAGA.
Among the SAGA parameters that must be determined, population size and crossover probability are discussed in this section. A large population size may slow convergence, while a small population size may trap the search in a local optimum. The crossover probability determines whether two individuals are crossed. According to the results in Table 3, a population size of 10 gives the lowest RMSE of 12.3072. The effect of crossover probabilities from 0.4 to 0.9 at intervals of 0.1 is listed in Table 4, and the optimal crossover probability is 0.7. Based on these experiments, the ELM model optimized by SAGA (SGELM) is finally established, and its parameters are listed in Table 5. Figure 9 shows the optimization procedures of GAELM and SGELM. The fitness curve of GAELM converges completely after the 60th iteration, while the fitness curve of SGELM only starts to converge from the 86th iteration. However, the final fitness value of SGELM is better than that of GAELM. This indicates that, although SGELM converges more slowly, its solution quality is better than that of GAELM.

Determining the Best Configuration for TELM.
The number of nodes in each hidden layer plays an important role in the capacity of TELM, and there is no accepted theory for determining it. In this study, "sigmoid" is first fixed as the activation function, and the two hidden layers are given the same number of nodes [32]. Based on the results shown in Table 6, a TELM with the hidden layer structure 40-40 has the lowest RMSE of 12.9550.

Comparison with Different ELM-Based Methods.
The prediction results of ELM, GAELM, SGELM, and SGITELM and the real bending force values are shown in Figure 10. For better visualization, only the predicted values for 40 samples of the testing dataset are shown. As can be seen from Figure 10, the bending force values predicted by the four ELM-based models are relatively close to the measured bending force values, which shows that the four ELM-based models can be applied to the prediction of bending force in HSR. Among the four models, ELM has the poorest prediction performance, because its predicted values are farthest from the measured values. The prediction performance of ELM, GAELM, SGELM, and SGITELM is represented by the scatter plots in Figure 11. The color scale grades the absolute error: as the color goes from red to blue, the absolute error increases from 0 to 25 kN, and pink spots indicate an error of over 25 kN. For higher production requirements, the absolute error between the predicted and measured bending force is expected to be less than 25 kN. Therefore, the more pink spots a model has, the worse its prediction performance. In Figure 11, the predicted values are distributed evenly and symmetrically on both sides of the diagonal line. ELM has the largest number of pink spots and SGITELM has the fewest; 98.72% of the absolute errors of the SGITELM predictions are less than 25 kN. This also shows that the prediction performance improves after the model is optimized. A reasonable error distribution is of great significance for analyzing the feasibility of a model. Figure 12 shows the histograms and distribution curves of the errors of ELM, GAELM, SGELM, and SGITELM. The error distribution curves have the bell shape of a normal distribution, which indicates that the prediction errors of all models are approximately normally distributed.
SGITELM performs best: its normal distribution curve is higher and narrower, with the smallest standard deviation σ, which indicates that more of its predictions have small errors.
These results further support the superiority of the prediction performance of the SGITELM algorithm. Figure 13 shows the boxplot of the relative errors of the proposed methods. In general, the relative error reflects the deviation between the predicted value and the measured value, and it reflects the reliability of the prediction performance better than the absolute error. The boxplot represents the spread of the relative error through its quartiles. SGITELM has the fewest outliers and the smallest quartile range, which indicates that SGITELM has advantages over the other bending force prediction methods.
To evaluate the prediction accuracy of these ELM-based methods more intuitively, the results of the three evaluation criteria are given in Figure 14. The results show that the four models rank in the same order under MAE, MAPE, and RMSE; the errors, from largest to smallest, are ELM > GAELM > SGELM > SGITELM. The proposed SGITELM has the best prediction performance, with MAE, MAPE, and RMSE values of 9.0893, 1.1433%, and 11.2678, respectively. These results indicate that the proposed SGITELM is more suitable for bending force prediction of the hot strip than the traditional ELM methods because of its higher prediction accuracy and better generalization performance.

Conclusion
In this paper, four ELM-based methods for predicting the bending force were proposed. A total of 1300 steel data samples were collected to train and test the models. The number of hidden layer nodes, the activation function, and the hidden layer structure were tested separately to determine the structure of the ELM. The prediction performance of ELM, GAELM, SGELM, and SGITELM was evaluated, and the prediction accuracy was compared using the three criteria MAE, MAPE, and RMSE. The prediction accuracy of the ELM-based methods is significantly influenced by the number of hidden nodes, the activation function, the parameter settings of SAGA, and the hidden layer structure of the TELM. All four ELM-based methods work well for bending force prediction in hot strip rolling. The improved SGITELM has the highest prediction accuracy and the best generalization performance and is therefore the recommended method for bending force prediction in hot strip rolling. Introducing ELM-based methods into other prediction tasks in the hot rolling industry may bring further production and economic benefits.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.