
In order to explore a dynamic prediction model with good generalization performance for the [Si] content of molten iron, an improved SVM algorithm is proposed to enhance its practicability on the big data sample sets of the smelting process. Firstly, we propose a parallelization scheme that designs an SVM solution algorithm based on the MapReduce model under a Hadoop platform to improve the solution speed of the SVM on big data sample sets. Secondly, exploiting the property of stochastic subgradient projection that the execution time of the SVM solver does not depend on the size of the sample set, a structured SVM algorithm based on the affinity propagation (AP) clustering algorithm is proposed; on this basis, a parallel algorithm for solving the covariance matrix of the training set and a parallel algorithm for the stochastic subgradient projection iteration are designed.

Smooth blast furnace operation and hot metal quality are the primary goals of iron-making process control. The silicon content of hot metal is an important characterization parameter for slag quality, tapping temperature, and hot metal quality, and the fluctuation degree of the silicon content is an important parameter for the operation of a blast furnace. Therefore, it is necessary to construct a precise dynamic prediction model of the silicon content in molten iron. The support vector machine (SVM) algorithm is a widely used machine learning algorithm. The core idea of this algorithm is the maximum interval theory [

To address the bottleneck of the SVM solution algorithm, researchers have developed parallelization strategies according to the iterative principle and convergence speed of the algorithm. Among the many parallel computing environments, cloud computing, an extension of distributed computing, is a mainstream parallel computing research platform [

SVM, as a machine learning algorithm with a solid mathematical foundation and excellent generalization performance, can use the Hadoop cloud computing platform and parallel computing to break through its bottleneck in processing big data sets and broaden the application scope of the algorithm.

Xiaole et al. [

The stochastic subgradient projection algorithm is a typical algorithm for solving convex quadratic programming problems, which has a strong representativeness when it is used for solving SVM algorithms. Based on the deep analysis of the SVM solution process and stochastic subgradient projection algorithm, a parallel SVM algorithm using the stochastic subgradient projection algorithm and considering the structure of sample data is designed in this paper, with the help of the MapReduce model on the Hadoop cloud computing platform. The algorithm is applied to deal with the big historical data produced in the process of blast furnace production in order to obtain the efficient dynamic prediction model of [Si] in molten iron.

Hadoop, one of the distributed system infrastructure frameworks, has the advantages of high reliability, high efficiency, and freedom of scalability. It is a software platform for distributed processing of big data [

The core idea of MapReduce distributed computing technology [

The MapReduce programming model takes the key-value pair as its input form, and its execution process can be seen as a key-value pair conversion to another batch-value pair output process [
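The key-value conversion described above can be illustrated with a minimal, single-process Python sketch of the map, shuffle, and reduce phases. The word-count job and all function names here are illustrative stand-ins, not part of the paper's implementation:

```python
from collections import defaultdict

def map_phase(records, mapper):
    """Apply the user mapper to every input record, emitting (key, value) pairs."""
    pairs = []
    for rec in records:
        pairs.extend(mapper(rec))
    return pairs

def shuffle_phase(pairs):
    """Group intermediate values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups, reducer):
    """Apply the user reducer to each (key, [values]) group."""
    return {key: reducer(key, values) for key, values in groups.items()}

# Toy job: count word occurrences across input lines.
lines = ["hot metal silicon", "silicon content", "hot metal"]
mapper = lambda line: [(word, 1) for word in line.split()]
reducer = lambda key, values: sum(values)
counts = reduce_phase(shuffle_phase(map_phase(lines, mapper)), reducer)
# counts["silicon"] == 2 and counts["hot"] == 2
```

On a real Hadoop cluster the shuffle is performed by the framework between distributed mappers and reducers; this sketch only mirrors the key-value contract.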

Process flow of MapReduce.

HDFS is the foundation of data storage management in distributed computing; it has the advantages of high reliability, strong expansibility, and high throughput. The premises and goals of the system design are as follows [

Since HDFS runs on commodity hardware, any component may fail, so error detection and quick recovery are core goals of HDFS.

Data on HDFS is read as massive data streams, so high throughput is required when accessing data.

Large-scale data sets require HDFS to support large-file storage and to provide high data transmission bandwidth.

HDFS adopts a write-once, read-many access model for files, which simplifies data consistency and enables high-throughput data access.

The cost of moving computation is lower than that of moving data. Placing a computing task close to the data it operates on reduces network congestion and improves system throughput.

The system should be portable across heterogeneous hardware and software platforms.

HDFS uses master-slave architecture (see Figure

Structure sketch map of HDFS.

HDFS uses a file access model that writes once but reads many times, which gives HDFS high-throughput data access. In addition, HDFS has a variety of reliability assurance measures: it stores multiple copies of data to ensure data stability, and it also ensures reliability through heartbeat detection, safe mode, block reporting, and space-recycling mechanisms.

The support vector machine is a machine learning algorithm based on the VC dimension theory and structural risk minimization principle in statistical learning [

If there is a hyperplane that can correctly divide the +1 and −1 sample sets, the data set is called linearly separable. The purpose of applying SVM is to find the two parallel hyperplanes with the largest class spacing based on the distance between the two classes. The separating hyperplane can be described by

Linearly separable case of an SVM optimal hyperplane.

Based on the idea of the maximum interval theory, the SVM model can be described by Formula (

In Formula (

The optimal decision function of SVM can be described by Formula (

The sample set may contain noise or other disturbances introduced in the acquisition process, so that the sample categories cannot be completely separated by a hyperplane [

The purpose of introducing the penalty term is to enhance the fault tolerance of the SVM classifier: the larger the penalty factor, the greater the penalty. Equation (

If the data sample set is not linearly separable, the original sample vectors can be mapped from the original space to a higher-dimensional feature space with a nonlinear function, as shown in Figure

Nonlinear case of a schematic diagram of the SVM kernel function.

The problem of solving SVM is essentially a quadratic programming problem, on which scholars have done a lot of research. Among the methods, the most typical are the interior point method and the decomposition method.

The interior point method [

The decomposition method [

Pegasos is a gradient-based algorithm that can be applied directly to the original (primal) problem of SVM. The stochastic gradient descent step and the projection step are performed alternately in each iteration, and a number of samples are drawn from the whole training set to calculate the subgradient in each round.

The SVM problem is described as

In formula (

Formula (

So

The Pegasos requires that the number of the iterations is

The Pegasos algorithm is described as follows:

Input: Training data set

Output: The normal vector of a hyperplane

Main procedure:

1. Initialize the vector w: arbitrarily select a vector w_1 that satisfies the norm constraint

2. For

2.1 Select k samples from the training set S, subset

2.2 Determine the learning efficiency

2.3 The use of

The sub-gradient direction of the objective function can be expressed as:

2.4 Update:

2.5 Projection steps:

3. Get the final result

The convergence of the Pegasos algorithm for the SVM solution does not depend on the number of samples, even for sample sizes on the order of 10^{5} to 10^{6}. Therefore, when dealing with large-scale samples, selecting as many samples as possible in each round of the iterative process usually reduces the number of iterations required by the algorithm. According to the above, the key steps of parallelizing the Pegasos algorithm are as follows [
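As a concrete illustration of the serial algorithm described above, the following is a minimal NumPy sketch of mini-batch Pegasos. The toy data, parameter values, and function names are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def pegasos(X, y, lam=0.1, k=8, T=1000, seed=0):
    """Mini-batch Pegasos: stochastic subgradient steps alternated with a
    projection onto the ball of radius 1/sqrt(lam). X: (n, d), y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)                           # w_1 = 0 satisfies the norm constraint
    for t in range(1, T + 1):
        idx = rng.choice(n, size=min(k, n), replace=False)  # subset A_t of k samples
        Xb, yb = X[idx], y[idx]
        eta = 1.0 / (lam * t)                 # learning rate eta_t = 1/(lam * t)
        viol = yb * (Xb @ w) < 1              # margin violators (nonzero subgradient)
        grad = lam * w - (Xb[viol].T @ yb[viol]) / len(idx)
        w = w - eta * grad                    # subgradient step
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:                     # projection step
            w *= radius / norm
    return w

# Toy linearly separable data: label is the sign of the first coordinate.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
w = pegasos(X, y)
acc = float(np.mean(np.sign(X @ w) == y))     # close to 1.0 on this toy data
```

Note that the per-iteration cost depends only on the mini-batch size k and the dimension d, not on the number of samples n, which is the property the parallelization exploits.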

The covariance matrix of the data contains the generation trend of the data in the statistical sense and can effectively reflect its structural information. The clustering information of the data is likewise important for describing its structure. This paper uses the affinity propagation clustering algorithm [

The clustering mechanism of the AP clustering algorithm is "message passing": by passing messages between data points, the final cluster centers can be identified. Two important parameters are involved, the attractiveness (responsibility) and the attribution (availability). The larger the former, the more likely the candidate cluster center is to become an actual cluster center; the larger the latter, the more likely the sample point is to belong to that cluster center. The message passing is implemented by iteratively updating the attractiveness matrix

The attribution matrix

The membership matrix

In Formulas (
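A minimal NumPy sketch of the AP message-passing iteration follows; the paper's "attractiveness" and "attribution" matrices correspond to the responsibility matrix R and availability matrix A of the standard formulation. The damping value, toy similarity matrix, and preference choice are illustrative assumptions:

```python
import numpy as np

def affinity_propagation(S, damping=0.7, iters=200):
    """Message-passing AP clustering on a similarity matrix S.
    R (responsibility) ~ the paper's attractiveness matrix,
    A (availability)  ~ the paper's attribution matrix."""
    n = S.shape[0]
    R = np.zeros((n, n))
    A = np.zeros((n, n))
    for _ in range(iters):
        # r(i,k) = s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        idx = np.argmax(AS, axis=1)
        first = AS[np.arange(n), idx].copy()
        AS[np.arange(n), idx] = -np.inf
        second = AS.max(axis=1)
        Rnew = S - first[:, None]
        Rnew[np.arange(n), idx] = S[np.arange(n), idx] - second
        R = damping * R + (1 - damping) * Rnew
        # a(i,k) = min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k)))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        col = Rp.sum(axis=0)
        Anew = np.minimum(0, col[None, :] - Rp)
        np.fill_diagonal(Anew, col - np.diag(Rp))
        A = damping * A + (1 - damping) * Anew
    return np.argmax(A + R, axis=1)   # exemplar (cluster center) for each point

# Two well-separated 1-D groups; negative squared distance as similarity.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = -np.abs(x[:, None] - x[None, :]) ** 2
np.fill_diagonal(S, np.median(S))     # preference = median similarity
labels = affinity_propagation(S)
```

Damping stabilizes the alternating updates; the preference on the diagonal of S controls how many exemplars emerge.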

According to the basic idea of the AP clustering algorithm, in order to effectively deal with large-scale data clustering problems, this paper designs a distributed AP clustering algorithm based on MapReduce. For the implementation of the algorithm (see Figure

Implementation of distributed AP clustering based on MapReduce.

The model of structured SVM can be described by

In formula (

In the Pegasos algorithm,

If the optimal solution of (

The above analysis shows that the data structural information is embedded into the Pegasos algorithm, and the optimal solution of the structured Pegasos algorithm is found in set B (see Formula (

Algorithm

The Pegasos algorithm for structured SVM is described as follows:

Input: Training data set

Output: The normal vector of the classification hyperplane is

Main procedure:

1. Calculate the covariance matrix

2. Initialize vector

3. For

3.1 Select the subset of the

3.2 Determine the learning efficiency of the gradient descent method

3.3

The sub-gradient direction of the objective function can be expressed as:

3.4 Update:

3.5 Projection steps:

4. Get the final result

Based on the Pegasos algorithm of structured SVM, the parallel processing of the structured Pegasos algorithm is implemented on the Hadoop platform by means of the MapReduce parallel framework model. The algorithm is divided into two stages: the parallel covariance calculation phase of the data samples and the parallel subgradient projection iteration phase. A separate MapReduce task is used for each iteration in both stages of the calculation.

To obtain the structural information of the sample data under the MapReduce framework model, the covariance matrix of the training samples scattered on the corresponding data nodes must be solved. The training set

The covariance matrix on the training set

The simplified form of formula (

With the aid of Formula (

//Map

Input: The sample on the current node;

Output:

Main procedure:

1. Scan the current node sample, accumulate the number of samples of the current node

2. Calculate

3. Calculate

//Reduce

Input: Each node

Output:

Main procedure:

1. Summarize the output of the Map node;

2. Find the covariance matrix
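The Map/Reduce steps above amount to combining per-node sufficient statistics (sample count, feature sums, and sums of outer products) into the global covariance matrix. A minimal single-process Python sketch of this combination follows; the node partitioning and function names are illustrative assumptions:

```python
import numpy as np

def map_node(X_part):
    """Mapper: per-node sample count, feature sums, and sum of outer products."""
    return X_part.shape[0], X_part.sum(axis=0), X_part.T @ X_part

def reduce_covariance(node_stats):
    """Reducer: combine per-node statistics into the global covariance matrix."""
    N = sum(n for n, _, _ in node_stats)
    s = sum(si for _, si, _ in node_stats)
    Q = sum(Qi for _, _, Qi in node_stats)
    mu = s / N
    return Q / N - np.outer(mu, mu)          # C = E[x x^T] - mu mu^T

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
parts = np.array_split(X, 3)                 # samples scattered over 3 "nodes"
C = reduce_covariance([map_node(p) for p in parts])
C_ref = np.cov(X, rowvar=False, bias=True)   # direct computation for comparison
```

Because each mapper only emits three small aggregates, the reduce step is cheap regardless of how many samples each node holds.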

In conjunction with Algorithm

//Map

Input:

Output: The current node obtained

Main procedure:

1. Randomly select

2. Define the zero vector

3. For

If

Then

4. Solve to get

//Reduce

Input: The current number of iterations

Output:

Main procedure:

1. Calculate

2. Calculate

3. Calculate:

4. Calculate:
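The iteration phase above can be sketched in a single Python process, with one mapper call per "data node" and a reducer that aggregates the subgradient contributions and performs the update and projection. The data, node count, and parameter values below are illustrative assumptions:

```python
import numpy as np

def map_subgradient(w, X_part, y_part, k, rng):
    """Mapper: draw k samples on this node and return the summed y*x over
    margin violators together with the number of samples drawn."""
    idx = rng.choice(len(X_part), size=min(k, len(X_part)), replace=False)
    Xb, yb = X_part[idx], y_part[idx]
    viol = yb * (Xb @ w) < 1
    return Xb[viol].T @ yb[viol], len(idx)

def reduce_update(w, contributions, lam, t):
    """Reducer: aggregate node contributions, take the subgradient step, and
    project w back onto the ball of radius 1/sqrt(lam)."""
    g_sum = sum(g for g, _ in contributions)
    k_tot = sum(k for _, k in contributions)
    eta = 1.0 / (lam * t)
    w = (1 - eta * lam) * w + (eta / k_tot) * g_sum
    norm = np.linalg.norm(w)
    radius = 1.0 / np.sqrt(lam)
    return w * (radius / norm) if norm > radius else w

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.where(X[:, 0] > 0, 1.0, -1.0)
parts = [(X[i::4], y[i::4]) for i in range(4)]     # 4 simulated data nodes
lam, w = 0.1, np.zeros(2)
for t in range(1, 301):                            # one MapReduce job per iteration
    contribs = [map_subgradient(w, Xp, yp, 8, rng) for Xp, yp in parts]
    w = reduce_update(w, contribs, lam, t)
acc = float(np.mean(np.sign(X @ w) == y))
```

On a real cluster each loop iteration corresponds to one MapReduce job, and the current w is broadcast to all data nodes before the map step.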

In this study, deep mining of the large-scale historical production data of the No. 1 blast furnace (3200 m^{3}) of Tangshan Iron and Steel Company from December 1, 2015, to November 30, 2016, was carried out to construct the prediction model of the [Si] content of molten iron during the smelting process.

Sample output variable data collection: a total of 26,000 hot metal samples were collected over the tapping cycles. To build a dynamic prediction model for the [Si] content of molten iron in the smelting process, the [Si] contents of the 26,000 collected molten iron samples were arranged in time series.

Sample input variable data collection: the factors affecting the [Si] content of molten iron at the tap hole are taken as sample input variables. Based on the reaction mechanism and control mechanism of the blast furnace iron-making process, the action diagram of the factors affecting the [Si] content of molten iron is shown in Figure

Action diagram of influencing factors of [Si] content in molten iron.

Here, [Si] content denotes the mass fraction of silicon in molten iron.

The sample set is composed of sample input and sample output. Based on the reaction mechanism and control mechanism in the blast furnace iron-making process, the 24 indicator factors shown in Figure

List of gray relational degrees between the alternative input indicators and [Si]%.

Variable symbol | Gray correlation | Greater than the threshold |
---|---|---|
X1 | 0.8702 | ✔ |
X2 | 0.8861 | ✔ |
X3 | 0.8860 | ✔ |
X4 | 0.8641 | ✘ |
X5 | 0.8737 | ✔ |
X6 | 0.8862 | ✔ |
X7 | 0.8772 | ✔ |
X8 | 0.8689 | ✘ |
X9 | 0.8864 | ✔ |
X10 | 0.8852 | ✔ |
X11 | 0.8860 | ✔ |
X12 | 0.8896 | ✔ |
X13 | 0.8811 | ✔ |
X14 | 0.8814 | ✔ |
X15 | 0.8666 | ✘ |
X16 | 0.8804 | ✔ |
X17 | 0.8199 | ✘ |
X18 | 0.8195 | ✘ |
X19 | 0.8842 | ✔ |
X20 | 0.8827 | ✔ |
X21 | 0.8848 | ✔ |
X22 | 0.8853 | ✔ |
X23 | 0.9160 | ✔ |
X24 | 0.8832 | ✔ |

In formula (

In Formula (
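A minimal sketch of how a gray relational degree of this kind can be computed is given below. The distinguishing coefficient rho = 0.5, the min-max normalization, and the synthetic series are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def gray_relational_degree(ref, candidates, rho=0.5):
    """Gray relational degree of each candidate series w.r.t. the reference
    series ([Si]%), with distinguishing coefficient rho."""
    def norm(v):
        return (v - v.min()) / (v.max() - v.min())   # map each series to [0, 1]
    r = norm(np.asarray(ref, dtype=float))
    delta = np.abs(np.stack([norm(np.asarray(c, dtype=float))
                             for c in candidates]) - r)   # |x0(k) - xi(k)|
    dmin, dmax = delta.min(), delta.max()
    xi = (dmin + rho * dmax) / (delta + rho * dmax)  # relational coefficients
    return xi.mean(axis=1)                           # degree per candidate series

rng = np.random.default_rng(0)
si = rng.random(100)                      # stand-in for the [Si]% series
tracking = si + 0.05 * rng.random(100)    # closely follows [Si]%
unrelated = rng.random(100)               # no relation to [Si]%
deg = gray_relational_degree(si, [tracking, unrelated])
# deg[0] > deg[1]: the tracking series has the higher relational degree
```

Indicators whose degree exceeds a chosen threshold are retained, mirroring the ✔/✘ screening in the table above.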

The blast furnace operation is a large-delay process, so there is a large time delay between the input indicator data and the corresponding [Si]% data. To improve the accuracy of the prediction model, it is necessary to identify the delay order between the inputs and the output to construct the most relevant sample set. In this study, the correlation coefficient analysis method is used to calculate the influence of the input variables on [Si]% at different time delays, where the time delays are set to 0T, 1T, 2T, 3T, 4T, and 5T in terms of the hot metal cycle. The smelting period T of the No. 1 (3200 m^{3}) blast furnace of Tangshan Iron and Steel is 6 ~ 8 hours. The correlation coefficient calculation results are shown in Table

Different time delay correlation coefficients under the input variables and the calculation results of the [Si]% list.

Input variable | 1T earlier | 2T earlier | 3T earlier | 4T earlier | 5T earlier |
---|---|---|---|---|---|
x1 | 0.2360 | 0.2044 | 0.1751 | 0.1316 | 0.0897 |
x2 | 0.0167 | 0.0078 | 0.0836 | 0.0530 | 0.0559 |
x3 | 0.0140 | 0.0099 | 0.0847 | 0.0542 | 0.0554 |
x4 | 0.4362 | 0.3407 | 0.2581 | 0.2073 | 0.1532 |
x5 | 0.0164 | 0.0077 | 0.0834 | 0.0534 | 0.0578 |
x6 | 0.2652 | 0.2740 | 0.2415 | 0.1498 | 0.1204 |
x7 | 0.1190 | 0.1033 | 0.0994 | 0.0969 | 0.0764 |
x8 | 0.2886 | 0.3512 | 0.3058 | 0.3123 | 0.3282 |
x9 | 0.3258 | 0.3580 | 0.3139 | 0.3113 | 0.3077 |
x10 | 0.3631 | 0.4234 | 0.3810 | 0.3659 | 0.3596 |
x11 | 0.3655 | 0.4163 | 0.3661 | 0.3657 | 0.3634 |
x12 | 0.0849 | 0.0562 | 0.0951 | 0.0470 | 0.0420 |
x13 | 0.0239 | 0.0194 | 0.0324 | 0.0238 | 0.0255 |
x14 | 0.0166 | 0.0626 | 0.1315 | 0.0403 | 0.0418 |
x15 | 0.0126 | 0.1462 | 0.1105 | 0.1133 | 0.1081 |
x16 | 0.0196 | 0.0499 | 0.1233 | 0.0867 | 0.0903 |
x17 | 0.0223 | 0.0532 | 0.1281 | 0.0886 | 0.0906 |
x18 | 0.5360 | 0.4059 | 0.2698 | 0.1854 | 0.1124 |
x19 | 0.2578 | 0.2852 | 0.2570 | 0.1989 | 0.1940 |
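The delay screening described above can be sketched as computing the correlation coefficient between each input series shifted by a whole number of smelting periods and the current [Si]% series. The synthetic data below, with [Si] responding two periods after the input, is an illustrative assumption:

```python
import numpy as np

def lagged_correlation(x, si, max_lag=5):
    """|corr| between the input series observed `lag` periods earlier
    and the current [Si]% series, for lag = 0..max_lag."""
    corrs = []
    for lag in range(max_lag + 1):
        xs = x[:len(x) - lag] if lag > 0 else x   # input from `lag` periods back
        corrs.append(abs(float(np.corrcoef(xs, si[lag:])[0, 1])))
    return corrs

# Synthetic check: si responds to x two smelting periods later.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
si = np.roll(x, 2) + 0.1 * rng.normal(size=500)   # si[t] ~ x[t-2] + noise
corrs = lagged_correlation(x, si)
best = int(np.argmax(corrs))                       # strongest correlation at lag 2
```

The lag with the highest correlation for each input variable determines the delay order used when pairing inputs with [Si]% outputs in the refined sample set.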

For the sake of convenience, first renumber the index after the gray relational model [

The 62.4 million sample input data were integrated according to the smelting period and placed in one-to-one correspondence with the [Si] content of molten iron, forming a sample set of size 26,000. Then, x1 ~ x19 were extracted as the sample input, and the refined sample set (sample size 25,996) with corresponding input and output was designed according to the correlation statistical analysis results in Table

In this paper, a parallel support vector machine algorithm is designed, and the refined sample set is used to build the [Si] content dynamic prediction model. The first 24,996 samples in chronological order are selected as the training set, and the remaining 1,000 samples as the test set.

Experiments were carried out on a personal computer and on a cloud computing platform. The personal computer has a 3.20 GHz processor with 8 GB of memory; the cloud computing platform consists of 1 master node server and 20 slave node servers. Each node processor is an Intel® Xeon® CPU E5620 at 2.40 GHz, the operating system is 64-bit Debian Linux, and the Hadoop platform version is hadoop-0.20.2.

The number of iterations in the Pegasos algorithm is

Algorithm performance indicators are divided into two categories: in the [Si] content dynamic prediction simulation, the indicators studied are the predicted hit rate (HR) and the learning time on the training sample set

Formulas (
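A hedged sketch of how a predicted hit rate of this kind can be computed, assuming HR is the share of test predictions whose absolute error falls within a tolerance (the tolerance values and toy data below are illustrative, not the paper's error-range definitions):

```python
import numpy as np

def hit_rate(y_true, y_pred, tol):
    """Predicted hit rate: share of predictions within `tol` of the true [Si]%."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)) <= tol))

def cumulative_hit_rates(y_true, y_pred, tols):
    """Cumulative hit rate over increasing absolute-error tolerances."""
    return [hit_rate(y_true, y_pred, t) for t in sorted(tols)]

# Toy [Si]% values; absolute errors are 0.005, 0.08, 0.0, 0.09, 0.005.
y_true = [0.45, 0.50, 0.42, 0.61, 0.38]
y_pred = [0.455, 0.58, 0.42, 0.52, 0.375]
rates = cumulative_hit_rates(y_true, y_pred, [0.01, 0.05, 0.10])
# rates == [0.6, 0.6, 1.0]
```

Tabulating per-band and cumulative rates in this way reproduces the structure of the hit-rate tables reported below.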

Based on the above experimental design criteria, the [Si] content in the blast furnace iron-making process is predicted, and the predicted hit rates of the three algorithms are shown in Table

Statistical table of dynamic prediction of the [Si]% value of three different algorithms.

Error range | SVM algorithm for the serial Pegasos solver | | The parallel SVM algorithm for Pegasos | | SVM algorithm for the structural parallel Pegasos solver | |
---|---|---|---|---|---|---|
 | Hit rate | Cumulative hit rate | Hit rate | Cumulative hit rate | Hit rate | Cumulative hit rate |
 | 33.2% | 33.2% | 30.1% | 30.1% | 32.9% | 32.9% |
 | 17.1% | 50.3% | 16.5% | 46.6% | 18.4% | 51.3% |
 | 18.4% | 68.7% | 11.7% | 58.3% | 14.2% | 65.5% |
 | 14.1% | 82.8% | 15.6% | 73.9% | 11.3% | 76.8% |
 | 2.5% | 85.3% | 10.4% | 84.3% | 7.4% | 84.2% |
 | 3.1% | 88.4% | 3.0% | 87.3% | 7.0% | 91.2% |
 | 11.6% | 100% | 12.7% | 100% | 8.8% | 100% |
Time (ms) | 105032926 (order of magnitude 10^{9}) | | 1921829 (order of magnitude 10^{7}) | | 1928397 (order of magnitude 10^{7}) | |

The predictive hit rate

Summary of the statistical results of the [Si]% rise-and-fall dynamic prediction performance of three different algorithms.

Algorithm | | | | | Time (ms) |
---|---|---|---|---|---|
SVM algorithm for the serial Pegasos solver | 77.50% | 72.31% | 77.90% | 74.92% | 9,004,328 |
The parallel SVM algorithm for Pegasos | 82.55% | 63.13% | 84.07% | 72.62% | 1,776,882 |
SVM algorithm for the structural parallel Pegasos solver | 92.2% | 85.47% | 86.33% | 85.90% | 1,710,788 |

The distribution of the prediction results around the actual values in the blast furnace [Si] content prediction is shown in Figure

Fluctuation of the dynamic prediction results in different regions around the true value.

Comparison of dynamic prediction results and actual values of [Si] content.

Comparing the algorithms, the structured parallel SVM algorithm not only preserves the excellent performance of SVM (reaching a 91.2% predicted hit rate within the absolute error bound) but also significantly improves the solution speed (about 54 times faster than the serial algorithm). In addition, a 92.2% hit rate was obtained in predicting the rise and fall of molten iron [Si], with a roughly 5-fold improvement in solution speed.

It is noteworthy that the silicon content prediction is an analog-quantity prediction with high precision requirements on numerical values, and the traditional serial algorithm takes a long time, so the parallel structured SVM algorithm improves the learning speed very effectively. The silicon content rise-and-fall prediction is a two-classification problem, which has low precision requirements on numerical values and focuses on pattern recognition, and the traditional serial algorithm takes less time, so the effect of the parallel structured SVM algorithm on learning speed is less pronounced. This research carried out systematic algorithm design and simulation experiments. The empirical results show that the speed-up obtained for the silicon content prediction algorithm is significantly greater than that for the silicon content rise-and-fall prediction algorithm.

The support vector machine is a machine learning algorithm based on the maximum interval theory. The biggest advantage of this algorithm is that it can avoid the curse of dimensionality with kernel functions and achieve maximum generalization performance by means of the structural risk minimization principle; the algorithm is mainly applicable to small-sample data. However, its performance in processing big data samples is poor, and the SVM solution step in particular is a serious bottleneck. Therefore, it is necessary to design a parallel SVM algorithm on the Hadoop platform to improve the solution speed. This study investigates the stochastic subgradient projection SVM algorithm and, based on the characteristics of its execution time, designs a stochastic subgradient algorithm based on AP clustering for the SVM solution.

In the experimental part, a data set of size 6240 × 10^{4} × 37 is processed, and the dynamic prediction result of the blast furnace [Si] content is obtained by applying the structured and parallelized stochastic subgradient algorithm to the refined sample set.

In view of the advantages of the algorithm in forecasting accuracy and solution speed, it is worth popularizing in the [Si] dynamic forecasting of blast furnace iron-making. The promotion route consists of the following three steps:

Based on the Hadoop platform, the SVM parallel algorithm is designed to train the sample data, and then the best [Si] content dynamic forecasting model is applied in practice.

Based on the practical effect of the target customers, the new data of the on-site production process is supplemented into the training samples to optimize the dynamic prediction model of the [Si] content.

The SVM parallel algorithm designed in this paper is deployed on a Hadoop platform configured at the blast furnace production site. Real-time data collection and real-time updating of the training sample set are realized, and the [Si] dynamic forecasting model is provided in real time.

The data used to support the findings of this study are available from the corresponding author upon request.

The authors declare that there is no conflict of interest regarding the publication of this paper.

This work was financially supported in part by the National Natural Science Foundation of China (51504080), in part by the Science and Technology Project of Hebei Education Department (BJ2017021), and in part by the Outstanding Youth Fund Project of North China University of Science and Technology (JQ201705).