An Empirical Study for Adopting Machine Learning Approaches for Gas Pipeline Flow Prediction

As industrial control technology continues to develop, modern industrial control is undergoing a transformation from manual control to automatic control. In this paper, we show how to evaluate and build machine learning models to predict the flow rate of a gas pipeline accurately. Compared with traditional practice driven by experts or rules, machine learning models rely little on field-specific expertise and extensive physical mechanism analysis. Specifically, we devised a method that can automate the process of choosing suitable machine learning algorithms and their hyperparameters by automatically testing different machine learning algorithms on the given data. Our proposed methods are used to choose the appropriate learning algorithm and hyperparameters to build the model of the flow rate of the gas pipeline. Based on this, the model can be further used for control of the gas pipeline system. The experiments conducted on real industrial data show the feasibility of building accurate models with machine learning algorithms. The merits of our approach include (1) little dependence on field-specific expertise and domain knowledge-based analysis; (2) easier implementation than physical models; (3) more robustness to environmental changes; (4) much lower computation resource requirements compared with physical models that call for complex equation solving. Moreover, our experiments also show that some simple yet powerful learning algorithms may outperform more complex algorithms on industrial control problems.


Introduction
Machine learning has been playing an increasingly important role in industrial control. In particular, an accurate model for estimating the state of a complex industrial system is essential for automatic control. As shown in Figure 1, the flow rate model is a key part of the comprehensive analysis and control system of natural gas pipelines. Traditionally, industrial models are often built on physical mechanism analysis and industrial expertise, which are called physical models in this paper. Nevertheless, it is costly to build physical models, which are based on extensive theoretical and experimental analysis, and some physical models require massive computational resources to calculate their results. In our problem, building an accurate model to calculate the flow rate of gas pipelines requires knowledge of hydromechanics. Moreover, calculating the flow rate of gas pipelines requires solving many complicated flow equations, making it very difficult to control the pipeline system in real time. Adding more relevant factors, such as the shape of the pipeline in our problem, into a physical model means much more analysis work. Therefore, some physical models omit relevant factors to keep the model simple; as a result, they are not so accurate or robust to environmental changes. Building statistical models based on machine learning algorithms requires much less field-specific expertise, and the modeling process can be automated by computer. In particular, one can take more relevant factors into consideration to build models that are more accurate and more adaptive to environmental changes.
There are plenty of existing machine learning algorithms, and choosing a suitable one is essential for building an accurate and robust model. Network architecture search (NAS) [1,2] can automatically design a neural network architecture and choose the proper hyperparameters, but NAS methods are restricted to neural network algorithms. Although methods based on neural networks and deep learning have shown excellent performance on some complex problems such as image recognition and game playing [3,4], these methods often cannot achieve satisfactory performance on some relatively simple problems. Our method is not restricted to neural networks, so one can consider learning algorithms across a much wider spectrum.
Machine learning methods are widely used in industrial control problems. Work [5] proposed an SVM-based method to predict the draining time of oil pipelines. The GBDT algorithm has been used to diagnose gas sensor faults and predict the power load of the grid. Artificial neural network technology is also a popular method in industrial control [6]. Unsupervised learning is often combined with supervised learning to produce better performance [7,8], and Yang et al. [9] proposed a novel unsupervised learning method based on a dual autoencoder network. A carefully selected training set can filter out anomalies, and a feature selection procedure can remove noisy features; applying these two preprocessing methods can yield a more stable and powerful model. Papers [10,11] used feature selection methods in industrial control, and paper [12] introduced a technique that can perform training sample selection and feature selection simultaneously.

Materials and Methods
This paper is focused on an empirical study of applying popular machine learning methods to the flow prediction problem, which is seldom addressed by data-driven learning models in the literature. Note that in this application-oriented paper, we have not devised a methodologically new approach; instead, we resort to a comprehensive study of the performance of existing methods. Nevertheless, it is still worthwhile that we have adopted the GBDT stacking model and compared it with the baseline GBDT, as shown in Figure 2.
Specifically, linear regression, neural networks, random forest, support vector machines, and K nearest neighbors are evaluated in this work. To select the proper algorithm and hyperparameters to model the flow rate of the gas pipeline, we split the flow rate data into a training set and a testing set. We use the training set to train different machine learning models and report their performance on the testing set. As shown in Figure 3, we designed neural network [13] models with several layers; each layer contains a linear transform operation and a batch normalization [14] operation and passes its output through an activation function. We tested several different configurations to find a good neural network architecture.
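The train/test comparison procedure above can be sketched with scikit-learn. The candidate models, their hyperparameters, and the synthetic stand-in data below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

# Synthetic stand-in for the sensor features and flow-rate labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))
y = X @ rng.normal(size=8) + 0.1 * rng.normal(size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train each candidate on the training set, score it on the testing set.
candidates = {
    "linear": LinearRegression(),
    "gbdt": GradientBoostingRegressor(random_state=0),
    "knn": KNeighborsRegressor(n_neighbors=5, weights="distance"),
    "svr": SVR(kernel="rbf"),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = mean_squared_error(y_test, model.predict(X_test))

best = min(scores, key=scores.get)  # pick the lowest-error algorithm
```

The same loop extends to hyperparameter grids by enumerating multiple preset configurations per algorithm.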

Training of Neural Network Models.
We split the training set into several minibatches to reduce the memory requirement. We adopted Adam [15] as the optimizer, set the initial learning rate to 0.003, and reduced the learning rate by a factor of 0.5 every 5000 steps. We trained each model for a total of 500 epochs.
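A minimal PyTorch sketch of this training setup follows (Adam, initial learning rate 0.003, halved every 5000 steps, minibatches). The network shape, batch size, and stand-in tensors are illustrative assumptions; the paper trains for 500 epochs rather than the 3 used here.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in tensors for the sensor features and flow-rate labels.
X_train = torch.randn(256, 8)
y_train = torch.randn(256, 1)

# One linear + batch-norm + activation block per layer, as in Figure 3.
model = nn.Sequential(
    nn.Linear(8, 32), nn.BatchNorm1d(32), nn.LeakyReLU(),
    nn.Linear(32, 1))

optimizer = torch.optim.Adam(model.parameters(), lr=0.003)
# Halve the learning rate every 5000 optimizer steps.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=5000, gamma=0.5)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(X_train, y_train), batch_size=64, shuffle=True)

for epoch in range(3):  # illustrative; the paper uses 500 epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
        scheduler.step()
```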

Gradient Boosting Decision Tree.
Random forest [16] models are composed of many decision trees. In a decision tree, the algorithm determines which child node to move to based on the input attributes and repeats this step until it reaches a leaf node, outputting the value stored in that leaf as the prediction. However, the capability of a single decision tree is limited, and it is difficult for a single tree to capture complex relationships between features and labels. To solve this problem, several decision trees can be composed to learn complex relationships; random forest is one such ensemble of decision trees. Gradient Boosting Decision Tree (GBDT) [17,18] is another ensemble of decision trees, in which each new decision tree learns the residual error of all previous decision trees. When adding the $m$-th decision tree, we want to fit the parameters $\theta_m$ of this tree to satisfy the following condition:

$$\theta_m = \arg\min_{\theta_m} \sum_{i=1}^{N} L\Big(y_i,\ \sum_{k=1}^{m-1} f_k(x_i) + f_m(x_i; \theta_m)\Big),$$

where $f_m(x_i; \theta_m)$ is the function corresponding to the decision tree with parameters $\theta_m$, $\sum_{k=1}^{m-1} f_k(x_i)$ is the function determined by the first $m-1$ decision trees, and $L(y_i, \hat{y}_i)$ is the loss function used to measure the performance of the model. The output of the model composed of $M$ decision trees is

$$\hat{y}_i = \sum_{k=1}^{M} f_k(x_i).$$
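The residual-fitting scheme described above can be written out directly: under squared-error loss, each new tree is simply fit to the current residual. The data, depth, and number of trees below are illustrative assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)

learning_rate = 0.1
F = np.zeros(500)        # running ensemble prediction F_{m-1}(x)
trees = []
for m in range(100):
    tree = DecisionTreeRegressor(max_depth=3, random_state=0)
    tree.fit(X, y - F)   # fit the residual error of all previous trees
    F += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - F) ** 2)  # training error of the full ensemble
```

The `learning_rate` factor is the shrinkage discussed in the experiments: it controls how much of the residual each new tree is allowed to eliminate.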

Stacking of Gradient Boosting Decision Trees.
Although the gradient boosting decision tree works well in many applications, it is not suitable for many others, such as image classification and speech recognition. Prior work points out that these drawbacks arise because ensemble decision trees are shallow models that cannot perform representation learning, and tries to mitigate this problem by stacking several layers of ensemble decision trees. As illustrated in Figure 2, the first several layers work as feature transformers: instead of aggregating the results of each decision tree, the result of each decision tree is fed to the next layer as features.

Support Vector Regression.
Support vector regression (SVR) [19] is a regression algorithm developed from the support vector machine algorithm [20]. In support vector machine algorithms, we want to maximize the minimum distance of each data point from the hyperplane, whereas in support vector regression algorithms, we want to minimize the maximum distance of each data point from the hyperplane. Figure 4 illustrates the difference and relationship between the SVM and SVR algorithms.
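As a concrete illustration of SVR, the sketch below fits the algorithm with three common kernels; the paper does not name the kernels it compares, so the choice of linear, polynomial, and RBF kernels, along with the synthetic data, is an assumption.

```python
import numpy as np
from sklearn.svm import SVR

# Illustrative nonlinear 1-D regression problem.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# Fit one SVR model per kernel and record the training error of each.
models = {k: SVR(kernel=k).fit(X, y) for k in ("linear", "poly", "rbf")}
errors = {k: np.mean((y - m.predict(X)) ** 2) for k, m in models.items()}
```

On nonlinear data like this, the RBF kernel typically fits far better than the linear kernel, since the latter reduces SVR to a linear model.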

K-Nearest Neighbor Regression.
K-nearest neighbor regression [21] is a regression algorithm that predicts the result based on the ground truths of the K neighbors closest to the given data point. We can take the average of the K nearest neighbors' ground truths as the prediction, or we can weigh each neighbor's ground truth by the distance between the data point and that neighbor.
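The two weighting schemes just described correspond to scikit-learn's `weights="uniform"` and `weights="distance"` options; the data and neighbor count below are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative 1-D data with a known linear trend y ≈ 2x.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# Plain average of the 5 nearest neighbors' labels.
uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X, y)
# Labels weighted by inverse distance to the query point.
weighted = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)

x_query = np.array([[5.0]])
pred_uniform = uniform.predict(x_query)
pred_weighted = weighted.predict(x_query)
```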

Performance Measure.
We use mean squared error (MSE), the coefficient of determination ($R^2$), and mean relative error (MRE) to measure the performance of our machine learning models. The definition of mean squared error is given by

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2,$$

where $N$ is the size of the dataset, $y_i$ is the ground truth of the $i$-th data item, and $\hat{y}_i$ is the prediction given by the machine learning model [22]. The coefficient of determination ($R^2$) is defined as

$$R^2 = 1 - \frac{\operatorname{var}(y - \hat{y})}{\operatorname{var}(y)},$$

where $\operatorname{var}(y)$ is the variance of the labels and $\operatorname{var}(y - \hat{y})$ is the variance of the error between predictions and labels. $R^2$ describes how much of the variance can be explained by the model. The definition of mean relative error (MRE) is

$$\mathrm{MRE} = \frac{1}{N} \sum_{i=1}^{N} \frac{|y_i - \hat{y}_i|}{y_i}.$$

Input Data and Label.
We obtained the data from the sensors deployed in our monitoring systems. The distribution of the standard flow rate is shown in Figure 5; from the figure, we can see that most standard flow rate values are around 30000, but there are also some values distributed around 15000.
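The three measures above can be written out directly in numpy; the small arrays at the end are illustrative stand-ins for ground truths and predictions.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of squared prediction errors."""
    return np.mean((y_true - y_pred) ** 2)

def r2(y_true, y_pred):
    """Coefficient of determination: share of label variance explained."""
    return 1.0 - np.var(y_true - y_pred) / np.var(y_true)

def mre(y_true, y_pred):
    """Mean relative error: average error relative to the ground truth."""
    return np.mean(np.abs(y_true - y_pred) / np.abs(y_true))

# Illustrative example values.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.1, 1.9, 3.2, 3.8])
```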

Training Set and Testing Set Splitting.
The dataset contains 109741 items in total, and we split it into a training set with 103741 items and a testing set with 6000 items.

Data Normalization.
We subtract the mean value from the dataset and divide the result by the standard deviation to generate the normalized dataset. We compute the mean value and the standard deviation on the training set. The data normalization process can be described by the following equation:

$$\hat{x}_i = \frac{x_i - \mu}{\sigma},$$

where $\mu$ is the mean value of the training set and $\sigma$ is the standard deviation of the training set, for each $x_i$ in the training set and testing set.
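A short sketch of this step, emphasizing that the statistics come from the training set only and are reused unchanged on the testing set; the arrays are illustrative.

```python
import numpy as np

# Illustrative stand-ins for training and testing data.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 25.0]])

mu = train.mean(axis=0)     # mean value of the training set
sigma = train.std(axis=0)   # standard deviation of the training set

train_norm = (train - mu) / sigma
test_norm = (test - mu) / sigma  # same training statistics applied to test data
```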
Benchmark.
We take the mean value of the labels and the linear regression model as benchmarks and compare them with the other models.

Linear Models.
Among all tested methods, linear models are the simplest and have fewer parameters than the other models. Models with fewer parameters are less prone to overfitting but may be incapable of modeling complicated relationships between inputs and labels. We tested several different linear models with parameter regularization: Lasso regression [23] is linear regression with L1 regularization on the parameters, and ridge regression [24] is linear regression with L2 regularization on the parameters. The accuracies of the linear models are worse than those of the other methods except SVR, but linear models have the merit of minimal computation resource requirements. The experiments with different regularization strengths suggest that the linear models are already simple enough, and adding additional regularization hurts their performance on this problem. The results are shown in Figure 6 and Table 1.
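The regularization-strength sweep reads as follows in scikit-learn; the data and the `alpha` values tried are illustrative assumptions, not the paper's grid.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Illustrative data with a known sparse linear relationship.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.05 * rng.normal(size=300)

for alpha in (0.001, 0.1, 10.0):
    lasso = Lasso(alpha=alpha).fit(X, y)  # L1-regularized linear regression
    ridge = Ridge(alpha=alpha).fit(X, y)  # L2-regularized linear regression
```

Larger `alpha` values shrink the coefficients more aggressively; a sufficiently strong L1 penalty drives coefficients to exactly zero, while the L2 penalty only shrinks them toward zero.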

GBDT.
We chose mean squared error as the loss function and tested several different learning rates and maximum numbers of leaves in each decision tree. The results for several different maximum numbers of leaf nodes per decision tree are given in Figure 7 and Table 2. The more leaves in each decision tree, the more sophisticated the functions between input and output that GBDT models can learn. In this application, we found that GBDT models with more leaves yield better accuracy, but the improvement is negligible once the number of leaves exceeds 10000. We also tested how different learning rates influence the performance of GBDT models. The learning rate controls how much of the residual error is eliminated when adding a new decision tree to the GBDT model. The results given in Figure 8 and Table 3 suggest that setting the learning rate too low or too high hurts the accuracy of GBDT models.

Stacking of GBDT.
We also tested the performance of stacked GBDT models with different learning rates. The results given in Figure 9 and Table 4 do not show an improvement over normal GBDT models, contrary to our expectation. This may be the result of overfitting of the stacked GBDT models, which have many more parameters than normal GBDT models.

KNN.
We conducted a series of experiments to study the influence of the number of selected neighbors and of the averaging method used when calculating the prediction from the nearest neighbors' ground truths. The results are shown in Figure 10 and Table 5. From the results, we find that the accuracy gets worse as the number of neighbors increases when simply averaging the labels of the selected neighbors. This problem disappears when the distance between each selected neighbor and the input is used to weight that neighbor's label.

SVR.
We tested three kernel functions to find out which one is most suitable when applying the SVR algorithm to this problem. The results given in Figure 11 and Table 6 are even worse than those of the linear models. SVR models with linear kernels are simply linear models whose optimization objective differs from that of the aforementioned linear models: the objective of the SVR algorithm is to minimize the maximum divergence between the ground truth and the predicted value, and we do not adopt this criterion when evaluating our models.

Neural Network.
We conducted several experiments to investigate how the performance of neural network models changes with different numbers of layers. The results are shown in Figure 12 and Table 7. In our experiments, we found that increasing the number of layers reduces the error, but the error grows again after too many layers are added to the neural network models.
We also conducted several experiments to investigate how the performance of neural network models depends on the number of units in each layer. The results are shown in Figure 13 and Table 8. We found that adding more units to each layer reduces the average absolute error consistently, but the average relative error first decreases and then increases when too many units are added. We also tested several neural network models with different activation functions. The results are shown in Figure 14 and Table 9. An interesting phenomenon in this set of experiments is that models using Leaky ReLU as the activation function perform much better than those using other activations.

Discussion
We give a comparison of the different models in Table 10. From the table, we can conclude that the GBDT algorithm yields the best result among all the methods tried. By comparing the performance of different hyperparameter settings of GBDT models, we discover that a carefully selected hyperparameter setting can improve performance significantly. This procedure is time-consuming if done manually, so we automated it by testing different preset hyperparameters. The results of the stacked GBDT models are very close to those of the GBDT models, but stacked GBDT models are much more complex and time-consuming than simple GBDT models, so a simple GBDT model is the better choice for this problem. KNN regression is a simple yet powerful method for this problem, and its performance is better than that of the neural network models we tried. This result suggests that neural network models may not be a wise choice for simple tabular datasets. The results of the linear models and SVR models show that the relationship between input and output in this problem cannot be captured by linear models. The SVR models yield the worst performance among all tested methods, even with kernels that can map the input features into higher dimensions. When using the neural network algorithm, the Leaky ReLU activation function is recommended; it outperforms the other activation functions by a large margin.

Conclusion and Future Work
In this paper, we have presented a comprehensive empirical study of the performance of different popular machine learning models for the task of gas pipeline flow rate prediction. For future work, we are going to explore the adoption of temporal point processes [25][26][27][28][29] for relevant learning tasks in gas pipeline systems, given their dynamic nature. We will also explore structure information [30][31][32] to improve flow prediction from the graph computing perspective.

Conflicts of Interest
The authors declare that they have no conflicts of interest.