A Peak Prediction Method for Subflow in Hybrid Data Flow

Subflow prediction is required in active elastic resource scaling, but existing single-flow prediction methods cannot accurately predict the peak variation of subflows in a hybrid data flow, because they do not consider the correlation between subflows. The difficulty is that it is hard to calculate the correlation between different data flows in a hybrid data flow. To solve this problem, this paper proposes a new method, DCCSPP (subflow peak prediction of hybrid data flow based on delay correlation coefficients), to predict the peak value of hybrid data flow. First, we establish a delay correlation coefficient model based on a sliding time window to determine the delay time and the delay correlation coefficient. Next, based on this model, a subflow peak prediction model and algorithm for hybrid data flow are established to achieve accurate peak prediction of subflows. Experiments show that, compared with LSTM, our method decreases MAE by about 18.36% and RMSE by 13.50%. Compared with linear regression, MAE and RMSE are decreased by 27.12% and 25.58%, respectively.


Introduction
Hybrid data flows are widely used in practical applications. For example, Alibaba's e-commerce platform uses a large-scale hybrid technology that mixes online services with offline tasks. The hybrid data flow consists of online services and offline tasks, which enter the cluster at the same time and reduce cost without affecting service quality. Flow peak prediction is important in the active elastic expansion of the system [1]. Lombardi et al. [2] propose a novel elastic scaling approach named ELYSIUM, which contains the "predictionInputLoad" method to predict the maximum load. Bauer et al. [3] describe a new hybrid autoscaling mechanism called Chameleon, which employs on-demand, automated time series-based forecasting methods in combination to predict the arriving load intensity. Hirashima et al. [4] give a new autoscaling mechanism that changes the scale of the target system based on the predicted workload.
In the active elastic scaling of flow processing systems, there are some studies on peak flow prediction. The existing prediction methods regard network flow as a whole. There are some traditional methods for network flow prediction, such as the ARIMA linear model and the wireless network flow prediction model based on combinatorial optimization theory. Meanwhile, with the development of neural networks, prediction models based on machine learning algorithms, such as the support vector machine (SVM), have appeared. Some authors use neural network models such as RNN [5], the NARX recursive neural network model, LSTM [6], and GRU to predict network peak flow. These prediction models can explain the randomness and periodicity of flow well.
However, the above methods are based on single flow prediction and do not consider the possible correlation between individual flows in a hybrid data flow. Therefore, to account for the influence of data correlation on peak flow prediction, this paper proposes a flow prediction method named DCCSPP (subflow peak prediction of hybrid data flow based on delay correlation coefficients). We establish a delay correlation coefficient model to resolve the correlation uncertainty between different subflows, and we apply the correlation influence between subflows on top of the single flow prediction results. The more accurate the prediction of flow peaks, the more reliable the system flow information that is obtained, which provides better indexing parameters for the system's elastic scaling.

Related Work
In recent years, flow predictions based on time series have always been an attractive research area. Developing predictive models plays an important role in interpreting complex real-world elements [7].
Many traditional learning methods are used for time series prediction. Zhang et al. [1] propose an agile perception method to predict abnormal behavior. Yu et al. [8] describe an ARIMA linear model to predict network flow sequences. To address the problem that a single model cannot fully describe change characteristics, Chen and Liu [9] propose a wireless network flow prediction model based on combinatorial optimization theory. Liu et al. [10] give online learning algorithms for estimating ARIMA models under relaxed assumptions on the noise terms. Adebiyi et al. [11] examine the forecasting performance of ARIMA and artificial neural network models. Wu and Wang [12] investigate time series prediction algorithms that combine nonlinear filtering approaches with the feedforward neural network (FNN). Joo and Kim [13] propose a forecasting method based on wavelet filtering. Han et al. [14] introduce a multioutput least square support vector regressor. Chandra and Al-Deek [15] discuss a vector autoregressive model for short-term flow prediction on freeways. Conventional techniques for time series prediction are limited in their ability to process big data with high dimensionality and to represent complex functions efficiently. If the amount of linear data is not too large, the statistical method is reliable enough to be used for prediction. However, the generated model is very complex and difficult to apply to nonlinear data, so the prediction results are not very accurate when the data are massive.
Deep learning-based models have been successfully applied to time series prediction in many fields, and many prediction models based on machine learning have been proposed. Haviluddin and Alfred [16] introduce a NARX recursive neural network model to predict network flow. Nie et al. [17] propose a novel network flow prediction method based on a deep belief network (DBN) and a logistic regression model. In [18], neural network models such as RNN [5], LSTM [6], and GRU are used for network flow prediction. Hoermann et al. [19] report a deep CNN model for dynamic occupancy grid prediction with data from multiple sensors. The advantage of Gaussian processes lies in their ability to model the uncertainty hidden in data by predicting distributions [20]. Deep learning-based models are good at discovering intricate structure in large data sets [7]. These prediction models can explain the randomness and periodicity of flow well.
As mentioned above, these methods all target single flow prediction and do not consider the possible correlation between data flows in a hybrid flow. Research on flow prediction for hybrid data flows is lacking. Therefore, this paper mainly studies the correlation between different subflows in the hybrid flow and the peak prediction of each subflow.

Delay Correlation Coefficient Model Based on Sliding Time Window
In hybrid data flows, there are different degrees of correlation between different subflows. Considering the correlation between subflows and the pseudocorrelation caused by time analysis, this paper proposes a delay correlation coefficient model, which adds a sliding time window to the Pearson correlation coefficient and time difference analysis [21]. This model calculates the delay correlation coefficient and the delay time difference between different subflows. Based on the delay coefficient, the data flow that has an influence on the prediction of the target subflow is filtered out. Correlation analysis [21] refers to measuring the closeness between two or more related variables. The elements must have a certain connection or probabilistic relationship for correlation analysis to be conducted.
The Pearson correlation coefficient, also known as the Pearson product-moment correlation coefficient, represents the linear correlation between two sets of variables X and Y. The formula is shown as follows:

$$\operatorname{Cov}(X, Y) = E\big[(X - E[X])(Y - E[Y])\big] \qquad (1)$$

Equation (1) is the covariance formula. The covariance is divided by the standard deviations of the two related variables to obtain the Pearson correlation coefficient, which is described in formula (2). This compensates for the weak representation of the covariance value in the degree of random variable correlation:

$$\rho_{X,Y} = \frac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} \qquad (2)$$

The Pearson correlation coefficient always lies in [−1, 1]. The closer the coefficient is to either extreme, the stronger the linear relationship between the two random variables. If the coefficient is close to 0, the two variables are not linearly related. If the coefficient approaches 1, X and Y can be well described by a straight line equation, all data points fall close to a straight line, and X increases as Y increases. A coefficient approaching −1 means that all data points fall close to a straight line, and X decreases as Y increases.
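As a minimal sketch of the coefficient described above, the following computes the sample Pearson correlation of two equal-length sequences using only the standard library. The function name `pearson` and the choice to return 0.0 for a constant series are assumptions made for illustration, not part of the paper.

```python
import math


def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of co-deviations: proportional to the covariance.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations: proportional to the variances.
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant series is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)
```

The common factor 1/n cancels between the covariance and the standard deviations, so the sums can be used directly.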
In the flow processing system, the input data is generally composed of multiple subflows, which we call a hybrid data flow. This article defines the hybrid data flow as follows. The hybrid data flow in the k-th period is a collection of pairs $(t_i, a_j)$, where $n$ indicates that there are $n$ kinds of data flows and $(t_i, a_j)$ indicates that a data item belonging to the $j$th data flow arrives at the system at time $t_i$.
The data set constituting a business is $U = \{a_1, a_2, \ldots, a_z\}$, where $z$ indicates that the data set of the service consists of $z$ kinds of data flows. Thus, service correlation exists in these data. An example is a hybrid data flow consisting of device login information and user behavior information: the flow of user behavior information is affected by the flow of device login information, and the two have a partial-order relationship. Since different service data flows require different processing operations and computing resources, it is necessary to perform shunt operations on the data of the hybrid data flow, as shown in Figure 1.
Through the statistics of the discrete hybrid data, the observation sequence of each subflow is obtained. A set of hybrid data flow observation sequences composed of the subflow observation sequences is defined. The size of the sliding time window is $h$, as shown in Figure 3.
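The shunting and counting steps described above can be sketched as follows. This is only an illustration under stated assumptions: records are taken to be `(timestamp, flow_id)` pairs, arrivals are counted in fixed-length intervals, and the names `split_and_count`, `interval`, and `num_bins` are hypothetical.

```python
from collections import defaultdict


def split_and_count(records, interval, num_bins):
    """Split (timestamp, flow_id) records by flow and count arrivals per interval.

    Returns a dict mapping each flow_id to its observation sequence,
    a list of num_bins arrival counts.
    """
    series = defaultdict(lambda: [0] * num_bins)
    for timestamp, flow_id in records:
        index = int(timestamp // interval)
        if 0 <= index < num_bins:  # ignore records outside the observed period
            series[flow_id][index] += 1
    return dict(series)
```

For example, four records from two flows observed over three unit intervals yield one count sequence per flow:

```python
records = [(0.5, "login"), (1.2, "behavior"), (1.8, "behavior"), (2.1, "login")]
sequences = split_and_count(records, interval=1.0, num_bins=3)
```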
Definition 7. The correlation coefficient of $m_i$ and $m_j$ when the delay time is $e$ is $d\rho(m_i, m_j)_e$. The calculation formula of $d\rho(m_i, m_j)_e$ is described in the following formula:
Figure 4.
When predicting $m_i$, it is necessary to select the data flow $m_k$ ($1 \le k \le n$ and $k \ne i$) with the highest delay correlation for the auxiliary prediction. The selection formula of $m_k$ is as follows: Algorithm 1 gives the pseudocode of the algorithm for selecting the auxiliary data flow.
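The selection of an auxiliary flow can be sketched roughly as below. This is not the paper's Algorithm 1 but an illustration under several assumptions: the target window is taken as $x_{t-h+1}, \ldots, x_t$ and the delayed window as $y_{t-e-h+1}, \ldots, y_{t-e}$; delays are searched over $1 \le e \le$ `max_delay`; and "highest" is read as the largest signed coefficient. All function and parameter names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def delay_correlation(x, y, t, h, e):
    """Correlate x[t-h+1..t] with y[t-e-h+1..t-e] (window size h, delay e)."""
    start = t - e - h + 1
    if start < 0 or t >= len(x) or t - e >= len(y):
        raise ValueError("window does not fit inside the series")
    return pearson(x[t - h + 1 : t + 1], y[start : t - e + 1])


def best_delay(x, y, t, h, max_delay):
    """Return (delay, coefficient) with the highest coefficient, or (None, None)."""
    best_e, best_rho = None, None
    for e in range(1, max_delay + 1):
        if t - e - h + 1 < 0:
            break  # the delayed window would start before the series does
        rho = delay_correlation(x, y, t, h, e)
        if best_rho is None or rho > best_rho:
            best_e, best_rho = e, rho
    return best_e, best_rho


def select_auxiliary_flow(flows, target, t, h, max_delay):
    """Return (name, delay, coefficient) of the flow most correlated with the target."""
    best = (None, None, None)
    for name, series in flows.items():
        if name == target:
            continue
        e, rho = best_delay(flows[target], series, t, h, max_delay)
        if rho is not None and (best[2] is None or rho > best[2]):
            best = (name, e, rho)
    return best
```

Restricting the search to $e \ge 1$ reflects the assumption that the auxiliary flow must be observed before the target value it helps to predict.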

Hybrid Data Flow Subflow Peaking Prediction Model
The selected data flow $m_i$ (i.e., $X$) is separately predicted by a single flow prediction method, and an initial prediction is obtained; $x_t'$ represents the initial prediction result for the value $x_t$ at time $t$ in $X$.
Definition 9. The variation in $x$ at time $t$ is $\Delta x_t$. $\Delta x_t$ represents the difference between the single prediction results at time $t$ and time $t - 1$. The calculation formula is as follows:
$$\Delta x_t = x_t' - x_{t-1}'$$
Definition 10. The amount of change in $y$ at time $t$ is $\Delta y_t$. $\Delta y_t$ represents the difference between the observed values at time $t - e$ and time $t - e - 1$. The calculation formula is as follows:
$$\Delta y_t = y_{t-e} - y_{t-e-1}$$
Definition 11. To scale the range of $y$ to the same level as the range of $x$, we define $\mathrm{pro}_{t-1}$, which is described as follows:
Definition 12. At time $t$, the final prediction result of $x_t$ is $x_t''$. The calculation formula is as follows: where $\alpha$ represents the weight of the correlation coefficient, and the calculation formula is as follows:
Algorithm 2 gives the pseudocode for the hybrid data flow correlation prediction algorithm.
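The two change terms in Definitions 9 and 10 can be illustrated with a short sketch. It covers only these two differences, as stated in the text; the scaling factor and the weighted final prediction are not sketched here. The function names and the list-based representation are assumptions.

```python
def delta_x(initial_pred, t):
    """Change in the single-flow prediction between time t-1 and time t."""
    if t < 1 or t >= len(initial_pred):
        raise IndexError("t must satisfy 1 <= t < len(initial_pred)")
    return initial_pred[t] - initial_pred[t - 1]


def delta_y(observed_y, t, e):
    """Change in the auxiliary flow between time t-e-1 and time t-e (delay e)."""
    if t - e - 1 < 0 or t - e >= len(observed_y):
        raise IndexError("t - e - 1 and t - e must be valid positions")
    return observed_y[t - e] - observed_y[t - e - 1]
```

The explicit bounds checks avoid Python's negative indexing silently wrapping around to the end of the list.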
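The evaluation indexes in this paper are root mean square error (RMSE) and mean absolute error (MAE). The calculation formulas are as follows:

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| \hat{x}_t - x_t \right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{t=1}^{N} \left( \hat{x}_t - x_t \right)^2}$$

where $x_t$ is the observed value, $\hat{x}_t$ is the predicted value, and $N$ is the number of evaluated points.
The smaller the mean absolute error is, the more accurate the prediction result is. The smaller the root mean square error is, the fewer the abnormal discrete points are, and the higher the prediction accuracy is.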
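Both indexes follow their standard definitions and can be computed with a few lines of standard-library Python. The function names below are chosen for illustration.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))
```

Because RMSE squares each error before averaging, a few large deviations raise it much more than they raise MAE, which is why it is sensitive to abnormal points.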

Data Set.
To analyze the prediction performance of the method proposed in this paper, we select the device login data and behavior acquisition data provided by the mobile phone app of a credit company in three periods of three months. We collect 13,567 device login records and 282,685 behavior records in a certain period of June as data set 1, as shown in

Scientific Programming
Each subset contains 4465 observations. From Figures 5-10, we can see that the trends of the device login statistics and the behavior collection statistics are close, and there is a correlation between them. In the experiment, the results predicted by LSTM and by the unary linear regression model serve as the control group, and the results of our model serve as the experimental group. Finally, we compare their overall prediction error indicators and their peak prediction error indicators.

Compared with LSTM Prediction Method.
In this paper, the first 90% of the observed values of each data set are selected as the training set to train the LSTM model, and the last 10% are used as the test set to analyze the predictive ability of the model. The overall prediction results on the test sets of data set 1, data set 2, and data set 3 are shown in Figures 11-13. The prediction results for a period with high observed values in data set 1, data set 2, and data set 3 are shown in Figures 14-16. DCCSPP needs a full time window of observations for its calculation, so before the 90th observation the method cannot give a prediction result, and the value is set to 0.
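A 90%/10% split of a time series is usually done in chronological order so that the test set follows the training set in time. The sketch below assumes that this is how the split was made; the function name `chronological_split` is hypothetical.

```python
def chronological_split(series, train_fraction=0.9):
    """Split a time series into (train, test) without shuffling."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(series) * train_fraction)  # index of the first test point
    return series[:cut], series[cut:]
```

Keeping the order intact prevents future observations from leaking into the training set.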
We also examine the influence of the time window size; the experimental results on data set 2 are shown in Figure 17.
Figures 14-16 show that, compared with the LSTM model, the results of DCCSPP follow the observed values more closely.
Figure 17 shows that the choice of time window has a certain influence on the prediction results: a window that is too small or too large degrades them. Therefore, except for data set 3, this article selects 90 as the size of the time window. On data set 3, the prediction method obtains better results when the time window size is 240. The errors of the prediction results for data set 1, data set 2, and data set 3 are shown in Table 1. The prediction method has the most obvious improvement on data set 1, where MAE and RMSE decrease by 13.46% and 17.80%, respectively. We also found that the smaller values in the test set of data set 2 make its MAE and RMSE smaller than those of the other data sets. Overall, the results show that the accuracy of the prediction results can be improved by applying the correlation coefficient algorithm on top of the prediction results of the LSTM model. This paper also compares the error indexes of the prediction results at multiple maximum peak points in data set 1, data set 2, and data set 3, and the results are shown in Table 2. The table illustrates that the peak prediction on the test set of data set 1 is not accurate because of unfavorable data in its training set. The method proposed in this paper can significantly improve the peak prediction indexes there, with MAE and RMSE improving by 41.46% and 33.79%, respectively. In data set 2 and data set 3, MAE decreases by about 12.83% on average. However, the improvement in RMSE is limited, with an average improvement of 3.3%. In conclusion, the method proposed in this paper can improve the final peak prediction results.

[Figure: observation, simple linear regression, and DCCSPP prediction curves; axes: time series, data size.]
The experimental results show that the MAE and RMSE values are decreased by 15% to 26%. In conclusion, the method proposed in this paper, applied to the unary regression model, can greatly improve the accuracy of the prediction results. This paper compares the prediction results at multiple maximum peak points in data set 1, data set 2, and data set 3, and the results are shown in Table 4. As can be seen from the table, the 13 peak points with the highest observed values are selected in data set 1 to calculate the improvement of MAE and RMSE, which improve by 33.45% and 28.73%, respectively. The 8 peak points with the highest observed values are selected in data set 2, where MAE and RMSE improve by 32.40% and 29.49%, respectively. In data set 3, the 11 peak points with the highest observed values are selected, and MAE and RMSE improve by 15.50% and 18.52%, respectively. In conclusion, the method proposed in this paper can improve the peak prediction results of the single-variable linear regression model. Taken together, experiment 1 and experiment 2 show that the proposed method can improve the prediction results in both overall prediction and peak prediction. Compared with the LSTM method, MAE … 75% and 19.54%, respectively. Therefore, the peak prediction method for hybrid data subflows proposed in this paper can effectively improve on the result of the underlying single flow prediction.
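The peak-point comparison can be sketched as below. The sketch assumes that "peak points" means the k test points with the highest observed values and that the reported percentages are relative error reductions against the baseline; both are interpretations of the text, and all names are hypothetical.

```python
import math


def top_k_peak_indices(observed, k):
    """Indices of the k highest observed values, returned in time order."""
    order = sorted(range(len(observed)), key=lambda i: observed[i], reverse=True)
    return sorted(order[:k])


def peak_errors(observed, predicted, k):
    """Return (MAE, RMSE) computed only at the k highest observed points."""
    indices = top_k_peak_indices(observed, k)
    if not indices:
        raise ValueError("no peak points selected")
    errors = [abs(observed[i] - predicted[i]) for i in indices]
    mae = sum(errors) / len(errors)
    rmse = math.sqrt(sum(err * err for err in errors) / len(errors))
    return mae, rmse


def relative_improvement(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error * 100.0
```

A positive value from `relative_improvement` means the new method has a lower error than the baseline at the selected peak points.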

Conclusions
In a hybrid data flow, the correlations between subflows at different times are uncertain. This paper establishes a delay correlation coefficient model, through which the delay correlation coefficient and the delay time are calculated. The prediction results of the respective flows are then calculated by using the peak prediction method for hybrid data flow. Experiments show that the DCCSPP model achieves good prediction results when there is uncertainty between the subflows in the hybrid flow.
In future work, we will introduce the correlation between subflows into the machine learning model, using machine learning methods to improve the accuracy of the delay correlation coefficient calculation and of the prediction results. The model can also be applied to dynamic hybrid data flows. We further plan to design a dynamic allocation scheme, based on the predicted peak of each subflow, that dynamically allocates resources to systems that require elastic scaling.

Data Availability
The data used in the paper came from an insurance company in China. Subject to the confidentiality agreement, the experimental data set cannot be disclosed to the public, and the name of the company cannot be mentioned in the paper. However, we guarantee that the data set used is authentic.

Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.