Traffic Flow Prediction Model for Large-Scale Road Network Based on Cloud Computing

To increase the efficiency and precision of traffic flow prediction for large-scale road networks, this paper proposes a genetic algorithm-support vector machine (GA-SVM) model based on cloud computing, built on an analysis of the characteristics and defects of the genetic algorithm and the support vector machine. In the cloud computing environment, the SVM parameters are first optimized by a parallel genetic algorithm, and the optimized parallel SVM model is then used to predict traffic flow. Using traffic flow data from Haizhu District in Guangzhou City, the proposed model was verified and compared with the serial GA-SVM model and the parallel GA-SVM model based on MPI (message passing interface). The results demonstrate that the parallel GA-SVM model based on cloud computing has higher prediction accuracy, shorter running time, and higher speedup.


Introduction
Large-scale road networks are highly complex, strongly nonlinear, and highly dynamic. The mass of traffic flow data makes traffic flow prediction very difficult, and traditional serial prediction methods cannot meet the requirements for real-time performance and accuracy. To solve this problem, researchers at home and abroad have studied parallel traffic flow prediction methods and have achieved some results. For example, Li et al. proposed a parallel traffic flow prediction method of space-time two-dimensional integration based on SVM, but this method is better suited to emergency cases and is not very practical under normal traffic conditions [1]. Deng realized a parallel neural network method based on dish network with the Charm++ programming model and applied it to traffic flow prediction; with 110 parallel nodes, the running time for two thousand links was 520.88 s [2]. Wang et al. implemented a parallel generalized neural network method for traffic flow prediction based on the MPI (message passing interface) programming model; the experimental results showed that the proposed method was more than twice as fast as the serial method [3]. Wang proposed a parallel traffic flow prediction method based on SVM; the experimental results showed that the parallel SVM method performed better than the parallel BP neural network method, and with 100 parallel nodes the running time for two thousand links was 36.48 s [4]. Zhang presented a short-term traffic flow prediction method based on a genetic neural network in a cloud computing environment; it ran more than fourteen times as fast as the serial genetic neural network method [5].
To some extent, the above research results can solve the problem of traffic flow prediction for large-scale road networks, but they have limitations such as huge resource consumption and long running time. Many studies show that SVM is widely used in traffic flow forecasting and has certain advantages [1,4,6]. However, SVM still has shortcomings; for example, it needs large storage space and long training time when it deals with large amounts of traffic flow data, so parallel SVM has been developed to reduce the computing cost and improve the running efficiency.

Support Vector Machine
SVM was developed on the basis of statistical learning theory. Numerous research results show that it has advantages in handling nonlinearity, high dimensionality, and local minima, which has made it a research hot spot [6]. Traffic flow prediction is a nonlinear regression problem, so SVM is widely used in this field. The idea of solving this problem is as follows.
Given the training set $T = \{(x_1, y_1), \ldots, (x_i, y_i), \ldots, (x_l, y_l)\}$, $x_i \in R^n$ is the vector of factors that affect the traffic prediction, $y_i \in R$ is the predicted value of traffic flow, $i = 1, \ldots, l$, and $l$ is the number of training samples. The traffic flow is closely related to the traffic flow in the preceding periods, so $x_i$ is the traffic flow in several preceding periods. A nonlinear function $\varphi(x) = [\varphi_1(x), \varphi_2(x), \ldots, \varphi_m(x)]^T$ is introduced to map the training data from the lower-dimensional space to a high-dimensional feature space. A linear decision function with weight vector $w$ and bias $b$ is then built, so that the original nonlinear problem becomes a linear problem in the high-dimensional feature space:

$$f(x) = w^T \varphi(x) + b. \qquad (1)$$

With an insensitive loss function, fitting (1) becomes a quadratic programming problem with inequality constraints. A kernel function $K(x_i, x) = \varphi(x_i)^T \varphi(x)$ is then introduced; we select the radial basis kernel function. The Lagrange multiplier method is used to obtain the dual problem, in which $\alpha_i$ and $\alpha_i^*$ are the Lagrange multipliers. Solving the dual problem rewrites the regression function (1) as

$$f(x) = \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) K(x_i, x) + b,$$

where $K(x_i, x) = e^{-\|x_i - x\|/\sigma^2}$, $\sigma > 0$. Thus the SVM penalty parameter $C$, the insensitive loss factor $\varepsilon$, and the kernel function parameter $\sigma$ strongly influence the calculation results, so we use a parallel genetic algorithm based on cloud computing to optimize them.
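For reference, the quadratic program and its dual mentioned above can be written in the standard textbook form of $\varepsilon$-insensitive support vector regression. This is a sketch under assumptions: the slack variables $\xi_i$, $\xi_i^*$ and the exact layout are the usual textbook ones and may differ from the notation the authors used; $w$ and $b$ are the weight vector and bias of the linear decision function $f(x) = w^T\varphi(x) + b$.

```latex
% Primal problem (slack variables \xi_i, \xi_i^* are textbook notation, assumed here)
\[
\min_{w,\,b,\,\xi,\,\xi^*} \ \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{l} (\xi_i + \xi_i^*)
\quad \text{s.t.} \quad
\begin{cases}
y_i - w^T\varphi(x_i) - b \le \varepsilon + \xi_i, \\
w^T\varphi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\
\xi_i \ge 0,\ \xi_i^* \ge 0, \quad i = 1, \ldots, l.
\end{cases}
\]

% Dual problem obtained with the Lagrange multipliers \alpha_i, \alpha_i^*
\[
\max_{\alpha,\,\alpha^*} \
-\frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l}
  (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) K(x_i, x_j)
- \varepsilon \sum_{i=1}^{l} (\alpha_i + \alpha_i^*)
+ \sum_{i=1}^{l} y_i (\alpha_i - \alpha_i^*)
\]
\[
\text{s.t.} \quad \sum_{i=1}^{l} (\alpha_i - \alpha_i^*) = 0,
\qquad 0 \le \alpha_i,\ \alpha_i^* \le C, \quad i = 1, \ldots, l.
\]
```

In this form the roles of the three tuned parameters are visible: $C$ bounds the multipliers, $\varepsilon$ sets the width of the insensitive zone, and $\sigma$ enters only through the kernel $K$.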

Parallel Genetic Optimization SVM
The genetic algorithm (GA) has two shortcomings. First, it is prone to premature convergence and may fall into a local optimum; second, the selection, crossover, and mutation steps are time consuming, resulting in low efficiency. Because the genetic algorithm is inherently parallel, the parallel GA emerged. In this paper, Hadoop is used to implement the parallel genetic algorithm, which helps avoid local convergence and improves efficiency. Optimizing the SVM with the parallel genetic algorithm is a search problem over a restricted region: the three parameters are limited to certain ranges based on the characteristics of traffic flow. Several key problems of parallel genetic optimization of SVM based on cloud computing are discussed below.

Chromosome Coding.
Chromosome coding translates the problem to be solved into an encoded string that the genetic algorithm can process; a suitable coding is convenient to calculate and speeds up the search for the optimal solution. The binary coding method is selected to encode the SVM parameters. In traffic flow prediction, suitable coding ranges of $C$, $\varepsilon$, and $\sigma$ are [0.1, 150], [0.01, 0.5], and [0.01, 10], respectively. Because 150 lies between $2^7$ and $2^8$, coding the parameters needs 8 binary bits; encoding length is determined by the time.
Decoding is the reverse process of translating the encoded string back into the form of the solution. The decoding formula recovers each parameter from the bits of its binary code, where every bit is 0 or 1.
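The coding and decoding steps can be sketched as follows. This is an illustrative sketch, not the authors' implementation: it assumes a linear mapping between the 8-bit integer and the parameter range, and the names `RANGES`, `encode`, `decode`, and `decode_chromosome` are hypothetical. Only the ranges and the 8-bit length come from the text.

```python
# Coding ranges and the 8-bit length are taken from the text.
# The linear scaling between bit pattern and value is an assumption.
RANGES = {"C": (0.1, 150.0), "epsilon": (0.01, 0.5), "sigma": (0.01, 10.0)}
N_BITS = 8


def encode(value, low, high, n_bits=N_BITS):
    """Encode a real value in [low, high] as a fixed-length binary string."""
    levels = 2 ** n_bits - 1
    index = round((value - low) / (high - low) * levels)
    index = max(0, min(levels, index))  # clamp values outside the range
    return format(index, "0{}b".format(n_bits))


def decode(bits, low, high):
    """Decode a binary string back to a real value in [low, high]."""
    levels = 2 ** len(bits) - 1
    return low + (high - low) * int(bits, 2) / levels


def decode_chromosome(chromosome):
    """Split a 24-bit chromosome into the three SVM parameters."""
    params = {}
    for position, (name, (low, high)) in enumerate(RANGES.items()):
        gene = chromosome[position * N_BITS:(position + 1) * N_BITS]
        params[name] = decode(gene, low, high)
    return params
```

With this scheme a chromosome is a 24-character string of 0s and 1s, so `decode_chromosome("00000000" + "11111111" + "00000000")` yields the lower bound of $C$, the upper bound of $\varepsilon$, and the lower bound of $\sigma$.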

Fitness Function.
A fitness function is defined to guide the evolution of the next generation and obtain the optimal solutions to the problem. A precise fitness function improves the quality of the solution and the speed of the algorithm.
Because the purpose of the SVM parameter optimization is to find the optimal parameters, the average relative error is chosen for fitness evaluation.
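A minimal sketch of such a fitness evaluation is given below. The average relative error follows the text; the conversion `fitness_from_error`, which turns a smaller error into a larger fitness so that it can be used with fitness-proportional selection, is an assumption added for illustration and is not taken from the paper. The function names are hypothetical.

```python
def mean_relative_error(actual, predicted):
    """Average of |actual - predicted| / |actual| (actual values must be nonzero)."""
    errors = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


def fitness_from_error(error):
    """Assumed mapping: lower error gives higher fitness, bounded in (0, 1]."""
    return 1.0 / (1.0 + error)
```

For example, actual flows of 100 and 200 with predictions of 110 and 180 give a relative error of 0.1 for each sample, so the average relative error is 0.1.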

Individual Choice.
The purpose of individual choice is to pass excellent individuals, those with higher fitness values, on to the next generation by copying, so that excellent individuals keep evolving. The roulette wheel selection method is chosen in this paper: the probability that an individual is selected equals its fitness value divided by the sum of the fitness values of all individuals in the population.
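Roulette wheel selection can be sketched as follows. This is a generic illustration of the technique with hypothetical function names; it assumes non-negative fitness values with a positive sum.

```python
import random


def selection_probabilities(fitness):
    """Selection probability of each individual: fitness / total fitness."""
    total = sum(fitness)
    return [f / total for f in fitness]


def roulette_select(population, fitness, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.random() * sum(fitness)
    cumulative = 0.0
    for individual, f in zip(population, fitness):
        cumulative += f
        if pick < cumulative:
            return individual
    return population[-1]  # guard against floating-point round-off
```

Passing an explicit `random.Random(seed)` object as `rng` keeps the selection reproducible, which is convenient when the same population is evaluated on several nodes.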

Crossover and Mutation.
Crossover and mutation are the key operations that determine the behavior of the GA. Crossover preserves the excellent genes of the parent individuals as far as possible and forms new individuals. Mutation keeps the population diverse and prevents the algorithm from being trapped in a local optimum. Because variation in nature serves adaptation to the environment, adaptive adjustment functions for the crossover rate and the mutation rate are introduced. The two rates are adjusted constantly to maintain population diversity and thus avoid premature convergence of the GA. The probability functions follow [7]. In them, $g_{\max}$ is the maximum fitness value in the current generation, $\hat{g}$ is the average fitness value in the current generation, $g_i$ is the fitness value of individual $i$ in the current generation, $g'$ is the larger fitness value of the two crossover individuals in the current generation, $L$ is the length of the chromosome, $(g_{\max} - g')/(g_{\max} - \hat{g})$ measures how good the fitter of the two crossover individuals is, $(g_{\max} - g_i)/(g_{\max} - \hat{g})$ measures how good individual $i$ is, and $k_1$ and $k_2$ are the adjustment coefficients.
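The idea of adaptive rates can be illustrated with the sketch below. It is not the probability function of [7]: it only scales a coefficient by the ratio quoted above, in the style of common adaptive genetic algorithms, and keeps the full rate for below-average individuals. That rule, the function name `adaptive_rate`, and the omission of the chromosome length are all assumptions made for illustration.

```python
def adaptive_rate(k, g_max, g_mean, g):
    """Assumed adaptive rule: scale coefficient k by (g_max - g) / (g_max - g_mean).

    g is the larger fitness of the two parents for crossover,
    or the fitness of the individual itself for mutation.
    """
    if g_max == g_mean or g < g_mean:
        # Degenerate population or below-average individual: use the full rate.
        return k
    return k * (g_max - g) / (g_max - g_mean)
```

Under this rule the best individual in a generation gets a rate of zero and is therefore preserved, while average and below-average individuals are recombined or mutated at the full rate `k`.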

The Genetic SVM Based on Cloud Computing
MapReduce is a cloud computing programming model from Google in which parallelism, fault tolerance, data distribution, and load balancing are handled by the system, which makes it very suitable for processing and generating large data sets [8,9]. MapReduce also offers advantages that MPI and other programming models lack, such as load balancing, elastic computing, and reduced bandwidth consumption and read latency, which can further improve the running efficiency of GA-SVM [10-12].
The parallel computing process is abstracted into two functions in MapReduce, map( ) and reduce( ), which are shown in Table 1 [13,14]. The data processing of MapReduce is illustrated in Figure 1 [15]. In the map phase, the original data is split into fragments composed of many key/value pairs ⟨k1, v1⟩, which are input to map( ) for processing; a group of intermediate key/value pairs ⟨k2, v2⟩ is output. The system then integrates and sorts the pairs ⟨k2, v2⟩ so that pairs with the same k2 are grouped together, and splits them into fragments that are input to reduce( ). Finally, the required key/value pairs ⟨k3, v3⟩ are output.
In the cloud computing environment, the genetic optimization of SVM requires three steps: preparing the training sample data, training the SVM, and forecasting traffic flow with the trained SVM. The specific steps are as follows.
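The flow of key/value pairs through map( ) and reduce( ) can be imitated in a single process as sketched below. This is a toy illustration of the programming model, not Hadoop code: the record format `"link_id,flow"`, the function names, and the per-link averaging task are all hypothetical.

```python
from collections import defaultdict


def map_fn(key, value):
    """<k1, v1> -> <k2, v2>: k1 is a record number, v1 is 'link_id,flow'."""
    link_id, flow = value.split(",")
    yield link_id, float(flow)


def reduce_fn(key, values):
    """<k2, [v2, ...]> -> <k3, v3>: average flow of one link."""
    yield key, sum(values) / len(values)


def run_mapreduce(records, map_fn, reduce_fn):
    """Apply map, group intermediate pairs by key, then apply reduce."""
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    output = {}
    for k2 in sorted(groups):  # the framework sorts intermediate keys
        for k3, v3 in reduce_fn(k2, groups[k2]):
            output[k3] = v3
    return output
```

In a real cluster the grouping step is performed by the framework between the map and reduce tasks; here it is the `groups` dictionary.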

Preparation of the Training Sample Data.
To improve the running speed of the algorithm, the collected traffic flow data are first preprocessed to generate the data set. The data are normalized by a formula that maps each collected value to a mapped (normalized) value.
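A common choice for such a normalization is min-max scaling to [0, 1], sketched below. The text does not preserve the normalization formula, so the min-max form and the name `min_max_normalize` are assumptions.

```python
def min_max_normalize(values):
    """Map collected values linearly onto [0, 1] (assumed normalization)."""
    low, high = min(values), max(values)
    if high == low:
        # All values identical: avoid division by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

The minimum and maximum should be taken from the training samples and reused for the forecasting samples, so that both are mapped with the same scale.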

Training of SVM Based on Parallel Genetic Algorithm.
The SVM is trained by the genetic algorithm based on MapReduce. Its process is a problem of quadratic programming. The specific steps are as follows [16,17].
(1) Generate an initial parameter population. An initial population of SVM parameters is generated randomly. The population is encoded and uploaded to Hadoop as a local file.
(5) Crossover and mutation operation. Crossover is performed on the two individuals selected from the child population through the method of inserting genes, which produces two new individuals. Mutation is then performed using the method of adaptive mutation to produce new individuals, which make up the offspring populations. They are read into the HDFS file system as key/value pairs.
(6) Termination conditions. Job Tracker judges whether the evolutionary generations approach the optimum. If so, the algorithm terminates: Job Tracker consolidates the results and outputs the optimal SVM parameters. If not, turn to step (7).

Parallel Genetic SVM Prediction Algorithm.
In this paper, the process of the traffic flow prediction algorithm based on GA-SVM in cloud computing environment is as follows [18].
To sum up, the parallel prediction model based on MapReduce adopts the idea of "divide and conquer." The sample data is divided among the child populations, and the GA is run on each child population through map( ) and reduce( ), so training the SVM takes less time. Parallel traffic flow prediction is then carried out with the trained SVM, which reduces the running time of the algorithm.
In the evaluation indexes and the speedup, $y(t)$ is the actual value, $\hat{y}(t)$ is the average prediction value, $n$ is the number of predictions, $T_s$ is the running time of the serial algorithm, and $T_p$ is the running time of the parallel algorithm.
The software environment is a Redhat Enterprise Linux 5.0 virtual machine with Hadoop 0.17.1, JRE 1.5, and Java. The hardware environment is 20 PCs: one PC serves as the master node and also as a slave node, and the remaining 19 PCs serve only as slave nodes. Hadoop 0.17.1 and JRE 1.5 are installed on the master node and copied to the slave nodes with the SCP command. Haizhu District of Guangzhou City contains 3,174 nodes and 8,914 links; its network diagram is shown in Figure 2. Haizhu District extends from Keyun Road in the east to Binjiang Road in the west and from Yuejiang Road in the north to Nanzhou Road in the south. One thousand links are chosen for traffic flow prediction. The data comes from the SCATS traffic information collection system, which generates a group of data every five minutes between 7 am and 7 pm, that is, 144 groups per day. The $4 \times 144 = 576$ groups of data from Monday to Thursday are used as the training samples to predict the traffic flow on Friday, and the 144 groups of data on Friday serve as the actual values for comparison with the prediction values.
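The evaluation indexes named in the result analysis (MRE, MAXRE, RMSE) and the speedup can be computed as sketched below. The definitions are the common ones (mean and maximum relative error, root mean square error, serial time divided by parallel time) and are assumed here, since the original formulas are not preserved; the function names are hypothetical. The running times in the usage note are the ones reported in the paper.

```python
import math


def relative_errors(actual, predicted):
    """Relative error of each prediction (actual values must be nonzero)."""
    return [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]


def mre(actual, predicted):
    """Mean relative error."""
    errors = relative_errors(actual, predicted)
    return sum(errors) / len(errors)


def maxre(actual, predicted):
    """Maximum relative error."""
    return max(relative_errors(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def speedup(t_serial, t_parallel):
    """Speedup: serial running time divided by parallel running time."""
    return t_serial / t_parallel
```

With the reported running times, `round(speedup(770.15, 45.42), 2)` gives 16.96, which matches the speedup stated for the MapReduce-based algorithm.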

Example and the Result Analysis
The number of parallel nodes is set to 1, 2, 4, 8, 16, and 20. The one thousand links are predicted by the serial GA-SVM algorithm, the parallel GA-SVM algorithm based on MPI, and the parallel GA-SVM algorithm based on MapReduce. The basic idea of the parallel GA-SVM algorithm based on MPI is "divide and conquer": the SVM is first trained by the parallel GA, and the trained SVM is then used for traffic flow prediction [19-21]. The performance of the three algorithms is compared through a numerical example.

Selection of Experimental Parameters.
When the SVM parameters are optimized, the number of parallel nodes is 4, the GA population size is 120, and the maximum number of generations is 400.

Result Analysis.
The SVM parameters are optimized by three algorithms: the serial GA, the parallel GA based on MPI, and the parallel GA based on MapReduce. The results of parameter optimization are shown in Table 2.
The performance of the serial algorithm and the parallel algorithms (with 16 parallel nodes) is compared in two respects: prediction accuracy and operation efficiency.

Prediction Accuracy.
The prediction results and RE curves of Link 103-104 obtained by the three algorithms are shown in Figures 3, 4, and 5. The prediction values of the two parallel algorithms fit the actual values better than those of the serial algorithm. When the traffic flow fluctuates greatly, the absolute relative error of the parallel algorithm based on MapReduce remains relatively stable. Table 3 lists the evaluation indexes of the prediction precision of the three algorithms. The MRE, MAXRE, and RMSE of the parallel algorithms are smaller than those of the serial algorithm, so their prediction precision is higher. Because the parallel genetic algorithm avoids the shortcomings of the traditional genetic algorithm, the SVM parameters are optimized better, and the prediction precision is improved.
Figure 6 compares the running time of the two parallel algorithms. When the number of parallel nodes is less than 4, the advantage of the parallel algorithm based on MapReduce is not obvious, because with too few parallel nodes the map phase takes more time. As the number of parallel nodes increases, the advantage of MapReduce gradually appears and the running time is greatly reduced. But when the number of parallel nodes is increased to 16, the running time of the two parallel algorithms decreases only slightly, because communication costs among the nodes grow with the number of parallel nodes and increase the communication time. Therefore, choosing a proper number of parallel nodes in the forecasting process can achieve high performance, save resources, and improve efficiency. The speedup is an important index of the efficiency of a parallel algorithm: the higher the speedup, the higher the efficiency. Figure 7 compares the speedup of the two parallel algorithms. As the number of parallel nodes increases, the speedup of both algorithms rises, and the speedup of the parallel algorithm based on MapReduce is much higher than that of the parallel algorithm based on MPI. When the number of

Conclusions
In this paper, we presented a traffic flow prediction model for large-scale road networks based on cloud computing, implemented by combining the genetic algorithm and the support vector machine. In the process of traffic flow forecasting, the model obtained the optimal parameters of the support vector machine, the highest prediction accuracy, and the shortest running time among the compared algorithms. Finally, we verified the superiority of the proposed algorithm and model through a numerical example based on Hadoop.
In future work, other algorithms should be introduced into the traffic flow prediction model for large-scale road networks, and the model should be verified on a larger road network to be closer to the actual situation.

(2) Initial population. The initialization of the population is completed by the master machine (Job Tracker). All individuals are divided into multiple child populations, and the parameters of the genetic algorithm are set. The parameters and the populations are then assigned to the slave machines (Task Trackers).
(3) Fitness evaluation. Call map( ). The child population number is defined as the key and the chromosomes as the value. Each Task Tracker evaluates the fitness of its population to get the fitness value of each individual. The values of key/value pairs that have the same key are reduced and stored in the local HDFS file system.
(4) Select operation. Call reduce( ). Job Tracker reads the position of the intermediate file and sends it to reduce( ). After receiving the instruction, reduce( ) reads the intermediate file from a Data Node and completes the selection operation for the child population. Each child population selects two individuals.

(1) The collected traffic flow sample data of the large-scale road network is preprocessed. Part of it is used as the training sample and the rest as the forecasting sample. Both are uploaded to Hadoop.
(2) Job Tracker divides the training sample and the forecasting sample automatically. It then reads the SVM parameters and assigns them to the Task Trackers together with the sample data. At this point, each Task Tracker has a small training sample.
(3) Task Trackers call map( ). Each small training sample is trained to output prediction results. The prediction results of each training sample are sorted by Job Tracker.
(4) Job Tracker then calls reduce( ). Finally, the prediction data tables of the entire road network are output. The MapReduce process is over, and the whole algorithm terminates.

Design of Experiment.
The parallel traffic flow prediction program for the large-scale road network is developed with the Java language, Hadoop, the GA, and SVM. The parallel computing experiment platform consists of 20 PCs. Experiments are carried out to test the proposed algorithm on real-time data of Haizhu District in Guangzhou City.

Figure 3: (a) Prediction results based on the serial algorithm; (b) RE based on the serial algorithm.

Table 1: Map and Reduce functions.

Table 2: SVM parameter values of three kinds of model.

Table 3: Evaluation index of three kinds of algorithms.

The speedup of the parallel algorithm based on MapReduce is $S = T_s/T_p = 770.15\,\text{s}/45.42\,\text{s} = 16.96$; that is, it runs 16.96 times as fast as the serial algorithm.