A Review of Traffic Congestion Prediction Using Artificial Intelligence



Introduction
Artificial intelligence (AI) is among the most important branches of computer science in this era of big data. AI was born some 50 years ago and has come a long way, making encouraging progress, especially in machine learning, data mining, computer vision, expert systems, natural language processing, robotics, and related applications [1]. Machine learning is the most popular branch of AI. Other classes of AI include probabilistic models, deep learning, artificial neural network systems, and game theory. These classes are developed and applied in a wide range of sectors. Recently, AI has become a leading research area in transportation engineering, especially in traffic congestion prediction.
Traffic congestion has a direct and indirect impact on a country's economy and its dwellers' health. According to Ali et al. [2], traffic congestion costs Pak Rs. 1 million every day in terms of opportunity cost and fuel consumption. Traffic congestion also affects people at the individual level. Time loss, especially during peak hours, mental stress, and the added pollution contributing to global warming are other important consequences of traffic congestion.
Ensuring economic growth and road users' comfort are two requirements for the development of a country, and neither is possible without smooth traffic flow. With developments in the transportation sector in collecting traffic information, authorities are paying more attention to traffic congestion monitoring. Traffic congestion prediction gives the authorities the time required to plan the allocation of resources to make the journey smooth for travellers. The traffic congestion prediction problem discussed in this paper can be defined as the estimation of parameters related to traffic congestion into the short-term future, e.g., 15 minutes to a few hours, by applying different AI methodologies to collected traffic data. There are usually five parameters to evaluate while monitoring and predicting traffic congestion: traffic volume, traffic density, occupancy, traffic congestion index, and travel time. Depending on the nature of the collected data, a variety of AI approaches are applied to evaluate the congestion parameters. This article systematically discusses the models and their advantages and disadvantages. The primary motivation of this review is to gather the articles focusing solely on traffic congestion prediction models. The keywords used in the search process included "traffic congestion prediction" OR "traffic congestion estimation" OR "congestion prediction modelling" OR "prediction of traffic congestion" OR "road congestion forecast" OR "traffic congestion forecast." For efficient screening, the search for research papers was done by year using search engines such as Scopus, Google Scholar, and Science Direct. After collecting all the peer-reviewed journal and conference papers written in the English language, 48 articles were found for review. Studies focusing on the cause of traffic congestion, traffic congestion control, traffic congestion impact, traffic congestion propagation, traffic congestion prevention, etc. were excluded from this manuscript.
A general layout of the prediction approaches is provided in Section 2. The data collection sources and congestion forecasting models are explained in Sections 3-6, which also provide the overall discussion and concluding remarks.

General Layout
Traffic congestion forecasting has two basic steps: data collection and prediction model development. Every step of the methodology is important and may affect the results if not done correctly. After data collection, data processing plays a vital role in preparing the training and testing datasets. The case area differs between studies. After the model is developed, it is validated against other base models and ground-truth results. Figure 1 shows the general components of traffic congestion prediction studies. These branches were further divided into more specific sub-branches and are discussed in the following sections.

Data Source
Traffic datasets used in different studies can be mainly divided into two classes: stationary and probe data. Stationary data can be further divided into sensor data and fixed cameras. On the other hand, the probe data used in the studies were GPS data from devices mounted on vehicles.
Stationary sensors continuously capture spatiotemporal traffic data. However, sensor operation may be interrupted at any time. Authorities should always consider this temporary sensor failure when planning with these data. The advantage of sensor data is that there is no confusion about the location of the vehicles. The most used dataset was the Performance Measurement System (PeMS), which collects real-time highway data on traffic flow, sensor occupancy, and travel speed across all major metropolitan areas of the State of California. Most of the studies used the dataset from the I-5 highway in San Diego, California, sampled every 5 minutes [3-6]. Other systems included the Genetec blufaxcloud travel-time system engine (GBTTSE) [7] and the Topologically Integrated Geographic Encoding and Referencing (TIGER) line graph [8].
On the other hand, probe data have the advantage of covering the entire road network. A network consists of differently structured roads. Therefore, studies, especially those that considered a network-wide area, used probe data. The most used dataset was GPS data collected every second from approximately 20,000 taxis in Beijing, China. The data included the taxi number, the latitude-longitude of the vehicle, the sampling timestamp, and whether there was a passenger or not. The data updating frequency of this dataset varies from 10 s to 5 min according to the quality of the GPS device [4,5,9]. Other probe data included low-frequency Probe Vehicle Data (PVD) [10] and bus GPS data [11,12]. However, sometimes probe data show significant fluctuation. Besides, map matching is usually a must for probe data. But data can minimize this limitation. Probe data collected from one city cannot be used directly for modelling other city networks. This is because the data collected from Beijing, China, for example, include the latitude-longitude of the vehicle, which is unique. However, a generalised model using probe data can be generated for different cities.
Other data sources, e.g., data from tolling systems and data provided by transportation authorities, add more reliable data as the sources are dependable. However, the study area often needs to be adjusted because, in most cases, tolled road information is not available. Tracking cellular phone movements without privacy breach can also be a source of data. However, the heterogeneity of the vehicle distribution will be hard, if not impossible, to determine from this dataset. Besides, because pedestrians and cyclists travel along the sidewalk, there might be many outliers in the dataset if modelling is done for a road network. Data collected from a questionnaire given to the general public or drivers may provide a misleading result [13].

Clustering Algorithms.
Some studies cluster the acquired data before applying the main congestion prediction models. This hybrid modelling technique is applied to fine-tune the input values and to use them in the training phase. Figure 2 shows the commonly used AI clustering models in this field of research. The models are described briefly in this section.
Fuzzy C-Means (FCM) is a popular nondeterministic clustering technique in data mining. In traffic engineering research, traffic pattern recognition plays an important role. Besides, these studies often face the limitation of missing or incomplete data. To deal with these constraints, FCM has become a commonly applied clustering technique. The advantage of this approach is that, unlike the original C-means clustering methods, it can overcome the issue of getting trapped in a local optimum [14]. However, FCM requires a predefined cluster number, which is not always possible to set when dealing with massive data without any prior knowledge of the data dimension. Besides, this model becomes computationally expensive as the data size increases. Different studies have applied FCM successfully by addressing its limitations. Some studies changed the fuzzy index value for each FCM algorithm execution [15], some calculated the Davies-Bouldin (DB) index [10], while others applied the K-means clustering algorithm [16,17].
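The FCM procedure described above can be illustrated with a minimal pure-Python sketch. It is not taken from any of the reviewed studies: the function name `fuzzy_c_means`, the fuzzy index default `m=2.0`, the fixed iteration count, and the toy speed values are all assumptions made for illustration. Note that the cluster number `c` has to be supplied in advance, which is the limitation discussed in the text.

```python
import random


def fuzzy_c_means(points, c, m=2.0, iters=100, seed=0):
    """Illustrative FCM: returns (centres, memberships) for a list of points."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    centres = []
    for _ in range(iters):
        # Centre update: mean of the points weighted by membership ** m.
        centres = []
        for j in range(c):
            w = [u[i][j] ** m for i in range(n)]
            w_sum = sum(w)
            centres.append(
                [sum(w[i] * points[i][d] for i in range(n)) / w_sum
                 for d in range(dim)]
            )
        # Membership update from the relative distances to every centre.
        for i in range(n):
            dists = [
                max(1e-12, sum((points[i][d] - centres[j][d]) ** 2
                               for d in range(dim)) ** 0.5)
                for j in range(c)
            ]
            for j in range(c):
                u[i][j] = 1.0 / sum(
                    (dists[j] / dists[k]) ** (2.0 / (m - 1.0)) for k in range(c)
                )
    return centres, u


# Hypothetical link speeds (km/h): three slow and three fast observations.
speeds = [[10.0], [12.0], [11.0], [60.0], [62.0], [61.0]]
centres, memberships = fuzzy_c_means(speeds, c=2)
```

Each row of `memberships` gives the degree to which one observation belongs to each cluster, which is what distinguishes FCM from a hard assignment.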
K-means clustering is an effective and relatively flexible algorithm for dealing with large datasets. It is a popular unsupervised machine learning algorithm. Depending on the features, the cluster number varied from two [18] to 50 [19-21]. Like FCM, K-means clustering requires a predefined cluster number and the selection of K original cluster centres. GAP [22] and the WEKA toolbox [23] were used to define the value. For large datasets, as the sample distribution is unknown in the beginning, it is not always possible to fulfil these two requirements. A few studies used adaptive K-means clustering to overcome these limitations and exploited the pattern using principal component analysis (PCA) [24,25].
DBSCAN is more of a general clustering application in machine learning and data mining. This method overcomes the limitation of FCM of predefining the cluster number. It
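A minimal sketch of the standard K-means iteration (Lloyd's algorithm) shows the two requirements mentioned above: the cluster number `k` and the initial centres. The function name `k_means`, the random choice of initial centres, and the toy data are assumptions for illustration, not the procedure of any cited study.

```python
import random


def k_means(points, k, iters=50, seed=0):
    """Illustrative K-means: returns (centres, labels)."""
    rng = random.Random(seed)
    # Requirement 1 and 2: k is given, and k initial centres are picked
    # (here: k distinct observations chosen at random).
    centres = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centres[j])))
            for p in points
        ]
        # Update step: each centre moves to the mean of its members.
        new_centres = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centres.append(
                    [sum(col) / len(members) for col in zip(*members)]
                )
            else:
                new_centres.append(centres[j])  # keep an empty cluster's centre
        if new_centres == centres:
            break  # converged
        centres = new_centres
    return centres, labels


# Hypothetical link speeds (km/h).
speed_points = [[10.0], [11.0], [12.0], [60.0], [61.0], [62.0]]
km_centres, km_labels = k_means(speed_points, k=2)
```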

Applied Methodology
Traffic flow is a complex amalgamation of a heterogeneous traffic fleet. Thus, traffic pattern prediction modelling could be an easy and efficient congestion prediction approach. However, depending on the data characteristics and quality, different classes of AI are applied in various studies. Figure 3 shows the main branches: probabilistic reasoning and machine learning (ML). Machine learning comprises both shallow and deep learning algorithms. However, as this article progresses, these sections are subdivided into detailed algorithms.
Generalising traffic congestion forecasting studies that use different models is not straightforward.
The common factors of all the articles include the study area, data collection horizon, predicted parameter, prediction intervals, and validation procedure. Most of the articles took a corridor segment as the study area [5,27-30]. Other study areas included the traffic network [31,32], ring road [9], and arterial road [33]. The data collection horizon varied from 2 years [34] to less than a day [35] in the studies. Congestion estimation is done by predicting traffic flow parameters, e.g., traffic speed [4], density, speed [5], and congestion index [31], to mention a few. The Congestion Index (CI) approach is suitable for monitoring the congestion level continuously in a spatiotemporal dimension. Studies that compared their results with the ground-truth value or with other models used mean absolute error (MAE) (equation (1)), symmetric mean absolute percentage error (sMAPE) (equation (2)), MAPE, root-mean-squared error (RMSE) (equation (3)), false positive rate (FPR) (equation (4)), and detection rate (DR) (equation (5)). Many studies used SUMO to validate their models: where Y = original value, Y_i = predicted value, and n = number of instances.
where FP, TN, FN, and TP represent the false positive, true negative, false negative, and true positive, respectively. The rest of this section will discuss the methodology the authors have applied in the studies.
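For reference, the validation measures named above can be sketched with their widely used textbook definitions. These are assumptions about the form of equations (1)-(5), not a transcription of them; in particular, sMAPE has several variants in the literature, and the one below (absolute error divided by the mean of the absolute actual and predicted values) is only one of them. The function names and sample values are illustrative.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def smape(actual, predicted):
    """Symmetric MAPE in percent (one common variant; assumed form)."""
    terms = [
        abs(a - p) / ((abs(a) + abs(p)) / 2.0)
        for a, p in zip(actual, predicted)
    ]
    return 100.0 * sum(terms) / len(actual)


def rmse(actual, predicted):
    """Root-mean-squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def false_positive_rate(fp, tn):
    """Share of non-congested cases wrongly flagged as congested."""
    return fp / (fp + tn)


def detection_rate(tp, fn):
    """Share of congested cases that were correctly detected."""
    return tp / (tp + fn)


# Hypothetical observed and predicted speeds (km/h).
observed = [50.0, 60.0, 40.0]
forecast = [45.0, 60.0, 50.0]
print(mae(observed, forecast), rmse(observed, forecast))
```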

Probabilistic Reasoning.
Probabilistic reasoning is a significant section of AI. It is applied to deal with uncertain knowledge and reasoning. A variety of these algorithms are commonly used in traffic congestion prediction studies. The studies discussed here under probabilistic reasoning are shown in Figure 4.

Fuzzy Logic.
Fuzzy logic, introduced by Zadeh, is a commonly applied model in dynamic traffic congestion prediction as it allows vagueness instead of binary outcomes. In this method, several membership functions are developed that represent the degree of truth. As traffic data grow in volume over time, they are becoming complex and nonlinear. Due to its ability to deal with uncertainty in the dataset, fuzzy logic has become popular in traffic congestion prediction studies.
A fuzzy system comprises several fuzzy sets, which are built from membership functions.
There are usually three codification shapes to choose from for the membership functions (MFs) of the input: triangular, trapezoidal, and Gaussian. The fuzzy rule-based system (FRBS) is the most common fuzzy logic system in traffic engineering research. It consists of several IF-THEN rules that logically relate the input variables to the output. It can effectively deal with the complexity resulting from real-world traffic situations by representing them in simple rules. These rules combine the relations among different traffic states to detect the resulting traffic condition [36]. However, with the growth in data complexity, the total number of rules also grows, lessening the accuracy of the whole system and making it computationally expensive. To better manage this problem, two types of fuzzy logic control are applied. In hierarchical control (HFRBS), the input variables are ordered according to their significance and MFs are employed. Figure 5 shows a simple HFRBS structure. MFs are optimized by applying different algorithms, e.g., genetic algorithm (GA) [30], hybrid genetic algorithm (GA), and cross-entropy (CE).

Figure 3: Branches of artificial intelligence in this article (probabilistic reasoning, shallow machine learning, and deep machine learning).
Studies [28,37] compared the performance of evolutionary crisp rule learning (ECRL) and evolutionary fuzzy rule learning (EFRL) for road traffic congestion prediction. It was seen that ECRL models outperformed EFRL in terms of averaged accuracy and number of rules but were computationally expensive. The Takagi-Sugeno-Kang (TSK) FRBS model is one of the simplest fuzzy models due to its mathematical tractability. A weighted average computes the output of this model. Another simple FRBS model is the Mamdani-type model. The output of this model is a fuzzy set that needs defuzzification, which is time-consuming. Due to its good interpretability, it can improve the accuracy of fuzzy linguistic models. Cao and Wang [3] applied this model to show the congestion severity change among road grades. A few studies used this method to fuse heterogeneous parameters [7,13]. The TSK model works on improving the interpretability of an accurate fuzzy model. TSK is applied for its fast calculation characteristics [37]. The fuzzy comprehensive evaluation (FCE) uses the principle of fuzzy transformation and maximum membership degree. This model consists of several layers and is a useful objective evaluation method that assesses all relevant factors. The number of layers depends on the complexity of the objective and the number of factors. Kong et al. [4] and Yang et al. [5] applied FCE in which the weights and the fuzzy matrix of multi-indexes were adapted according to the traffic flow to estimate the traffic congestion state. Adaptive control adjusts the weight coefficient based on a judgement matrix. Certain weights are assigned to calculate the membership degree of the parameters [35].
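The maximum membership principle behind FCE can be sketched for a single layer as follows. This is a simplified illustration under stated assumptions: the weighted-average composition operator, the function name `fuzzy_comprehensive_evaluation`, the three factors, and all numeric values are hypothetical and are not taken from Kong et al. [4] or Yang et al. [5].

```python
def fuzzy_comprehensive_evaluation(weights, membership_matrix, grades):
    """Single-layer FCE: combine factor memberships, pick the strongest grade.

    weights            one weight per factor (assumed to sum to 1)
    membership_matrix  one row per factor, one column per congestion grade
    grades             labels of the congestion grades
    """
    n_grades = len(grades)
    # Fuzzy transformation with the weighted-average operator: B = W . R
    scores = [
        sum(w * row[g] for w, row in zip(weights, membership_matrix))
        for g in range(n_grades)
    ]
    total = sum(scores)
    scores = [s / total for s in scores]
    # Maximum membership degree decides the evaluated state.
    best = max(range(n_grades), key=lambda g: scores[g])
    return grades[best], scores


# Hypothetical factors: speed, density, volume.
factor_weights = [0.5, 0.3, 0.2]
fuzzy_matrix = [
    [0.1, 0.2, 0.5, 0.2],  # speed memberships per grade
    [0.0, 0.3, 0.4, 0.3],  # density memberships per grade
    [0.2, 0.3, 0.3, 0.2],  # volume memberships per grade
]
state, grade_scores = fuzzy_comprehensive_evaluation(
    factor_weights, fuzzy_matrix, ["free", "light", "medium", "severe"]
)
```

In an adaptive scheme of the kind described above, `factor_weights` and `fuzzy_matrix` would be recomputed as the traffic flow changes rather than fixed in advance.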
Other than GA and PSO, the Ant Colony Optimization (ACO) algorithm was also introduced into the fuzzy logic system by Daissaoui et al. [38]. They provided the theory for a smart city, where each vehicle's GPS data was taken as a pheromone, consistent with the concept of ACO. The objective was to predict traffic congestion one minute ahead from the information (pheromone) provided by past cars. However, the article does not give any results in support of the model.
As discussed before, with the development of optimisation algorithms, the optimisation of the fuzzy logic system's membership functions is becoming diverse. With time, the simplest form of FRBS, TSK, has become popular due to its good interpretability. Some other sectors of transportation where fuzzy logic models are popular include traffic light/signal control [39,40], traffic flow prediction (Zhang and Ye [41]), traffic accident prediction [42], and modified fuzzy logic for freeway travel time estimation (Zhang and Ge [43]). The fuzzy logic system is the only probabilistic reasoning model whose outcome can go beyond a congested/noncongested traffic state. This is one of the main advantages that has made this methodology popular. However, no study has provided any reasonable logic for selecting the membership function, which is a significant limitation of fuzzy logic models.
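The three membership-function shapes mentioned in this section (triangular, trapezoidal, and Gaussian) and a single IF-THEN rule can be sketched as below. All breakpoints, the variable names, and the use of the minimum operator for AND are illustrative assumptions; as noted above, the reviewed studies give no general logic for choosing them.

```python
import math


def triangular(x, a, b, c):
    """Triangular MF with feet at a and c and peak at b (a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)


def trapezoidal(x, a, b, c, d):
    """Trapezoidal MF with feet at a and d and plateau on [b, c]."""
    if x <= a or x >= d:
        return 0.0
    if b <= x <= c:
        return 1.0
    return (x - a) / (b - a) if x < b else (d - x) / (d - c)


def gaussian(x, mean, sigma):
    """Gaussian MF centred at mean with spread sigma."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def rule_congested(speed_kmh, density_veh_km):
    """IF speed is low AND density is high THEN traffic is congested.

    Returns the firing strength of the rule; AND is taken as the minimum.
    The breakpoints below are hypothetical.
    """
    speed_low = trapezoidal(speed_kmh, -1.0, 0.0, 20.0, 40.0)
    density_high = trapezoidal(density_veh_km, 40.0, 60.0, 150.0, 151.0)
    return min(speed_low, density_high)


print(rule_congested(30.0, 50.0))
```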

Hidden Markov Model.
The hidden Markov model (HMM) is a combination of the stochastic characteristics of the Markov process and the discrete characteristics of Markov chains. It is a stochastic, time-series event recognition technique. Some studies have applied the Markov chain model for traffic pattern recognition during congestion prediction [21,25,44]. The Pearson correlation coefficient (PCC) is commonly applied among the parameters during pattern construction. Zaki et al. [32] applied HMM to select the appropriate prediction model from several models they developed by applying the adaptive neurofuzzy inference system (ANFIS). They obtained the optimal state transition in four processing steps: initialization, recursion, termination, and backtracking. The last step analysed the previous step to determine the probability of the current state by using the Viterbi algorithm. Based on the log-likelihood of the initial model parameter, defined by the expectation maximization (EM) algorithm, of the HMM with the traffic pattern, a suitable congestion model was selected for prediction. Mishra et al. [23] applied the discretised multiple symbol HMM (MS-HMM) prediction model named future state prediction (FSP). They evaluated model adaptability for different road segments. A label was generated containing the hidden states of MS-HMM, and the output was used by FSP to produce the next hidden state label.
In traffic engineering, especially while utilising probe vehicle data, HMM is very useful in map matching. Sun et al. [45] applied HMM for mapping the trajectory of observed GPS points onto nearby roads. These candidate points were taken as hidden states of the HMM. The candidate points closer to the observation point had a higher observation probability.
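The four Viterbi steps named above (initialization, recursion, termination, and backtracking) can be sketched for a small discrete HMM. The two hidden states, the two observation symbols, and every probability below are hypothetical values chosen for illustration; they do not come from Zaki et al. [32] or Mishra et al. [23].

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a discrete HMM."""
    # Initialization: start probability times emission of the first symbol.
    prob = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    back = [{}]
    # Recursion: best predecessor for every state at every later step.
    for t in range(1, len(observations)):
        prob.append({})
        back.append({})
        for s in states:
            prev, p = max(
                ((r, prob[t - 1][r] * trans_p[r][s]) for r in states),
                key=lambda pair: pair[1],
            )
            prob[t][s] = p * emit_p[s][observations[t]]
            back[t][s] = prev
    # Termination: most probable final state.
    last = max(states, key=lambda s: prob[-1][s])
    # Backtracking: follow the stored predecessors to recover the path.
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, prob[-1][last]


hmm_states = ["free", "congested"]
hmm_start = {"free": 0.6, "congested": 0.4}
hmm_trans = {
    "free": {"free": 0.7, "congested": 0.3},
    "congested": {"free": 0.4, "congested": 0.6},
}
hmm_emit = {
    "free": {"fast": 0.8, "slow": 0.2},
    "congested": {"fast": 0.1, "slow": 0.9},
}
best_path, best_prob = viterbi(
    ["fast", "slow", "slow"], hmm_states, hmm_start, hmm_trans, hmm_emit
)
```

For long observation sequences, log-probabilities are normally used instead of raw products to avoid numerical underflow; the plain products are kept here for readability.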

Journal of Advanced Transportation
Transition probability of two adjacent candidates was also considered to avoid the misleading results generated from abrupt traffic situations.
HMM shows accuracy in selecting a traffic pattern or a traffic point. It has the advantage that it can deal with data containing outliers. However, while points with a short sampling interval seem to be matched well, long intervals and highly similar probe data decreased the model accuracy. Studies have found a significant mismatch for datasets with long sampling intervals and for similar road networks. The GPS tracking system has been widely developed in this satellite era. Thus, HMM modelling is currently more relevant for map matching. Other sectors of transport where HMM is applied include traffic prediction [46], modified HMM for speed prediction [47], and traffic flow state transition [48].

Gaussian Distribution.
Gaussian processes have proven to be a successful tool for regression problems. Formally, a Gaussian process is a collection of random variables, any finite number of which obeys a joint Gaussian prior distribution. For regression, the function to be estimated is assumed to be generated by an infinite-dimensional Gaussian distribution, and the observed outputs are contaminated by additive Gaussian noise.
Yang [29] applied the Gaussian distribution for traffic congestion prediction. This study was divided into three parts. First, sensor ranking was done according to the volume quality by applying the p test. In the second part of the study, the probability of congestion occurring was determined with a statistics-based method. In the learning phase of this part, two Gaussian probability models were developed from two datasets for every point of interest. In the decision phase, the model that the input traffic volume value fitted was identified, and a prediction score representing the congestion state was determined from the ratio of the two models. Finally, the probability of congestion occurring at the point of interest was found by combining and sorting the prediction scores from all the ranked sensors. Zhu et al. [49] also presented the probability of the traffic state distribution. Selection of the mean and variance parameters of the Gaussian distribution is an important step. In this study, the EM algorithm was applied for this purpose. The first step generated the log-likelihood expectation for the parameters, whereas the last step maximised it. Sun et al. [45] approximated the error in GPS location on the road with a Gaussian distribution with mean 0. The error was calculated from the actual GPS point, the matching point on the road section, and the standard deviation of the GPS measurement error.
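The two-model decision phase described above can be sketched as follows. The exact form of the score in Yang [29] is not given here, so the normalised likelihood ratio used below is an assumption, as are the function names and the toy traffic volumes.

```python
import math
from statistics import mean, pstdev


def fit_gaussian(samples):
    """Learning phase: estimate (mean, standard deviation) from samples."""
    return mean(samples), pstdev(samples)


def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def congestion_score(volume, congested_model, free_model):
    """Decision phase: score in [0, 1], higher means closer to congestion.

    Assumed form: likelihood under the congested model divided by the sum
    of the likelihoods under both models.
    """
    p_congested = gaussian_pdf(volume, *congested_model)
    p_free = gaussian_pdf(volume, *free_model)
    return p_congested / (p_congested + p_free)


# Hypothetical volumes (vehicles per interval) at one point of interest.
congested_model = fit_gaussian([90.0, 100.0, 110.0])
free_model = fit_gaussian([30.0, 40.0, 50.0])
print(congestion_score(100.0, congested_model, free_model))
```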
From the abovementioned studies, it is seen that the Gaussian distribution model has a useful application in reducing feature numbers without compromising the quality of the prediction results or for location error estimation while using GPS data. Gaussian distribution is also applied in traffic volume prediction [50], traffic safety [51], and traffic speed distribution variability [52].

Bayesian Network.
A Bayesian network (BN), also known as a causal model, is a directed graphical model for representing conditional independencies between a set of random variables. It is a combination of probability theory and graph theory and provides a natural tool for dealing with two problems that occur throughout applied mathematics and engineering: uncertainty and complexity [53].
Asencio-Cortés et al. [54] applied an ensemble of seven machine learning algorithms to compute the traffic congestion prediction. This methodology was developed as a binary classification problem applying the HIOCC algorithm. The machine learning algorithms applied in this study were K-nearest neighbour (K-NN), C4.5 decision trees (C4.5), artificial neural network (ANN) with the backpropagation technique, stochastic gradient descent optimisation (SGD), fuzzy unordered rule induction algorithm (FURIA), Bayesian network (BN), and support vector machine (SVM).
Three of these algorithms (C4.5, FURIA, and BN) can produce interpretable models of viewable knowledge. A set of ensemble learning algorithms was applied to improve the results found from these prediction models.
The ensemble algorithm group included bagging, boosting (AdaBoost M1), stacking, and Probability Threshold Selector (PTS). The authors found a significant improvement in precision for BN after applying ensemble algorithms. On the other hand, Kim and Wang [34] applied BN to determine the factors that affect congestion initialization on different road sections. The model developed in this study gave a framework for ranking and prioritizing different scenarios.
The Bayesian network is seen to perform better with ensemble algorithms or when modified, as in other transport applications such as traffic flow prediction [55] and parameter estimation at signalised intersections [56,57].

Others.
Other than the models mentioned above, the Kalman Filter (KF) is also a popular probabilistic algorithm. With the increase in available data, data fusion methods are becoming popular. The fusion of historical and real-time traffic data can achieve a higher level of traffic congestion prediction accuracy. In this regard, KF is commonly applied. Extended KF (EKF) is an extension of KF, which can be used to stochastically filter nonlinear noise to improve the mean and covariance of an estimated state. Therefore, after data fusion, EKF was used to update the estimated covariance error by removing outliers [7].
Wen et al. [8] applied GA in traffic congestion prediction from a spatiotemporal traffic environment. Temporal association rules were extracted from the traffic environment by applying GA-based temporal association rules (GATARs). Their proposed Hybrid Temporal Association Rules Mining method (HTARM) included the DBSCAN and GATAR methods. The DBSCAN application method was discussed previously in this article. While encoding using GATAR, the road section number and congestion level were included in the chromosome. The decoding was done to obtain temporal association rules, which were sorted according to confidence and support value in the rule pool. For both simulated and real-world scenarios, the proposed HTARM method outperformed GATAR in terms of extracting temporal association rules and prediction accuracy. However, the cluster numbers differed considerably between the two scenarios. Besides, as the road network complexity increased, the prediction accuracy decreased. Table 1 summarises the methodologies and different parameters used in the various studies discussed so far.
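As a rough illustration of the predict-and-update cycle that underlies KF-based fusion, the sketch below implements a scalar (one-dimensional) linear Kalman filter with a random-walk state model. It is deliberately simpler than the EKF used in [7]; the function name `kalman_1d`, the noise variances, and the speed readings are assumptions made for illustration.

```python
def kalman_1d(measurements, x0, p0, process_var, meas_var):
    """Scalar Kalman filter; returns the filtered estimate after each reading.

    x0, p0       initial state estimate and its variance
    process_var  variance added at each prediction step (model uncertainty)
    meas_var     variance of the measurement noise
    """
    x, p = x0, p0
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, its uncertainty grows.
        p = p + process_var
        # Update: blend prediction and measurement using the Kalman gain.
        gain = p / (p + meas_var)
        x = x + gain * (z - x)
        p = (1.0 - gain) * p
        estimates.append(x)
    return estimates


# Hypothetical noisy speed readings (km/h) around a true value near 50.
filtered = kalman_1d([50.0, 52.0, 48.0, 51.0, 49.0],
                     x0=40.0, p0=100.0, process_var=0.01, meas_var=4.0)
```

A large initial variance `p0` lets the first measurements dominate the poor initial guess, after which the gain shrinks and the estimate becomes smoother.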

Shallow Machine Learning.
Shallow machine learning (SML) algorithms include traditional and simple ML algorithms. These algorithms usually consist of a few, and often only one, hidden layer. SML algorithms cannot extract features from the input, and features need to be defined beforehand. Model training can only be done after feature extraction. SML algorithms and their application in traffic congestion studies are discussed in this section and shown in Figure 6.

Artificial Neural Network.
The artificial neural network (ANN) was developed to mimic the function of the human brain in order to solve different nonlinear problems. It is a first-order mathematical or computational model that consists of a set of interconnected processors or neurons. Figure 7 shows a simple ANN structure. Due to its easy implementation and efficient forecasting ability, ANN has become popular in the field of traffic congestion prediction research. The Hopfield network, feedforward network, and backpropagation are examples of ANN. The feedforward neural network (FNN) is the simplest NN, where the input data go to the hidden layer and from there to the output layer. The backpropagation neural network (BPNN) consists of feedforward and weight adjustment of the layers and is the most commonly applied ANN in transportation management. Xu et al. [31] applied BPNN to predict traffic flow and thus to evaluate the congestion factor in their study. They proposed an occupancy-based congestion factor (CRO) evaluation method alongside three other evaluated congestion factors based on mileage ratio of congestion (CMRC), road speed (CRS), and vehicle density (CVD). They also evaluated the effect of data size on real-time rendering of road congestion. Complex road networks with higher interconnections showed higher complication in simulation and rendering. The advantage of the proposed model was that it took little processing time for high-sampling data rendering. The model can be used as a general congestion prediction model for different road networks. Some studies used hybrid NN for congestion prediction. Nadeem and Fowdur [11] predicted congestion in spatial space, applying the combination of one of six SML algorithms with NN. The six SML algorithms included moving average (MA), autoregressive integrated moving average (ARIMA), linear regression, second- and third-degree polynomial regression, and k-nearest neighbour (KNN). The model showing the least RMSE value was combined with BPNN to form the hybrid NN. The hidden layer had seven neurons, a number determined by trial and error. However, it was a very preliminary level of work. It did not show the effect of data increment on the accuracy.
Unlike the previous studies, which focused on traffic flow parameters to conduct traffic congestion prediction research, Ito and Kaneyasu [60] analysed drivers' behaviour in predicting congestion. They showed that vehicle operators act differently in different phases of the journey. They used a one-layered BPNN to learn the behaviour of female drivers and extract the travel phase accordingly. The results showed an average efficiency of 82% in distinguishing the travel phase.
ANN is a useful machine learning model that has a flexible structure. The neurons of a layer can be adapted according to the input data. As mentioned above, a general model can be developed and applied for different road types by using the nonlinearity-capturing ability of ANN. However, ANN requires larger datasets than the probabilistic reasoning models, which results in high complexity.
ANN shows great potential in diverse parameter analysis. ANN is the only model that has recently been applied for driver behaviour analysis for traffic congestion. ANN is popular in every section of transport: traffic flow prediction [61,62], congestion control [63], driver tiredness [64], and vehicle noise [65,66].

Regression Model.
Regression is a statistical supervised ML algorithm. It predicts a real-valued output from independent numerical input variables. Regression models can be further divided according to the number of input variables. The simplest regression model is linear regression with one input feature. When the feature number increases, the multiple regression model is generated.
Jiwan et al. [27] developed a multiple linear regression analysis (MLRA) model using weather data and traffic congestion data after preprocessing using Hadoop. At first, a single regression model was developed for all the variables using R. After a 3-fold reduction process, only ten variables were retained to form the final MLRA model. Zhang and Qian [22] took an interesting approach to predicting morning peak hour congestion using household electricity usage patterns. They used LASSO regression to correlate the pattern features, taking advantage of its capability to select linearly related critical features.
On the other hand, Jain et al. [33] developed both linear and exponential regression models using IBM SPSS software to find the relevant variables. The authors converted heterogeneous vehicles into passenger car units (PCU) for simplification. Three independent variables were considered to estimate origin-destination (O-D) based congestion measures. They used PCC to evaluate the correlation among the parameters. However, simply averaging O-D node parameters may not reflect the actual situation of dynamic traffic patterns.
Regression models consist of some hidden coefficients, which are determined in the training phase. e most applied regression model is the autoregressive integrated moving average (ARIMA). ARIMA has three parameters-p, d, and q. "p" is the auto regressive order that refers to how Journal of Advanced Transportation many lags of the independent variable needs to be considered for prediction. Moving average order "q" presents the lag prediction error numbers. Lastly, "d" is used to make the time-series stationary. Alghamdi et al. [67] took d as 1 as one differencing order could make the model stationary. Next, they applied the autocorrelation function (ACF) and   [7] e table accumulates the data source, scope of the study area, input and resulting parameters, and how many cognitive traffic states were considered in the studies. * 2 � free/congested, 4 � free/light/medium/severe, 5 � very free/free/light/medium/severe 8 Journal of Advanced Transportation the partial autocorrelation function (PACF) along with the minimum information criteria matrix to determine the values of p and q. ey only took the time dimension into account. However, the results inclined with the true pattern for only one week and needed to be fine-tuned considering prediction errors. Besides, the study did not consider the spatial dimension. Regression models are useful to be applied for time series problems. erefore, regression models are suitable for traffic forecasting problems. However, these models are not reliable for nonlinear, rapidly changing the multidimension dataset. e results need to be modified according to prediction errors.
However, as already and further will be discussed in this article, most of the studies used different regression models to validate their proposed model [6,11,25,68,69].
With the growth of datasets and the complexity associated with them, regression models are becoming less popular in traffic congestion prediction. Currently, regression models are frequently combined with other machine learning algorithms, e.g., ANN and kernel functions. Regression models are also applied in other sectors, including hybrid ARIMA in traffic speed prediction for a specific vehicle type (Wang et al. [70]), traffic volume prediction [71], and flow prediction applying modified ARIMA [72].

Decision Tree.
A decision tree is a model that predicts an output based on several input variables. There are two types of trees: the classification tree and the regression tree. Together, these two types are referred to as the classification and regression tree (CART). A decision tree uses the features extracted from the entire dataset. Random forest is a supervised ML classification algorithm that aggregates the results of multiple decision trees. The features are randomly selected while developing the decision trees. It uses a large number of CART decision trees, which vote for the predicted class.
Wang et al. [9] proposed a probabilistic method that exploits the information theory tools of entropy and Fano's inequality to predict road traffic patterns and the associated congestion for urban road segments with no prior knowledge of the O-D of the vehicle. They incorporated road congestion level into time series for mapping the vehicle state into the traffic conditions. As the interval influenced the predictability, an optimal segment length and velocity were found. However, with less available data, an increased number of segments increased the predictability. Another traffic parameter, travel time, was used to find CI by Liu and Wu [73]. They applied the random forest ML algorithm to forecast traffic congestion states. At first, they extracted 100 sample sets to construct 100 decision trees by using bootstrap. The number of feature attributes was determined as the square root of the total number of features. Chen et al. [16] also applied the CART method for prediction and classification of traffic congestion. The authors applied Moran's I method to analyse the spatiotemporal correlation among the traffic flows of different road networks. The model showed effectiveness compared with the SVM and K-means algorithms.
Decision tree is a simple classification model that can be applied to multifeature data, e.g., Liu and Wu [73] used weather condition, road condition, time period, and holiday as the input variables. The model's knowledge can be represented in the form of IF-THEN rules, making it easily interpretable. It should also be kept in mind that the classification results are usually binary and, therefore, not suitable where the congestion level needs to be known. Other sectors of transport where decision tree models have been applied include traffic prediction [74] and traffic signal optimisation with fuzzy logic [75].
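The three ingredients of a random forest mentioned above (bootstrap sample sets, the square-root rule for the number of features, and voting) can be sketched as follows. This is a simplified illustration and not the implementation of Liu and Wu [73]: one-split trees (stumps) stand in for full CART trees, the feature subset is drawn once per tree, and the numeric encoding of the four inputs and all data values are hypothetical.

```python
import math
import random
from collections import Counter


def train_stump(rows, labels, feature_ids):
    """Fit a one-split tree using only the given feature subset."""
    majority = Counter(labels).most_common(1)[0][0]
    best = None  # (errors, feature, threshold, left_label, right_label)
    for f in feature_ids:
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [y for row, y in zip(rows, labels) if row[f] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0]
            errors = sum(y != left_label for y in left)
            errors += sum(y != right_label for y in right)
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    if best is None:  # no split possible: always predict the majority class
        return lambda row: majority
    _, f, threshold, left_label, right_label = best
    return lambda row: left_label if row[f] <= threshold else right_label


def train_forest(rows, labels, n_trees=100, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0])
    k = max(1, round(math.sqrt(n_features)))  # square-root rule
    forest = []
    for _ in range(n_trees):
        # Bootstrap: draw as many rows as the dataset has, with replacement.
        idx = [rng.randrange(len(rows)) for _ in rows]
        feats = rng.sample(range(n_features), k)
        sample_rows = [rows[i] for i in idx]
        sample_labels = [labels[i] for i in idx]
        forest.append(train_stump(sample_rows, sample_labels, feats))
    return forest


def predict(forest, row):
    """Each tree votes; the most common class wins."""
    votes = Counter(tree(row) for tree in forest)
    return votes.most_common(1)[0][0]


# Hypothetical encoding: [weather, road condition, peak period, working day],
# where larger values are assumed to favour congestion.
rows = [
    [0, 0, 0, 0], [0, 1, 0, 0], [1, 0, 0, 0],
    [1, 1, 0, 0], [0, 0, 0, 0], [1, 0, 0, 0],
    [3, 3, 1, 1], [2, 3, 1, 1], [3, 2, 1, 1],
    [2, 2, 1, 1], [3, 3, 1, 1], [2, 3, 1, 1],
]
labels = [0] * 6 + [1] * 6  # 0 = free, 1 = congested

forest = train_forest(rows, labels)
```

A call such as `predict(forest, [3, 3, 1, 1])` returns the class chosen by the majority of the 100 trees. The binary output mirrors the limitation noted above: the vote gives a state, not a graded congestion level.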

Support Vector Machine.
The support vector machine (SVM) is a statistical machine learning method. The main idea of this model is to map nonlinear data to a higher-dimensional space where the data can be linearly classified by a hyperplane [1]. Therefore, it can be very useful in traffic flow pattern identification for traffic congestion prediction. Tseng et al. [13] determined travel speed in predicting real-time congestion by applying SVM. They used Apache Storm to process big data using spouts and bolts. Traffic and weather sensor data and nearby events collected from social media were evaluated together by the system. They categorised vehicle speed into classes and used them as labels. The speed of the previous three intervals was used to train the proposed model. However, a congestion level categorised from 0 to 100 does not convey specific knowledge of the severity of the level, especially to road users. An increase in training data raised both accuracy and computational time. This may ultimately make real-time congestion prediction difficult.
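The idea of mapping data to a higher-dimensional space in which a hyperplane separates the classes can be shown with a toy example. The sketch below does not train an SVM and is unrelated to the system of Tseng et al. [13]: the samples, the feature map phi(x) = (x, x^2), and the hand-picked hyperplane are all assumptions chosen to make the geometry visible.

```python
def feature_map(x):
    """Map a 1-D input to 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)


def linear_classifier(point, weights, bias):
    """Classify by the side of the hyperplane w . phi(x) + b = 0."""
    score = sum(w * p for w, p in zip(weights, point)) + bias
    return 1 if score > 0 else -1


# Hypothetical 1-D samples: class +1 lies on both sides of class -1,
# so no single threshold on x separates the two classes.
samples = [-3.0, -2.0, -0.5, 0.0, 0.5, 2.0, 3.0]
labels = [1, 1, -1, -1, -1, 1, 1]

# In the mapped space the hyperplane x^2 = 1 separates the classes.
# These weights are hand-picked for illustration, not learned by an SVM.
weights, bias = (0.0, 1.0), -1.0
predicted = [linear_classifier(feature_map(x), weights, bias) for x in samples]
```

A trained SVM would additionally choose the hyperplane that maximises the margin and would usually apply the mapping implicitly through a kernel function.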
Traffic flow shows different patterns based on the traffic mixture or time of the day. SVM is applied to identify the appropriate pattern. Modified SVM has also been applied in other sectors, e.g., freeway exiting traffic volume prediction [58], traffic flow prediction [76], and sustainable development of transportation and ecology [77].
Most of the studies compared their developed model with SVM [22,78,79]. Deep machine learning (DML) algorithms showed better results compared to SVM. Table 2 refers to the studies under this section.

Deep Machine Learning.
DML algorithms consist of several hidden layers to process nonlinear problems. The most significant advantage of these algorithms is that they can extract features from the input data without any prior knowledge. Unlike SML, feature extraction and model training are done together in these algorithms. DML can convert vast, continuous, and complex traffic data with a limited collection time horizon into patterns or feature vectors. Over the last few years, DML has become popular in traffic congestion prediction studies. Traffic congestion studies that used DML algorithms are shown in Figure 8 and discussed in this section.

Convolutional Neural Network.
Convolutional neural network (CNN) is a commonly applied DML algorithm in traffic engineering. Owing to the excellent performance of CNN in image processing, traffic flow data are converted into a 2-D matrix for processing when CNN is applied to traffic prediction. There are five main parts of a CNN structure in transportation: the input layer, convolution layer, pooling layer, fully connected layer, and output layer. Both the convolution and pooling layers extract important features. The depth of these two layers differs between studies. The majority of the studies converted traffic flow data into an image in the form of a 2-D matrix. In the studies performed by Ma et al. [80] and Sun et al. [45], each element of the matrix represented the average traffic speed during a specific time period. While tuning CNN parameters, they selected a convolutional filter size of (3 × 3) and max-pooling of size (2 × 2) over 3 layers according to the parameter settings of LeNet and AlexNet and a loss-of-information measurement.
In contrast, Chen et al. [68] used a five-layered convolution with a filter size of (2 × 2) and no pooling layer. The authors applied a novel method called convolution-based deep neural network modelling periodic traffic data (PCNN). The study folded the time series to generate the input, combining real-time and historical traffic data. To capture the correlation of a new time slot with the immediate past, they duplicated the congestion level of the last slot in the matrix. Zhu et al. [49] also applied five convolution-pooling layers, with sizes of (3 × 3) and (2 × 2), respectively. Along with temporal and spatial data, the authors also incorporated time interval data to produce a 3-D input matrix. Unlike these studies, Zhang et al. [6] preprocessed the raw data by performing a spatiotemporal cross-correlation analysis of traffic flow sequence data using PCC. Then, they applied a model named spatiotemporal feature selection algorithm (STFSA) on the traffic flow sequence data to select the feature subsets as the input matrix. A 2-layered CNN with the same convolutional and pooling sizes as in the previous studies was used. However, STFSA relies on its own heuristics, biases, and trade-offs and does not guarantee optimality.
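How a (3 × 3) convolution followed by (2 × 2) max-pooling shrinks a 2-D traffic matrix can be traced with a small sketch in plain Python. The 6 × 6 input is a toy stand-in for a speed matrix (rows as road segments, columns as time intervals, both assumed), and the all-ones kernel is a placeholder for filter weights that a CNN would learn during training.

```python
def conv2d_valid(matrix, kernel):
    """Slide the kernel over the matrix without padding (stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(matrix) - kh + 1
    out_w = len(matrix[0]) - kw + 1
    return [
        [
            sum(
                matrix[i + a][j + b] * kernel[a][b]
                for a in range(kh)
                for b in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def max_pool(matrix, size=2):
    """Keep the maximum of each non-overlapping size x size block."""
    return [
        [
            max(matrix[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(matrix[0]) - size + 1, size)
        ]
        for i in range(0, len(matrix) - size + 1, size)
    ]


# Toy 6 x 6 matrix: entry (i, j) is 10 * i + j.
speeds = [[10 * i + j for j in range(6)] for i in range(6)]

# Placeholder 3 x 3 filter; a trained CNN would learn these weights.
kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]

features = conv2d_valid(speeds, kernel)  # 6 x 6 -> 4 x 4
pooled = max_pool(features, size=2)      # 4 x 4 -> 2 x 2
```

Stacking several such convolution-pooling pairs, as the reviewed studies do, requires a correspondingly larger input matrix or padding, since every pair reduces the spatial size.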
CNN shows good performance where a large dataset is available. It has excellent feature learning capability and less time-consuming classification. Therefore, CNN can be applied where the available dataset can be converted into an image. CNN is applied in traffic speed prediction [81] and traffic flow prediction [6], and a modified CNN with LSTM is also applied for traffic prediction [82]. However, as mentioned above, no model depth and parameter selection strategies are available.

Recurrent Neural Network.
Recurrent neural network (RNN) is widely used in sequential traffic data processing, as it considers the influence of the related neighbour (Figure 9). Long short-term memory (LSTM) is a branch of RNN. In the hidden layer of LSTM, there is a memory block that includes four NN layers, which store and regulate the information flow. In recent years, with different data collection systems with extended intervals, LSTM has become popular. Owing to this advantage, Zhao et al. [12] developed an LSTM model consisting of three hidden layers and ten neurons using long-interval data. They set an adequate target and fine-tuned the parameters until the training model stabilized. The authors also applied the congestion index and classification (CI-C) model to classify the congestion by calculating CI from LSTM output data. Most of the studies use equal intervals of CI to divide congestion states. This study tested two additional schemes, natural breakpoint and geometric interval, and found that the last of these provided the most information according to information entropy. Lee et al. [69] applied an LSTM model of 4 layers with 100 neurons to a 3-D matrix input. The input matrix elements contained normalised speeds to shorten the training time. While eliminating the dependency, the authors found that a random distribution of target road speed and more connected roads than optimal in the matrix reduced the performance. To eliminate the limitation of temporal dependency, Yuan-Yuan et al. [79] trained their model in the batch learning approach. The instances obtained from classifying the test dataset were used to train the model in an online framework. Some studies introduced new layers to modify the LSTM model for feature extraction. Zhang et al. [83] introduced an attention mechanism layer between the LSTM and the prediction layer that enabled feature extraction from a traffic flow data sequence and captured the importance of a traffic state. Di et al. [84] introduced a convolution that provides an input to the LSTM model to form the CPM-ConvLSTM model. All the studies applied the one-hot method to convert the input parameters. Adam, stochastic gradient descent (SGD), and leakage integral echo state network (LiESN) are a few optimisation methods applied to fine-tune the outcome.
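The memory block with four NN layers can be made concrete with a single-unit LSTM step written in plain Python. The sketch assumes the standard LSTM formulation (forget, input, candidate, and output layers); the weights and the normalised speed sequence are hypothetical, and none of the reviewed models is reproduced here.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell.

    params maps each of the four layers ("forget", "input", "candidate",
    "output") to a tuple (w_x, w_h, bias).
    """
    def pre(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to admit
    g = math.tanh(pre("candidate"))  # candidate values for the cell state
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c = f * c_prev + i * g           # updated cell state (the memory)
    h = o * math.tanh(c)             # updated hidden state (the output)
    return h, c


# Hypothetical weights; in a real model they are learned from data.
params = {
    "forget": (0.5, 0.1, 0.0),
    "input": (0.6, 0.2, 0.0),
    "candidate": (0.9, 0.3, 0.0),
    "output": (0.4, 0.1, 0.0),
}

# Hypothetical sequence of normalised speeds on one road segment.
h, c = 0.0, 0.0
for speed in [0.82, 0.75, 0.40, 0.35]:
    h, c = lstm_step(speed, h, c, params)
```

Because the cell state `c` is carried forward and only rescaled by the forget layer, information from early time steps can persist over long sequences, which is the property that the reviewed studies exploit for long-interval data.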
A few studies combined RNN with other algorithms while dealing with the vast parameters of the road network. In this regard, Ma et al. [85] applied the RNN and restricted Boltzmann machine (RNN-RBM) model for networkwide spatiotemporal congestion prediction. Here, they used conditional RBM to construct the proposed deep architecture, which is designed to process the temporal sequence by providing a feedback loop between the visible layer and the hidden layer. The congestion state was analysed from traffic speed and was represented in binary format in a matrix as input. Also, Sun et al. [45] combined an RNN of three hidden layers with its two variants: LSTM and gated recurrent unit (GRU). The hidden layers included the memory block characteristic of LSTM, and the cell state and hidden state were incorporated by GRU.
As sample sizes are increasing vastly, RNN is becoming popular as a current way of modelling. RNN has a short-term memory. This characteristic of RNN helps to model nonlinear time series data. The training of RNN is also straightforward, similar to multilayer FNN. However, this training may become difficult when long-term dependencies turn the network into a deep architecture with multiple layers. For long-term dependency problems, LSTM is more suitable, as it can remember information for a long period of time. RNN also has applications in other sectors of transport, e.g., traffic passenger flow prediction [86], modified LSTM in real-time crash prediction [87], and road-network traffic prediction [88].

Extreme Learning Machine.
In recent years, a novel learning algorithm called the extreme learning machine (ELM) has been proposed for training the single layer feed-forward neural network (SLFN). In ELM, input weights and hidden biases are assigned randomly instead of being exhaustively tuned. Therefore, ELM training is fast. Taking this advantage into account, Ban et al. [19] applied the ELM model for real-time traffic congestion prediction. They determined CI using the average travel speed. A 4-fold cross-validation was done to avoid noise in raw data. The study found the optimal number of hidden nodes to be 200 in terms of computational cost. An extension of this study was done by Shen et al. [78] and Shen et al. [89] by applying a kernel-based semisupervised extreme learning machine (kernel-SSELM) model. This model can deal with the unlabelled data problem of ELM and the influence of heterogeneous data. The model integrated small-scale labelled data from transportation personnel and large-scale unlabelled traffic data to evaluate urban traffic congestion. ELM sped up the processing, while the kernel function optimized the accuracy and robustness of the whole model. However, real-time labelled data collection was quite costly in terms of human resources and working time, and the number of experts for congestion state evaluation should have been larger. Another modification of ELM was applied by Yiming et al. [20]. They applied a symmetric extreme learning machine cluster (S-ELM-cluster) model for short-term traffic congestion prediction by determining the CI. The authors divided the study area and ran the submodels simultaneously for speed.
The ELM model has the advantage of learning from large-scale data at high speed. ELM works better with labelled data. Where both labelled and unlabelled data are available, semisupervised ELM has shown good prediction accuracy, as seen in the studies. Other sectors where ELM has been applied include air traffic flow prediction [90], traffic flow prediction [91], and traffic volume interval prediction [92].
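Why ELM training is fast can be seen from a minimal sketch: the input weights and hidden biases are drawn once at random and left untouched, so only the output weights are solved for, in a single linear least-squares step. The sketch below is an assumption-laden illustration in plain Python, not the model of Ban et al. [19]; the ridge term, the ten hidden nodes, and the toy target (which is not the CI definition used in the studies) are all hypothetical choices.

```python
import math
import random


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def train_elm(inputs, targets, n_hidden=10, ridge=1e-3, seed=0):
    rng = random.Random(seed)
    n_in = len(inputs[0])
    # Input weights and hidden biases: random, never tuned.
    w = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b = [rng.uniform(-1, 1) for _ in range(n_hidden)]

    def hidden(x):
        return [
            1.0 / (1.0 + math.exp(-(sum(wi * xi for wi, xi in zip(w[j], x)) + b[j])))
            for j in range(n_hidden)
        ]

    h = [hidden(x) for x in inputs]
    # Only the output weights are learned: (H^T H + ridge I) beta = H^T y.
    hth = [
        [
            sum(row[p] * row[q] for row in h) + (ridge if p == q else 0.0)
            for q in range(n_hidden)
        ]
        for p in range(n_hidden)
    ]
    hty = [sum(row[p] * y for row, y in zip(h, targets)) for p in range(n_hidden)]
    beta = solve_linear(hth, hty)
    return lambda x: sum(bj * hj for bj, hj in zip(beta, hidden(x)))


# Toy data: normalised speed as input, a hypothetical nonlinear target.
inputs = [[i / 10] for i in range(11)]
targets = [1 - x[0] ** 2 for x in inputs]

model = train_elm(inputs, targets)
mse = sum((model(x) - y) ** 2 for x, y in zip(inputs, targets)) / len(inputs)
```

No iterative weight updates are involved, which is the source of the speed advantage described above; the cost is that the random hidden layer is not adapted to the data.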
Other than the models already discussed, Zhang et al. [93] proposed a deep autoencoder-based neural network model with a symmetric four-layer encoder and decoder to learn temporal correlations of a transportation network. The first component encoded the vector representation of historical congestion levels and their correlation, and then decoded it to build a representation of congestion levels for the future. The second component of DCPS used two dense layers, which converted the output from the decoder to calculate a vector representation of congestion level. However, the process lost information as the congestion level of all the pixels was averaged. This approach needed many iterations and was computationally expensive, as all the pixels were considered regardless of roads.

Another study applied a generalised version of the recurrent neural network named the recursive neural network. The difference between the two is that in recurrent NN, weights are shared along the data sequence, whereas recursive NN is a single neuron model in which weights are shared at every node. Huang et al. [94] applied a recursive NN algorithm named echo state network (ESN). This model consists of an input layer, reservoir network, and output layer. The reservoir layer constructs the rules that connect the prediction origin and the forecasting horizon. As the study covered a large area with a vast number of links, they simplified the training rule complexity by applying recursive NN. Table 3 summarises some studies.

Discussion and Research Gaps
Research in traffic congestion prediction is increasing exponentially. Of the two data sources, most of the studies used stationary sensor/camera data. Although sensor data cannot capture dynamic traffic changes, frequent changes in source make it complicated to evaluate the flow patterns for probe data [95]. The data collection horizon is an important factor in traffic congestion studies. A small horizon of a few days [3] cannot capture the actual congestion situation, as traffic is dynamic. Other studies that used data for a few months showed the limitation of seasonality [22,67]. Surrounding conditions play an important role in traffic congestion, yet only a few studies focused on these factors. Two studies considered social media contributions as input parameters [7,13], and five considered weather conditions [12,13,27,34,73]. Events, e.g., national events, school holidays, and popular sports events, play a big role in traffic congestion. For example, Melbourne, Australia, has two public holidays before and during the two most popular sports events of the country. The authorities close a few traffic routes to manage the traffic and the parade, resulting in traffic congestion. Therefore, more focus must be put on including these factors while forecasting.
Dealing with missing data is a challenge in data processing. Some studies excluded the respective data altogether [29], others applied different methods to retrieve the data [59,85], and some replaced them with other data [45]. Missing data imputation can be a useful research scope in transportation engineering.
Machine learning algorithms, especially DML models, have developed over time. This has had a clear impact on the rise of their implementation in traffic congestion forecasting (Figure 10).
Probabilistic reasoning algorithms were mostly applied for a part of the prediction model, e.g., map matching and optimal feature number selection. Fuzzy logic is the most widely used algorithm in this class of algorithms. From other branches, ANN and RNN are the most applied models. Most of the studies that applied hybrid or ensemble models belong to the probabilistic and shallow learning classes. Only two studies applied hybrid deep learning models while predicting networkwide congestion. Tables 4-6 summarize the advantages and weaknesses of the algorithms of the different branches.
Among all DML models, RNN is more suitable for time series prediction. In a few studies, RNN performed better than CNN as the gap between the traffic speeds in different classes was very small [12,69]. However, because research in the traffic congestion field is still limited, many new ML algorithms are yet to be applied. SML models showed better results than DML while forecasting traffic congestion in the short term, as SML can process linearity efficiently and linear features contribute more to traffic flow in the short term. All the short-term forecasting studies discussed in this article that applied SML showed promising results. At the same time, DML models showed good accuracy as these models can handle both linear and nonlinear features efficiently. Besides, real-time congestion prediction cannot afford high computation time. Therefore, models taking a short computational time are more effective in this case.

Future Direction
Traffic congestion prediction is a promising area of research. Therefore, there are multiple directions for future research.
Numerous forecasting models have already been applied in road traffic congestion forecasting. However, with the newly developed forecasting models, there is more scope to make the congestion prediction more precise. Also, in this era of information, applying the newly developed forecasting models to the increased volume of available traffic data can improve the prediction accuracy. The semisupervised model was applied only for the ELM model. Other machine learning algorithms should be explored for using both labelled and unlabelled data for higher prediction accuracy. Also, a limited number of studies have focused on real-time congestion forecasting. Future research should pay attention to the real-time traffic congestion estimation problem.
Another future direction can be focusing on the level of traffic congestion. A few studies have divided the traffic congestion into a few states. However, for better traffic management, knowing the grade of congestion is essential. Therefore, future research should focus on this. Besides, most studies focused on only one traffic parameter to forecast congestion. Giving attention to more than one parameter and combining the results during congestion forecasting can be an excellent future direction for making the forecasts more reliable.

Methodology: Artificial neural network
Advantages:
(i) It is an adaptive system that can change structure based on inputs during the learning stage [96].
(ii) With features defined early, FNN shows excellent efficiency in capturing the nonlinear relationship of data.
Disadvantages:
(i) BPNN requires vast data for training the model due to the parameter complexity resulting from its parameter nonsharing technique [97].
(ii) The training convergence rate of the model is slow.

Methodology: Regression model
Advantages:
(i) Models are suitable for time series problems.
(ii) Traffic congestion forecasting problems can be easily solved.
(iii) ARIMA can increase accuracy by maintaining minimum parameters.
(iv) Minimum complexity in the model.
Disadvantages:
(i) Linear models cannot address nonlinearity, making it harder to solve complex prediction problems.
(ii) Linear models are sensitive to outliers.
(iv) ARIMA cannot deal with multifeature datasets efficiently.
(v) ARIMA cannot capture the rapidly changing traffic flow [8].

Methodology: Support vector machine
Advantages:
(i) It is efficient in pattern recognition and classification.
(ii) A universal learning algorithm that can diminish the classification error probability by reducing the structural risk [1].
(iii) It does not need a vast sample size.
Disadvantages:
(i) An improperly chosen kernel function may result in an inaccurate outcome.
(ii) Unstable traffic flow requires improved prediction accuracy of SVM.
(iii) It takes high computational time and memory.

Methodology: Fuzzy logic
Advantages:
(ii) It can portray more than two states.
(iii) As it does not need an exact crisp input, it can deal with uncertainty.
Disadvantages:
(ii) Traffic pattern recognition capability is not as durable as ML algorithms.
(iii) Traffic state may not match the actual traffic state as the outcome is not exact.

Methodology: Hidden Markov model
Advantages:
(i) The model can overcome noisy measurements.
(ii) Can efficiently learn from non-preprocessed data.
(iii) Can evaluate multiple hypotheses of the actual mapping simultaneously.
Disadvantages:
(i) Accuracy decreases with scarce temporal probe trajectory data.
(ii) Not suitable in case of missing dataset.

Methodology: Gaussian mixture model
Advantages:
(i) Can model the traffic parameter distribution over a period as a mixture regardless of the traffic state.
(ii) Can overcome the limitation of not being able to account for multimodal output by a single Gaussian process.
Disadvantages:
(i) Optimization algorithm used with GMM must be chosen cautiously.
(ii) Results may show wrong traffic patterns due to local optima limitation and lack of traffic congestion threshold knowledge of the optimisation algorithm.

Methodology: Bayesian network
Advantages:
(i) It can understand the underlying relationship between random variables.
(ii) It can model and analyse traffic parameters between adjacent road links.
(iii) The model can work with incomplete data.
Disadvantages:
(ii) The model performs poorly with the increment in data.
(iii) The model represents one-directional relation between variables only.

Conclusions
Traffic congestion prediction has been receiving more attention over the last few decades. With the development of infrastructure, every country is facing traffic congestion problems. Therefore, forecasting congestion can allow authorities to make plans and take the necessary actions to avoid it. The development of artificial intelligence and the availability of big data have led researchers to apply different models in this field. This article divided the methodologies into three classes. Although probabilistic models are simple in general, they become complex when different factors that affect traffic congestion, e.g., weather, social media, and events, are considered. Machine learning, especially deep learning, has the advantage in this case. Therefore, deep learning algorithms have become more popular with time, as they can assess a large dataset. However, a wide range of machine learning algorithms are yet to be applied. Therefore, a vast opportunity for research in the field of traffic congestion prediction still remains.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Table 6: The strengths and weaknesses of the models of deep machine learning.

Methodology: Convolutional neural networks
Advantages:
(i) Capable of learning features from local connections and composing them into high-level representation.
(ii) Classification is less time-consuming.
(iii) Can automatically extract features.
Disadvantages:
(i) Computationally expensive as a huge kernel is needed for feature extraction.
(ii) A vast dataset is required.
(iii) Traffic data needs to be converted to an image.
(iv) No strategies are available for CNN model depth and parameter selection.

Methodology: Recurrent neural network
Advantages:
(i) Shows excellent performance in processing sequential data flow.
(ii) Efficient in sequence classification.
(iii) Modified models are available to deal with an unlabelled data problem.
Disadvantages:
(i) Long-term dependency results in bad performance.
(ii) Unlabelled data problem.
(iii) May produce less accurate results.