A Bayesian Network Model for Origin-Destination Matrices Estimation Using Prior and Some Observed Link Flows

This paper presents a Bayesian network model for estimating origin-destination (OD) matrices. Most existing Bayesian methods rely on a prior OD matrix, which is often difficult to obtain. Because transportation systems typically store large amounts of historical link flows, we propose a Bayesian network model that uses these prior link flows instead and updates the estimates with a set of newly observed link flows. Under a normal distribution assumption, the model accounts for the level of total traffic flow, the variability of link flows, and violations of the traffic flow conservation law. It provides both point estimates and the corresponding probability intervals. To solve the model, a specific procedure that avoids matrix inversion is proposed. Finally, a numerical example illustrates the method; the results show that it is accurate and practically applicable.


Introduction
Information about traffic demand, commonly expressed as origin-destination (OD) matrices, has traditionally been used by transportation planning agencies to evaluate the impact of strategic transportation plans. Real-time OD matrices are also essential for real-time traffic applications, especially in intelligent transportation systems (ITS), such as real-time route guidance via dynamic traffic assignment or the evaluation of ITS deployment alternatives [1,2].
Various methods have been proposed to estimate OD matrices from aggregate data such as OD demand counts and/or traffic counts observed on links. Traffic counts are attractive because they are cheap, easy to collect, and immediately available. However, they do not determine a unique OD matrix: in large-scale transportation networks the number of OD pairs is much larger than the number of links, so infinitely many solutions satisfy the conservation law.
To obtain a unique solution that is close to the actual one, more information is needed. Normally, a prior OD matrix is used, such as an out-of-date or subjectively guessed OD matrix. The methods for estimating OD matrices can be classified as (1) least squares [3][4][5][6][7][8][9] and generalized least squares [10][11][12] methods, (2) entropy or information based methods [13,14], and (3) statistical methods.
Providing variability information about the traffic flow estimates is the most important advantage of the statistical methods. Other methods normally give only point values of the OD and link flows, while statistical methods can also provide the corresponding probability intervals. The statistical methods can be categorized as follows. (1) Classical methods [15][16][17]: the traffic flows are assumed to be multivariate random variables from some parametric family, such as Poisson, Gamma, or multivariate normal. The problem then reduces to estimating the parameters and becomes a standard statistical problem. (2) Bayesian methods [18][19][20][21]: these methods also consider parametric families of distributions, but the parameters are themselves treated as random variables. Among the Bayesian methods, Bayesian networks [22][23][24] make the relationships among all the variables (link flows and OD flows) explicit and thereby simplify the calculation.
The main difference between Bayesian methods and classical statistical methods is whether prior information (historical information or experience) is used. In Bayesian methods, the prior distribution of some parameters or variables is determined from prior information. Then, by updating with the sample information (observed information), we derive the posterior distribution, which is the fundamental inferential tool of Bayesian methods.
Generally, the quality of the prior information affects the accuracy of Bayesian estimates. The prior information used by almost all existing Bayesian methods for estimating OD matrices is a prior OD matrix. However, it is difficult to guarantee the accuracy of a prior OD matrix that is outdated or subjectively guessed. Moreover, in some cases, especially in a newly developed city, a prior OD matrix cannot be obtained at all.
In reality, large amounts of historical link flow data are usually stored in cities' transportation system databases. Compared with a prior OD matrix, prior (historical) link flows are more accurate, as they were obtained by traffic detectors or manual investigation. Therefore, to estimate OD matrices, this paper proposes a Bayesian network (BN) method that uses prior link flows and a set of newly observed link flows. From the prior link flows, we derive the prior distribution of link flows and OD flows. Then, by updating with a set of observed link flows, we modify the means and reduce the variances of the remaining variables. These updated means and variances give the posterior distribution of all the variables, from which both point estimates and the corresponding probability intervals can be provided.
Note that the level of total traffic flow varies randomly and deterministically in similar situations (vacation, peak hour, special weather conditions, etc.) [20,24]. The proposed BN model therefore also considers the level of total traffic flow, which is very useful for many real-time traffic applications. In addition, the BN model considers the variability of link flows and the violation of the conservation law.
The rest of the paper is organized as follows. Section 2 briefly introduces Bayesian networks and Gaussian Bayesian networks. Section 3 describes the proposed BN model for estimating OD matrices and its main assumptions. Section 4 proposes a specific procedure for estimating OD matrices using the Bayesian network model. Section 5 provides a numerical example to illustrate the proposed model and clarify some of its implementation details. Finally, Section 6 provides some conclusions.

Bayesian Network and Gaussian Bayesian Network
In this section, we briefly review the Bayesian network and Gaussian Bayesian network, which are the basic tools of this paper.

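Definition 1 (Bayesian network). A Bayesian network is a pair (G, P), where G is a directed acyclic graph (DAG) defined on a set of nodes $X = \{X_1, \ldots, X_n\}$ (the random variables) and P is a set of $n$ conditional probability densities (CPDs), one for each node given its parents in G. The set P defines the associated joint probability density (JPD) as
$$p(x_1, \ldots, x_n) = \prod_{i=1}^{n} p(x_i \mid \pi_i), \quad (1)$$
where $\pi_i$ denotes the values of the parents of $X_i$ in G.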
The graph G contains all the qualitative information about the relationships among the variables. As a supplement, the probabilities in P quantify the qualitative information in graph G. In Bayesian networks, the factorization of the JPD implied by (1) is normally very simple, and the conditional independence relations among variables can be inferred directly from the graph G, which makes evidence propagation easy. Because of these advantages, Bayesian network models have been widely used to solve a large variety of practical problems [25,26].
Bayesian networks can be applied to many distributions. For the sake of illustration, we consider the important particular case of Gaussian Bayesian networks, in which the traffic flows are assumed to follow a normal distribution. A normal distribution for traffic flows is reasonable, because these random variables are the sum of a great number of independent Bernoulli experiments in which the users decide where to travel and which routes to choose. Gaussian Bayesian networks have been used frequently in the literature [24,27].
Definition 2 (Gaussian Bayesian network). A Bayesian network (G, P) is said to be a Gaussian Bayesian network if and only if the joint probability distribution (JPD) associated with its variables X is a multivariate normal distribution, $N(\mu, \Sigma)$, that is, with joint probability density function
$$f(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\Bigl\{-\tfrac{1}{2}(x-\mu)^{\top} \Sigma^{-1} (x-\mu)\Bigr\}, \quad (2)$$
where $\mu$ is the mean vector, $\Sigma$ is the $n \times n$ covariance matrix, $|\Sigma|$ is the determinant of $\Sigma$, and $\mu^{\top}$ is the transpose of $\mu$.
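The JPD of the variables in a Gaussian Bayesian network can be specified as in (1) by the product of a set of CPDs, each of which is a normal density:
$$f(x_i \mid \pi_i) \sim N\Bigl(\mu_i + \sum_{j:\,X_j \in \Pi_i} \beta_{ij}(x_j - \mu_j),\; v_i\Bigr), \quad (3)$$
where $\beta_{ij}$ is the regression coefficient of $X_j$ in the regression of $X_i$ on its parents $\Pi_i$. The conditional variance of $X_i$ is
$$v_i = \Sigma_i - \Sigma_{i\Pi_i} \Sigma_{\Pi_i}^{-1} \Sigma_{i\Pi_i}^{\top}, \quad (4)$$
where $\Sigma_i$ is the unconditional variance of $X_i$, $\Sigma_{i\Pi_i}$ is the covariance matrix between $X_i$ and the variables in $\Pi_i$, and $\Sigma_{\Pi_i}$ is the covariance matrix of $\Pi_i$.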
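As a small numerical illustration of the regression coefficients and conditional variance of a node given its parents, the following sketch computes both from a joint covariance matrix. The function name `conditional_params` and the 3-variable covariance matrix are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def conditional_params(sigma, i, parents):
    """Regression coefficients and conditional variance of X_i given its parents.

    sigma   : joint covariance matrix of all variables
    i       : index of the child node
    parents : list of indices of the parent nodes
    """
    sigma = np.asarray(sigma, dtype=float)
    s_ip = sigma[i, parents]                   # covariances between X_i and its parents
    s_pp = sigma[np.ix_(parents, parents)]     # covariance matrix of the parents
    beta = np.linalg.solve(s_pp, s_ip)         # regression coefficients (s_pp is symmetric)
    v = sigma[i, i] - s_ip @ beta              # conditional variance of X_i
    return beta, v

# Hypothetical covariance matrix of (X_0, X_1, X_2); X_2 has parents X_0 and X_1.
sigma = np.array([[4.0, 0.0, 2.0],
                  [0.0, 1.0, 1.0],
                  [2.0, 1.0, 3.0]])
beta, v = conditional_params(sigma, 2, [0, 1])
print(beta, v)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a design choice of this sketch; it is numerically preferable but gives the same result as the closed-form expression.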

Proposed Bayesian Network Model
Given the advantages of Bayesian networks reviewed in Section 2, this section proposes a Bayesian network (BN) model to reproduce the probabilistic structure of link and OD flows.

Model Assumptions.
Assuming we have some prior (historical) link flows, in order to give the prior distribution of the link flows, we make the following assumptions.
Assumption 3. The link flows are given by
$$V = KU + \eta, \quad (5)$$
where $U$ is a normal random variable that measures the level of total mean flow. It reflects that traffic flows vary randomly and deterministically in similar situations (vacation, peak hour, special weather conditions, etc.). $K$ is a vector whose elements measure the relative weights of the link flows with respect to the total traffic flow; $\eta$ is a vector of independent normal random variables with zero mean; and $\eta_a$ measures the discrepancy of the flow of link $a$ with respect to its mean.
Assumption 4. The variable $(U, \eta)$ is a normal random variable with mean $(\mu_U, 0)$ and diagonal variance-covariance matrix $\Sigma_{(U,\eta)}$, whose first diagonal entry is the variance $\sigma_U^2$ of $U$.
Note that traffic flows vary randomly and deterministically in similar situations (vacation, peak hour, special weather conditions, etc.) [20,24]. Assumption 3 takes this into account: the distribution of $U$ varies with the situation. Based on prior link flows and considering the similar situation, the distribution of $U$ and the initial vector $K$ are determined. Then we can easily derive the prior distribution of link flows, as shown later. Assumption 4 is a normality assumption, which is also adopted in Maher [18], Hazelton [20], Castillo et al. [24], and so forth.
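To give the prior distribution of OD flows, we first consider the well-known conservation law equation:
$$v_a = \sum_{k} \sum_{r} \delta_{ar}^{k}\, p_{r}^{k}\, t_k, \quad (6)$$
where $t_k$ and $v_a$ are the flows of OD pair $k$ and link $a$, respectively; $\delta_{ar}^{k}$ is the incidence element, which takes value 1 if link $a$ belongs to route $r$ of OD pair $k$ and 0 otherwise; and $p_{r}^{k}$ is the proportion of users from OD pair $k$ choosing route $r$. In this paper, the route choice proportions are defined by a logit model:
$$p_{r}^{k} = \frac{\exp(-\theta c_{r}^{k})}{\sum_{s} \exp(-\theta c_{s}^{k})}, \quad (7)$$
where $\theta$ is a parameter measuring travelers' sensitivity to the cost difference between routes and $c_{r}^{k}$ is the cost associated with route $r$ of OD pair $k$.
Equation (6) can be written as
$$v_a = \sum_{k} \Bigl(\sum_{r} \delta_{ar}^{k}\, p_{r}^{k}\Bigr) t_k. \quad (8)$$
Set $d_{ak} = \sum_{r} \delta_{ar}^{k} p_{r}^{k}$ to represent the proportion of users from OD pair $k$ choosing link $a$. Then (8) can be rewritten in matrix form as
$$V = DT. \quad (9)$$
Matrix $D$ is not necessarily invertible because it is not necessarily square. So we premultiply (9) by $D^{\top}$:
$$D^{\top} V = D^{\top} D\, T. \quad (10)$$
If the matrix $D^{\top}D$ is of full rank, it is invertible. Then, (10) can be written as
$$T = (D^{\top}D)^{-1} D^{\top} V. \quad (11)$$
Set $\beta = (D^{\top}D)^{-1}D^{\top}$; according to (11), we make the following assumption.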
Assumption 5. The OD flows are given by
$$T = \beta V + \varepsilon, \quad (12)$$
where $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_m)$ are mutually independent normal random variables with mean $E(\varepsilon_k)$ and variance $\sigma_{\varepsilon_k}^2$. The variables $\varepsilon$ represent OD flows apart from those using links of the considered network. By setting all the components of $\varepsilon$ to zero, or by evaluating their values, the conservation law equation can be satisfied.
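The construction of the link-choice matrix and of the regression coefficient matrix used in Assumption 5 can be sketched numerically as follows. The three-link, two-OD-pair network, the route proportions, and the variable names (`delta`, `p`, `D`, `beta`) are hypothetical and chosen only for illustration; the sketch assumes the convention that rows of `D` index links and columns index OD pairs.

```python
import numpy as np

# Hypothetical toy network: 3 links, 2 OD pairs (illustrative values only).
# delta[k][a, r] = 1 if link a belongs to route r of OD pair k, else 0.
delta = [
    np.array([[1, 0],
              [0, 1],
              [0, 0]]),    # OD pair 0: route 0 uses link 0, route 1 uses link 1
    np.array([[0],
              [1],
              [1]]),       # OD pair 1: a single route using links 1 and 2
]
# p[k][r]: proportion of OD pair k choosing route r (each vector sums to 1).
p = [np.array([0.6, 0.4]), np.array([1.0])]

# D[a, k] = sum_r delta[k][a, r] * p[k][r]: share of OD pair k using link a.
D = np.column_stack([delta_k @ p_k for delta_k, p_k in zip(delta, p)])

# beta = (D^T D)^{-1} D^T, computed with a linear solve instead of an inverse.
# This requires D^T D to be of full rank.
beta = np.linalg.solve(D.T @ D, D.T)

t_true = np.array([100.0, 50.0])   # hypothetical OD flows
v = D @ t_true                     # link flows implied by the conservation law
t_rec = beta @ v                   # OD flows recovered from the link flows
print(v, t_rec)
```

When the link flows satisfy the conservation law exactly, applying `beta` recovers the OD flows; with noisy link flows it returns the least-squares fit instead.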

The Complete Model.
Based on Definitions 1 and 2, in order to complete our BN model, we need to define an associated graph G. For example, consider the simple network shown on the left of Figure 1, which has 2 nodes, 2 links, and 1 OD pair, 1-2. The right of Figure 1 shows the associated Bayesian network. Each link flow node $v_a$ has as parents the node $U$ and the corresponding node $\eta_a$. Each OD flow node $t_k$ has as parents the corresponding link flow nodes $v_a$ and the node $\varepsilon_k$.
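Then, according to Assumptions 3 and 4, the variance-covariance matrix of V is
$$\Sigma_{VV} = (K \mid I)\, \Sigma_{(U,\eta)}\, (K \mid I)^{\top} = \sigma_U^2 K K^{\top} + D_\eta, \quad (13)$$
where $\Sigma_{(U,\eta)}$ and $D_\eta$ are diagonal matrices, $\sigma_U^2$ is the variance of $U$, and $D_\eta$ is the variance-covariance matrix of $\eta$.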
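Based on Assumption 5, we get
$$\begin{pmatrix} V \\ T \end{pmatrix} = \begin{pmatrix} I & 0 \\ \beta & I \end{pmatrix} \begin{pmatrix} V \\ \varepsilon \end{pmatrix}.$$
Then, the mean $E[(V, T)]$ is
$$E\begin{pmatrix} V \\ T \end{pmatrix} = \begin{pmatrix} E(U)\,K \\ E(U)\,\beta K + E(\varepsilon) \end{pmatrix},$$
and the variance-covariance matrix of $(V, T)$ is
$$\Sigma_{(V,T)} = \begin{pmatrix} \Sigma_{VV} & \Sigma_{VV}\beta^{\top} \\ \beta\Sigma_{VV} & \beta\Sigma_{VV}\beta^{\top} + D_\varepsilon \end{pmatrix},$$
where $\Sigma_{VV}$ is the variance-covariance matrix of $V$ and $D_\varepsilon$ is the variance matrix of $\varepsilon$.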
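In summary, all random variables involved in our model are related by the linear expression:
$$\begin{pmatrix} U \\ \eta \\ \varepsilon \\ V \\ T \end{pmatrix} = A \begin{pmatrix} U \\ \eta \\ \varepsilon \end{pmatrix}, \qquad A = \begin{pmatrix} 1 & 0 & 0 \\ 0 & I & 0 \\ 0 & 0 & I \\ K & I & 0 \\ \beta K & \beta & I \end{pmatrix}.$$
The mean $E[(U, \eta, \varepsilon, V, T)]$ is
$$\mu = \bigl(E(U),\; 0,\; E(\varepsilon),\; E(U)K,\; E(U)\beta K + E(\varepsilon)\bigr)^{\top},$$
and the variance-covariance matrix is
$$\Sigma = A\, \Sigma_{(U,\eta,\varepsilon)}\, A^{\top}, \qquad \Sigma_{(U,\eta,\varepsilon)} = \mathrm{diag}\bigl(\sigma_U^2,\, D_\eta,\, D_\varepsilon\bigr).$$
Then, the prior distribution (joint probability density function) of all the variables can be given as
$$(U, \eta, \varepsilon, V, T) \sim N(\mu, \Sigma).$$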
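A minimal sketch of how the prior mean vector and variance-covariance matrix of the link and OD flows could be assembled is given below. It assumes the reading used in this section, namely that the link-flow noise has variances of the form (coefficient of variation times mean link flow) squared and that the OD noise is independent of the link flows. The function name `prior_moments`, its arguments, and all numerical values are hypothetical.

```python
import numpy as np

def prior_moments(K, mean_u, var_u, beta, nu=0.1, mean_eps=None, var_eps=None):
    """Prior mean and covariance of the stacked vector (V, T).

    Assumed model: V = K*U + eta and T = beta @ V + eps, with U, eta, eps
    mutually independent and Var(eta_a) = (nu * E[v_a])**2.
    """
    K = np.asarray(K, dtype=float)
    beta = np.asarray(beta, dtype=float)
    m = beta.shape[0]                                   # number of OD pairs
    mean_eps = np.zeros(m) if mean_eps is None else np.asarray(mean_eps, float)
    var_eps = np.zeros(m) if var_eps is None else np.asarray(var_eps, float)

    mean_v = K * mean_u                                 # prior mean of link flows
    mean_t = beta @ mean_v + mean_eps                   # prior mean of OD flows
    d_eta = np.diag((nu * mean_v) ** 2)                 # diagonal link-noise variances
    s_vv = var_u * np.outer(K, K) + d_eta               # covariance of V
    s_vt = s_vv @ beta.T                                # covariance between V and T
    s_tt = beta @ s_vt + np.diag(var_eps)               # covariance of T
    mean = np.concatenate([mean_v, mean_t])
    cov = np.block([[s_vv, s_vt], [s_vt.T, s_tt]])
    return mean, cov

# Hypothetical two-link, one-OD-pair network with link shares 0.6 and 0.4.
K = np.array([0.6, 0.4])
D = K.reshape(2, 1)
beta = np.linalg.solve(D.T @ D, D.T)
mean, cov = prior_moments(K, mean_u=100.0, var_u=100.0, beta=beta)
print(mean)
print(cov)
```

With zero OD noise, as in this call, the resulting joint covariance is singular, which is expected because the OD flow is then a deterministic function of the link flows.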

Estimating OD Matrices Using the Proposed BN Model
In this section, using the proposed BN model, we describe how to estimate the OD matrices when some new observed link flows are available. Since we have obtained the prior distribution of all the variables, we can use the following equations to update the mean and the covariance matrix of the variables [23,24] based on some observed variables. Note that one only needs to consider the unobserved variables conditioned on the observed variables and then update the expected values and covariances of the remaining variables. These equations are
$$\mu_{Y \mid Z=z} = \mu_Y + \Sigma_{YZ} \Sigma_{ZZ}^{-1} (z - \mu_Z), \quad (21)$$
$$\Sigma_{Y \mid Z=z} = \Sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}, \quad (22)$$
$$\mu_{Z \mid Z=z} = z, \quad (23)$$
$$\Sigma_{Z \mid Z=z} = 0, \quad (24)$$
where Y and Z are the sets of unobserved and observed variables, respectively; $\mu_Y$ and $\Sigma_{YY}$ are the mean vector and covariance matrix of Y; $\mu_Z$ and $\Sigma_{ZZ}$ are the mean vector and covariance matrix of Z; and $\Sigma_{YZ} = \Sigma_{ZY}^{\top}$ is the covariance matrix of Y and Z.
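The Gaussian updating rules can be applied either to all observed variables at once or to one observed variable at a time, in which case the covariance of the observed part reduces to a scalar and no matrix inversion is needed. The sketch below compares both on a hypothetical 3-variable example; the function names and numbers are illustrative assumptions. Skipping variables whose variance is already zero is a choice of this sketch, made to cover variables that were fully determined by earlier evidence.

```python
import numpy as np

def condition_on_scalar(mu, sigma, idx, value, tol=1e-12):
    """Condition a Gaussian (mu, sigma) on one observed variable; no matrix inverse."""
    mu = np.array(mu, dtype=float)
    sigma = np.array(sigma, dtype=float)
    var_z = sigma[idx, idx]
    if var_z <= tol:                       # already known exactly: nothing to update
        return mu, sigma
    cov_yz = sigma[:, idx].copy()          # covariances of all variables with the observed one
    mu = mu + cov_yz * (value - mu[idx]) / var_z
    sigma = sigma - np.outer(cov_yz, cov_yz) / var_z
    mu[idx] = value                        # observed mean equals the observed value
    sigma[idx, :] = 0.0                    # observed variances and covariances are null
    sigma[:, idx] = 0.0
    return mu, sigma

def condition_batch(mu, sigma, obs_idx, obs_values):
    """Condition on all observed variables at once (needs a linear solve)."""
    y = [i for i in range(len(mu)) if i not in obs_idx]
    s_yz = sigma[np.ix_(y, obs_idx)]
    s_zz = sigma[np.ix_(obs_idx, obs_idx)]
    gain = np.linalg.solve(s_zz, s_yz.T).T          # Sigma_YZ Sigma_ZZ^{-1}
    mu_y = mu[y] + gain @ (obs_values - mu[obs_idx])
    sigma_yy = sigma[np.ix_(y, y)] - gain @ s_yz.T
    return mu_y, sigma_yy

# Hypothetical prior for three variables; variables 0 and 2 are observed.
mu0 = np.array([10.0, 20.0, 30.0])
sigma0 = np.array([[4.0, 1.0, 2.0],
                   [1.0, 3.0, 1.0],
                   [2.0, 1.0, 5.0]])
mu_seq, sigma_seq = mu0, sigma0
for idx, value in [(0, 12.0), (2, 27.0)]:
    mu_seq, sigma_seq = condition_on_scalar(mu_seq, sigma_seq, idx, value)
mu_y, sigma_yy = condition_batch(mu0, sigma0, [0, 2], np.array([12.0, 27.0]))
print(mu_seq, sigma_seq[1, 1], mu_y, sigma_yy)
```

Both routes give the same posterior for the unobserved variable, which is why the one-at-a-time update can replace the batch formula.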
Given a set of evidential nodes Z whose values are known to be $Z = z$, we can derive the mean vector and covariance matrix of the unobserved nodes in Y by (21) and (22). Thus, the conditional distribution of Y can be obtained. Equations (23) and (24) state that the expected values of the observed variables coincide with their observed values and that their variances and covariances are null. To simplify the calculation, we can use an incremental method, that is, updating the evidence from Z one variable at a time. Matrix inversion is then unnecessary, because the matrix degenerates to a scalar: $\Sigma_{YZ}$ is a column vector and $\Sigma_{ZZ}$ is a scalar (i.e., $\Sigma_{ZZ}^{-1} = 1/\sigma_{ZZ}$, where $\sigma_{ZZ}$ is the variance of the single observed variable). If we want the point estimation as well as the corresponding probability intervals, we can solve the following maximum posterior distribution problem to get the point estimation, whose results are normally the conditional means:
$$\max_{(T, V)} f(T, V \mid Z = z), \quad (25)$$
where Z is the set of the observed variables, including the observed link flows and/or OD flows. In summary, the specific procedure for estimating OD matrices and the unobserved link flows is given as follows.
Step 0. Initialize the model. According to Assumptions 3 and 4, based on prior (historical) link flows, we determine the distribution of $U$ and the initial matrix K. Then we obtain the initial link flows $V = (v_1, v_2, \ldots, v_n) = K E(U)$. The initial route choice proportion $p_r^k$ is then calculated as follows:
$$c_a = h_a \Bigl[1 + \alpha_a \Bigl(\frac{v_a}{q_a}\Bigr)^{\gamma_a}\Bigr], \quad (26)$$
$$c_r^k = \sum_{a} \delta_{ar}^{k} c_a, \quad (27)$$
$$p_r^k = \frac{\exp(-\theta c_r^k)}{\sum_{s} \exp(-\theta c_s^k)}, \quad (28)$$
where (26) is the link cost function, in which $h_a$ is the cost associated with free flow conditions, $q_a$ is the link capacity, and $\alpha_a$ and $\gamma_a$ are constants defining how the cost increases with traffic flow; (27) is the route cost function; and (28) calculates the route choice proportion defined in (7).
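Step 1. Solve the BN model. According to the model assumptions, using the initial route choice proportion matrix P, we can get the prior distribution of traffic flows (prior means and variances) using the following formulas:
$$\beta = (D^{\top} D)^{-1} D^{\top}, \quad (29)$$
$$E(V) = E(U)\, K, \quad (30)$$
$$E(T) = \beta E(V) + E(\varepsilon), \quad (31)$$
$$D_\eta = \mathrm{diag}\bigl((\nu E(v_a))^2\bigr), \quad (32)$$
$$\Sigma_{VV} = \sigma_U^2 K K^{\top} + D_\eta, \quad (33)$$
$$\Sigma_{VT} = \Sigma_{VV} \beta^{\top}, \quad (34)$$
$$\Sigma_{TV} = \beta \Sigma_{VV}, \quad (35)$$
$$\Sigma_{TT} = \beta \Sigma_{VV} \beta^{\top} + D_\varepsilon, \quad (36)$$
where (29) calculates the regression coefficient matrix given in (12). Equations (30) and (31) calculate the means of V and T given in (17). Equation (32) defines the diagonal variance matrix of $\eta$; that is, $\mathrm{Var}(\eta_a) = (E(v_a) \times \nu)^2$, where $\nu$ is the coefficient of variation. Equations (33) to (36) define the variance-covariance matrix in (19).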
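Step 2. Update the observed link flows, using the formulas
$$\mu_{Y \mid Z=z} = \mu_Y + \Sigma_{YZ} \Sigma_{ZZ}^{-1} (z - \mu_Z), \quad (37)$$
$$\Sigma_{Y \mid Z=z} = \Sigma_{YY} - \Sigma_{YZ} \Sigma_{ZZ}^{-1} \Sigma_{ZY}, \quad (38)$$
$$\mu_{Z \mid Z=z} = z, \quad (39)$$
$$\Sigma_{Z \mid Z=z} = 0, \quad (40)$$
$$(\hat{T}, \hat{V}) = E[(T, V) \mid Z = z], \quad (41)$$
where (37) and (38) update the means and variance-covariance matrix of the unobserved variables, and Y and Z refer to the unobserved and observed components of (T, V), respectively. Equations (39) and (40) state that the expected values of the observed variables are their observed values and that their variances and covariances are zero, as given in (23) and (24). Equation (41) takes the conditional means as the point estimation of the OD and link flows, as the result of the maximum posterior distribution problem given in (25).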
Step 3. Calculate the new route choice proportions. Since the link flows are updated in Step 2, the route choice proportions also need to be updated. Given that the matrix T is obtained by (41), the new route choice proportion $p_r^{k*}$ is calculated using the following expressions:
$$v_a = \sum_{k} \sum_{r} \delta_{ar}^{k}\, p_r^{k}\, t_k, \quad (42)$$
$$c_a = h_a \Bigl[1 + \alpha_a \Bigl(\frac{v_a}{q_a}\Bigr)^{\gamma_a}\Bigr], \quad (43)$$
$$c_r^k = \sum_{a} \delta_{ar}^{k} c_a, \quad (44)$$
$$p_r^{k*} = \frac{\exp(-\theta c_r^k)}{\sum_{s} \exp(-\theta c_s^k)}, \quad (45)$$
where (42) is the conservation law equation given in (6).
Step 4. Test convergence. If $\sum_{k,r} (p_r^k - p_r^{k*})^2 < \tau$, where $\tau$ is a small number that controls the convergence of the process, then stop the process and return the OD flows $t_k$, the link flows $v_a$, and the route choice proportions $p_r^{k*}$. Otherwise, continue with Step 5.
Step 5. Update the route choice proportions and the matrix K using (46) and (47), and go to Step 1. Equation (46) updates the route choice proportion matrix with a relaxation factor $\rho$, $0 < \rho < 1$; (47) updates the matrix K. The values of the variables V are obtained by (42).
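The route-choice part of the procedure (link cost, logit proportions, convergence test, and relaxation) can be sketched on a toy case as follows. This is only an illustration under stated assumptions: one OD pair with two parallel single-link routes, so that route cost equals link cost; a BPR-type link cost with the constants 0.15 and 4; and a relaxation update of the form `(1 - rho) * p + rho * p_star`, which is an assumed form and not necessarily the paper's update rule. All names and numbers are hypothetical.

```python
import numpy as np

def bpr_cost(v, h, q, alpha=0.15, gamma=4.0):
    """Link cost: free-flow cost h inflated by the flow/capacity ratio v/q."""
    return h * (1.0 + alpha * (v / q) ** gamma)

def logit(costs, theta):
    """Logit route choice proportions for one OD pair."""
    z = np.exp(-theta * (costs - costs.min()))   # shift costs for numerical stability
    return z / z.sum()

def relaxed_route_choice(demand, h, q, theta=0.1, rho=0.5, tol=1e-12, max_iter=500):
    """Damped fixed-point iteration on the route choice proportions."""
    p = np.full(len(h), 1.0 / len(h))            # start from an even split
    for it in range(1, max_iter + 1):
        p_star = logit(bpr_cost(demand * p, h, q), theta)   # new proportions
        if np.sum((p - p_star) ** 2) < tol:                 # convergence test
            return p_star, it
        p = (1.0 - rho) * p + rho * p_star                  # assumed relaxation step
    raise RuntimeError("route choice proportions did not converge")

h = np.array([10.0, 12.0])     # hypothetical free-flow costs
q = np.array([60.0, 60.0])     # hypothetical capacities
p_final, n_iter = relaxed_route_choice(100.0, h, q)
print(p_final, n_iter)
```

The relaxation factor damps the oscillation that a direct substitution of the new proportions can cause when costs react strongly to flow.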

Example: The Nguyen-Dupuis Network
In this section, we illustrate the proposed method using the well-known Nguyen-Dupuis network, shown in Figure 2. It consists of 13 nodes, 19 links, and 4 OD pairs: 1-2, 1-3, 4-2, and 4-3. The network data are shown in Table 1, and the associated parameters in (26) are assumed to be $\alpha_a = 0.15$ and $\gamma_a = 4$ for every link.
The assumed true OD matrices, which are used later for testing the quality of the estimation, are shown in Table 4 under the heading "True flow." The true link flows are obtained by solving the multinomial logit assignment model with parameter $\theta = 1.0$ for the stochastic loading.

(48)
The observed link flows are assumed to be $v_5 = 82.57$, $v_7 = 87.38$, $v_{10} = 48.07$, $v_{13} = 58.66$, and $v_{18} = 37.12$, and they are assumed to become known in this order. Since they are observed, their values are equal to the true link flows (as shown in Table 4).
Step 0-Step 1. Initialize and give the prior distribution of all the variables.
Based on the prior information, we can get the prior distribution of the traffic flows. The prior means and variances are shown in the second columns of Tables 2 and 3, respectively. To simplify the calculation, in this example the expectation and variance-covariance of $\varepsilon$ are assumed to be null (i.e., there is no uncertainty in the conservation law). In addition, to obtain the variance-covariance matrix $D_\eta$, we have selected $\nu = 0.1$ in (32).
Step 2-Step 5. Give the posterior distribution by updating the observed link flows one by one. Table 3 shows how the variances of the traffic flows change as the observed link flows are updated one by one. Once some link flows are known (the observed link flows and those derived from them through the conservation law), their means remain constant and their variances become null (boldfaced in the tables). The variances of the unknown variables (OD flows and unobserved link flows) normally decrease with each update. Since smaller variances mean higher precision, the estimation becomes more accurate after a series of updates.
This suggests a method for determining how many links, and which links, need to be observed when estimating traffic flows with the Bayesian network model, that is, for the network sensor location problem (NSLP) [28]. Note that the variance updating equation (22) does not depend on the values of the observed link flows, so the NSLP can be solved without observing any link. First, the Bayesian network model gives the prior distribution of all the variables shown in (19). Next, we select as the first observed link the one whose update reduces the variances of the OD flows the most. Then we update the variances of the traffic flows, determine the second observed link, and iterate until the variances meet the required estimation precision or the budget is exhausted.
From the updated means (point estimation) and variances, we obtain the posterior distribution of the OD flows and the unobserved link flows. Figure 3 illustrates how the marginal densities of the OD flows and the unobserved link flows evolve from their initial form to their final form (boldfaced) as the observed link flows are updated one by one. It can be seen that the variances of the unknown variables normally decrease with each update.
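The greedy link-selection idea described in this section can be sketched as follows. Because the variance update does not involve observed values, only the prior covariance matrix is needed. The function name `greedy_sensor_selection`, the criterion (total reduction of the OD flow variances), and the small covariance matrix are illustrative assumptions, not results from the Nguyen-Dupuis example.

```python
import numpy as np

def greedy_sensor_selection(cov, link_idx, od_idx, budget, tol=1e-12):
    """Pick links one at a time, each time maximizing the drop in total OD variance."""
    cov = np.array(cov, dtype=float)
    candidates = list(link_idx)
    chosen = []
    for _ in range(budget):
        best, best_gain = None, 0.0
        for a in candidates:
            var_a = cov[a, a]
            if var_a <= tol:                     # already determined by earlier sensors
                continue
            gain = np.sum(cov[od_idx, a] ** 2) / var_a   # reduction of the summed OD variances
            if gain > best_gain:
                best, best_gain = a, gain
        if best is None:                         # no remaining link is informative
            break
        var_best = cov[best, best]
        col = cov[:, best].copy()
        cov = cov - np.outer(col, col) / var_best        # variance-only update
        cov[best, :] = 0.0
        cov[:, best] = 0.0
        chosen.append(best)
        candidates.remove(best)
    return chosen, cov[np.ix_(od_idx, od_idx)]

# Hypothetical prior: three independent link flows (variances 1, 4, 9)
# and one OD flow equal to their sum (index 3).
cov0 = np.array([[1.0, 0.0, 0.0, 1.0],
                 [0.0, 4.0, 0.0, 4.0],
                 [0.0, 0.0, 9.0, 9.0],
                 [1.0, 4.0, 9.0, 14.0]])
chosen, od_cov = greedy_sensor_selection(cov0, [0, 1, 2], [3], budget=2)
print(chosen, od_cov)
```

In this toy case the most variable link is selected first, and the remaining OD variance equals the variance of the single link left unobserved.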
In summary, according to Tables 2 and 3 and Figure 3, once some variables are observed, their means remain constant and their variances become null under the proposed Bayesian network method. For the remaining (unobserved) variables, the variances normally decrease with each update. The proposed method also provides a control of the conservation law. The final forms (boldfaced) in Figure 3 are the posterior densities of the OD flows and the unobserved link flows. These posterior densities supply complete statistical information about the unknown variables, from which we can provide the point estimation as well as the corresponding probability intervals.
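Since each posterior marginal is normal, a probability interval follows directly from its mean and variance. The sketch below uses Python's standard library; the function name `probability_interval` and the example mean and variance are hypothetical and are not taken from the paper's tables.

```python
from statistics import NormalDist

def probability_interval(mean, variance, level=0.95):
    """Central probability interval of a normal posterior marginal."""
    if variance <= 0.0:                 # observed variable: the interval collapses to a point
        return mean, mean
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard normal quantile
    half_width = z * variance ** 0.5
    return mean - half_width, mean + half_width

# Hypothetical posterior mean 100 and variance 25 for an OD flow.
low, high = probability_interval(100.0, 25.0)
print(round(low, 2), round(high, 2))
```

A variable whose variance has become null after updating returns a degenerate interval equal to its known value.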
To test the quality of the estimation, Table 4 compares the true flow with the point estimation of the proposed BN model. Because $v_5$, $v_7$, $v_9$, $v_{10}$, $v_{11}$, $v_{13}$, $v_{14}$, $v_{15}$, $v_{16}$, $v_{18}$, and $v_{19}$ are observed or derived from the observed flows, their values are equal to the true flow. For the OD pairs and the unobserved links (boldfaced in the table), the estimation and the true flow are basically the same. The relative errors are all very small, and the maximum relative error of the OD flow estimation is only 4.70%. This illustrates that the proposed BN model has a high accuracy.

Conclusions
In this paper, we use a Bayesian network model to estimate origin-destination matrices based on prior link flows and a set of observed link flows. Normally, large amounts of historical link flows are stored in cities' transportation systems. Compared with an outdated or subjectively guessed prior OD matrix, prior link flows are more accurate, as they are obtained by traffic detectors or manual investigation. The proposed Bayesian network model makes use of these historical link flows and also considers the level of total traffic flow, which is useful for many real-time traffic applications, especially in ITS.
Using the Bayesian network model and updating the observed variables (the observed link flows and those derived from them through the conservation law) modifies the means and reduces the variances of the remaining variables. These updated means and variances give the posterior distribution of the unobserved variables conditioned on the observed ones. Thus, the method provides not only point estimation but also the corresponding probability intervals. In addition, an incremental procedure is developed for solving the Bayesian network model without the intensive computation of matrix inversion, which makes the model easier to apply in large-scale networks.
Moreover, a normal distribution for traffic flows is assumed in this paper. This is reasonable because these random variables are the sum of a great number of independent Bernoulli experiments in which the users decide where to travel and which routes to choose. For future research, it is worthwhile to relax the normal distribution assumption.

Figure 3: Conditional distributions of the OD flows and the unobserved link flows.
In Definition 1, $P = \{p(x_1 \mid \pi_1), \ldots, p(x_n \mid \pi_n)\}$ is a set of $n$ conditional probability densities (CPDs), and $\Pi_i$ is the set of parents of node $X_i$ in G. The set P defines the associated joint probability density (JPD) as the product of these CPDs.

Table 1: Network parameters of the Nguyen-Dupuis network.

Table 2: Point estimation of traffic flows after updating observed link flows one by one.

Table 2 shows how the means of the traffic flows change as the observed link flows are updated one by one. After each update, the point estimation of the link flows $\{v_1, v_2, \ldots, v_{19}\}$ and the OD flows $\{t_1, t_2, \ldots, t_4\}$ is provided. It can be seen that once $v_7$ and $v_{10}$ become known and updated, the point estimation of $v_9$ in Table 2 remains unchanged and its variance in Table 3 becomes zero (boldfaced in the table). This is because, by flow conservation at node 6, $v_9$ becomes known once $v_7$ and $v_{10}$ are known. Similarly, $v_{19}$ becomes known once $v_{13}$ is given; $v_{16}$ becomes known once $v_{19}$ is given; $v_{11}$ becomes known once $v_9$ and $v_{18}$ are given; $v_{15}$ becomes known once $v_{11}$ is given; and $v_{14}$ becomes known once $v_{10}$, $v_{15}$, and $v_{16}$ are given. In other words, due to the conservation law, the link flows become known in this order: $v_5 = 82.57$, $v_7 = 87.38$, $v_{10} = 48.07$, $v_9 = 39.31$, $v_{13} = 58.66$, $v_{19} = 58.66$, $v_{16} = 41.24$, $v_{18} = 37.12$, $v_{11} = 76.43$, $v_{15} = 23.57$, and $v_{14} = 16.84$.

Table 3: Variances of traffic flows after updating observed link flows one by one.

Table 4: The true flow and the point estimation from the proposed method.