Bayesian Network-Based Knowledge Graph Inference for Highway Transportation Safety Risks

Accurate inference of knowledge about highway transportation safety risks is a crucial aspect of building a knowledge graph. Based on data related to highway transportation accidents, this study develops a Bayesian network model. The network nodes are first identified through expert scoring. The network structure is then constructed by combining prior expert knowledge with the K2 greedy search algorithm. Next, the network parameters are trained via the expectation-maximization (EM) algorithm. Finally, knowledge about highway transportation safety risks is inferred using the junction tree algorithm. During network parameter training, the trained conditional probabilities are compared with the actual probabilities; the results accord with expert experience, which supports the validity of the proposed model. Further, taking "a certain road traffic accident" as an example, the main "causal chain" is inferred to be improper emergency response → human failure → accident occurrence, where the probability of driver failure is 82% and the probability of accident occurrence is 68%. The inference results are consistent with the actual accident sequence, which suggests that the proposed knowledge inference method is effective.


Introduction
With the rapid growth of online information, knowledge graphs, which hold abundant knowledge from many domains, have become an active topic across research fields. As a reliable source of comprehensive information and underlying support, they have performed remarkably well in safety risk management in areas such as medicine and finance [1,2]. Highway transportation has always been an important area of mass casualty monitoring. The top priority of safe highway transportation is safety risk management, and building a complete knowledge graph of highway transportation safety risk management is necessary for achieving risk control. Among the key techniques for building knowledge graphs (knowledge extraction, fusion, and inference), knowledge inference is crucial to completing and expanding a graph: it deduces new knowledge or identifies errors in a knowledge graph based on the knowledge the graph already contains. The ultimate aim of highway transportation safety risk management is to predict transportation risks in advance, which can be understood as a risk inference problem in knowledge graphs. Extensive research on knowledge inference has been reported to date:
(1) Inferences based on logical rules: The logical rule-based reasoning process divides knowledge into rules and facts. Inference can proceed by forward, backward, or bidirectional reasoning. The Markov logic network (MLN), whose common inference tasks include maximum a posteriori estimation and conditional probability reasoning, is a particularly fruitful model. Chen et al. [3] applied the MLN to risk management in response to information security issues. First, they constructed an infrastructure dependency network by the expert method, which was then converted into an MLN so that node weights could be calculated via a learning algorithm.
The innovation of the proposed model lies in introducing new network threat factors and calculating their possible impacts on the entire system. In the context of knowledge graph inference, Yang et al. [4] proposed an MLN-based method for joint sentiment analysis of sentences to address the inadequate use of contextual information by existing knowledge inference methods, as well as the cross-domain connection problem of sentence information. They found through experimentation that MLN-based knowledge inference achieved rather desirable results. With respect to modifying and improving MLN models, Liu et al. [5] put forward an ensemble learning-based MLN model and its learning algorithm in response to the difficulty MLNs have in inferring over large-scale data. They conducted a knowledge extraction experiment on Google's large-scale corpus using this method. The experiments showed that their method had higher precision and recall than the pipeline approach.
(2) Bayesian network-based inference models: Bayesian inference is a process of calculating posterior probability based on conditional and prior probabilities. In relation to risk management, Lu et al. [6] proposed a Bayesian network-based model for flood risk inference, aimed at the problem of flood risk misreporting. Here, the Bayesian network was identified by expert scoring, and parameter learning was performed with the Monte Carlo model. These steps helped in analyzing the changes in flood risk under one- and two-factor uncertainties. The Bayesian network they built allowed bidirectional reasoning and probability distribution inference for arbitrary nodes. Experiments showed that their model was practically applicable to risk assessment and control in reservoirs. Based on a qualitative weighted Bayesian network, Yin et al. [7] put forward a human factor inference model for coal mine accidents to address the deficient analysis of human factors in coal mine accident analysis. The Bayesian network, based on typical accident cases, was constructed using the Human Factors Analysis and Classification System (HFACS) model, and the weights between network nodes were obtained by the expert method. Their model reflected the mutual logical relationships between the human factors rather accurately. To trace the evolution of sudden disaster events, Xia et al. [8] proposed a dynamic Bayesian network-based unconventional scenario inference model for such events. They constructed a dynamic Bayesian network with nodes such as the scenario condition, handling goals, handling measures, and intrinsic variables. The prior probabilities were designated for root nodes, while for variables without parent nodes, the expert method was employed to determine conditional probabilities. In the end, the evolution of an accident scenario was inferred for the "July 16 Oil Depot Explosion and Fire," which yielded correct inference results.
Seyed Hassani et al. [9] developed a Bayesian network for inference on knowledge graphs by taking into consideration hidden or ignored information in complex social networks. They identified the nodes of the Bayesian network, such as comments, avatar information, ensemble photos, and interactive information, and trained its parameters using the collected data. Their algorithm was tested on Facebook, where the model proved highly accurate in finding information between users. In terms of model modification and improvement, Rajabi and Ataie-Ashtiani [10] proposed a fuzzy Bayesian inference method to address the unavailability of parameter training data in conventional Bayesian inference. The method fused the information provided by experts with the Bayesian network model. They also developed an algorithm for solving the model. Computational results were compared with those of a Markov chain Monte Carlo (MCMC) based algorithm, which demonstrated the effectiveness of their model and algorithm.
(3) Machine learning-based models: Xie et al. [11] proposed a model that targeted the knowledge completion issue in knowledge graphs based on a bag-of-words model and a convolutional neural network, where the bag-of-words model was used for the vector representation of texts and the convolutional neural network was responsible for classifying and inferring word relationships. They confirmed the validity of their model through experimentation. Annervaz et al. [12] proposed a novel convolution-based model that extracted relevant prior knowledge from the graphs via an attention mechanism. Experiments on public datasets demonstrated that their method was effective in enhancing the performance of deep learning models. Godin et al. [13] developed a ternary reward framework to cope with the incorrect answering problem in existing reinforcement learning-based question answering systems. The framework established a new evaluation criterion by setting different rewards for wrong answers and no answers, thereby achieving a better evaluation of model effectiveness. All of the above problems related to knowledge graphs (completion, fast retrieval, and question answering) can be understood as application scenarios of knowledge inference.
(4) Hybrid models: Jiang et al. [14] developed a method for representing knowledge based on weighted knowledge graphs, which was then combined with the probabilistic graphical model to establish a medical diagnosis knowledge network. Liu et al. [15] proposed a random walk model based on a path sorting algorithm that performed knowledge inference on the semantics of sentences regarding the inverse relationship from object to subject. An exhaustive search was avoided by introducing a random sampling mechanism, and the effectiveness of their knowledge inference model was experimentally verified.
The advantage of logical rule-based reasoning is its intuitive inference procedure, which can reflect prior knowledge; its disadvantage is that rules are difficult to obtain, which leads to error accumulation. The advantage of deep learning-based inference models is their powerful reasoning capability, while their disadvantages are insufficient interpretability and data limitations. Predicting the probability of accident occurrence and estimating the "causal chain" are the foremost tasks of knowledge graph inference for highway transportation safety risks. Given the limited data collected in this study, we approach the task of inferring a knowledge graph about highway transportation safety risks from the Bayesian network perspective. Expert scoring identifies the nodes of the Bayesian network. The node parameters are learned via the EM algorithm, and knowledge about highway transportation safety risks is inferred with the junction tree model. The contributions of this paper include the following: (1) A knowledge inference framework for road transportation risks is proposed, and annotated datasets are provided for the research field. (2) For the identification of the network architecture, a method that combines expert scoring with the K2 algorithm is proposed, which ensures that the network architecture incorporates expert knowledge about road transportation risks. (3) Based on the Bayesian network created in this paper, the probability distributions of accident occurrence are inferred under multiple factors, including the driver, vehicle, road, environment, and management. The remainder of this paper is organized as follows: Section 2 puts forward a model framework for risk knowledge inference, Section 3 builds a Bayesian network-based knowledge inference model for road transportation risks, Section 4 infers knowledge about road transportation risks, and Section 5 concludes the study.

Creation of Risk Knowledge Inference Model
The Bayesian network may be summarized as a probabilistic inference network based on the Bayesian formula, where the nodes represent random variables and the directed edges between nodes represent internode causality. Each node has a probability distribution. A given Bayesian network G(S, P) consists of two parts: S is a directed acyclic graph containing all nodes, and P is a collection of conditional probability distribution tables. Constructing a Bayesian network for inferring highway transportation safety risk knowledge comprises four steps, namely, (i) network node identification, (ii) network architecture identification, (iii) determination of node conditional probabilities, and (iv) inference of safety risk knowledge for highway transportation. Overall, the modeling procedure combines the expert method and machine learning. Figure 1 presents the model framework proposed in this study.
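As a brief illustration of how the two parts of G(S, P) work together, the joint distribution over the nodes factorizes along the graph S into the conditional distributions stored in P. The symbols $V_i$ (the ith node) and $\mathrm{pa}(V_i)$ (the parent set of $V_i$ in S) are notation assumed here for this sketch:

```latex
P(V_1, \ldots, V_n) = \prod_{i=1}^{n} P\bigl(V_i \mid \mathrm{pa}(V_i)\bigr)
```

For a root node the parent set is empty, so its factor reduces to the prior probability of that node.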

Network Node Identification.
At this stage, expert scoring identifies the network nodes. Prior expert knowledge is collected according to the five categories of "drivers, vehicles, roads, environment, and management," thereby constructing a matrix for node information acquisition, as shown in the following expression:

$$E_f = [e_{ij}], \quad f \in [1, 5], \tag{1}$$

where $e_{ij}$ ($0 \le e_{ij} \le 1$) denotes the confidence of the $i$th expert regarding the causality between the $j$th risk and the $f$th category. Matrix $E_f$ is traversed, and if $e_{ij} \ge a$, the $j$th risk and the $f$th category are considered causally correlated.

Network Architecture Identification.
Given the limited amount and quality of training data in this study, the network architecture identification is divided into two steps, namely, construction of the network architecture by the expert method and modification of the network architecture by data-based learning. For network architecture learning, greedy search and conditional restriction algorithms are generally used, of which the methods based on greedy search are suited to small data sizes. The K2 algorithm improves on ordinary greedy search by deleting redundant edges. Hence, for network architecture learning, this study adopts the K2 algorithm, whose core idea is to find a network structure with a high score. The K2 scoring method is described in the following formula [16]:

$$P(G, D) = P(G) \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(N_{ij} + r_i - 1)!} \prod_{k=1}^{r_i} N_{ijk}!, \tag{2}$$

where $G$ represents the Bayesian network architecture, $D$ stands for the training dataset, $i$ denotes the $i$th node, $n$ denotes the number of nodes, $j$ indexes the $j$th value (configuration) of the parent nodes of the current node, $q_i$ is the number of such parent values, $k$ represents the $k$th value of the current node, $r_i$ represents the number of possible values of the current node, $N_{ijk}$ denotes the number of examples in the training dataset $D$ in which the current node takes its $k$th value and its parents take their $j$th value, and $N_{ij} = \sum_{k=1}^{r_i} N_{ijk}$. The pseudocode of the K2 algorithm is given in Algorithm 1 [16].
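The scoring and greedy parent selection described above can be sketched in a few lines. This is a minimal illustration rather than the GeNIe implementation used in the study: it assumes a uniform structure prior P(G) (so that term is dropped), works with the logarithm of the score for numerical stability, and uses hypothetical variable names (`A`, `B`, `C`) and toy data.

```python
import math
from collections import Counter
from itertools import product


def log_k2_node_score(data, node, parents, arity):
    """Log of one node's factor in the K2 score, given a candidate parent set."""
    r = arity[node]
    counts = Counter((tuple(row[p] for p in parents), row[node]) for row in data)
    score = 0.0
    # Loop over every configuration j of the parents (one empty config if none).
    for config in product(*(range(arity[p]) for p in parents)):
        n_ijk = [counts[(config, k)] for k in range(r)]
        n_ij = sum(n_ijk)
        # log[(r-1)! / (N_ij + r - 1)!] + sum_k log(N_ijk!)
        score += math.lgamma(r) - math.lgamma(n_ij + r)
        score += sum(math.lgamma(n + 1) for n in n_ijk)
    return score


def k2_parents(data, node, predecessors, arity, max_parents):
    """Greedily add the predecessor that most improves the score, K2 style."""
    parents = []
    best = log_k2_node_score(data, node, parents, arity)
    while len(parents) < max_parents:
        candidates = [p for p in predecessors if p not in parents]
        if not candidates:
            break
        top_score, top = max(
            (log_k2_node_score(data, node, parents + [c], arity), c)
            for c in candidates
        )
        if top_score <= best:
            break  # no candidate improves the score
        best, parents = top_score, parents + [top]
    return parents


# Toy data (hypothetical): B copies A, while C is unrelated to both.
arity = {"A": 2, "B": 2, "C": 2}
data = [{"A": a, "B": a, "C": c} for a in (0, 1) for c in (0, 1) for _ in range(5)]
chosen = k2_parents(data, "B", ["A", "C"], arity, max_parents=2)
```

On this toy dataset the search keeps `A` as the only parent of `B`, because adding `C` lowers the score.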

Network Parameter Learning.
Advances in Civil Engineering
During network parameter learning, commonly used algorithms include maximum likelihood estimation (MLE), Bayesian estimation, and the expectation-maximization (EM) algorithm, of which MLE is generally suitable for scenarios with a large data size. When the sample size is small and the prior probability is hardly attainable, the EM algorithm yields a good learning effect. Hence, this study adopts the EM algorithm to learn the network parameters. Its procedure can be divided into an E-step and an M-step (see Algorithm 2) [17].
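To make the E-step and M-step concrete, the following is a minimal sketch for a two-node network X → Y with binary variables, where some values of X are unobserved (`None`). The network, the data, and the initial values are all hypothetical and chosen only for illustration; the study itself ran EM in GeNIe on the full network.

```python
def em_two_node(data, iterations=50):
    """Estimate p = P(X=1) and q[x] = P(Y=1 | X=x) when some X values are missing."""
    p, q = 0.5, [0.4, 0.6]  # assumed initial parameter values
    ys = [y for _, y in data]
    for _ in range(iterations):
        # E-step: expected value of X for each record under current parameters.
        weights = []
        for x, y in data:
            if x is None:
                like1 = p * (q[1] if y else 1 - q[1])
                like0 = (1 - p) * (q[0] if y else 1 - q[0])
                weights.append(like1 / (like1 + like0))
            else:
                weights.append(float(x))
        # M-step: re-estimate parameters from the expected counts.
        n1 = sum(weights)
        n0 = len(data) - n1
        p = n1 / len(data)
        q = [
            sum((1 - w) * y for w, y in zip(weights, ys)) / n0,
            sum(w * y for w, y in zip(weights, ys)) / n1,
        ]
    return p, q


# Hypothetical records (x, y); the last two have X missing.
records = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (None, 1), (None, 0)]
p_hat, q_hat = em_two_node(records)
```

With complete data the same loop reduces to ordinary frequency counting after a single iteration, which is why EM is often described as a generalization of MLE to incomplete data.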

Network Inference.
Network inference is the ultimate objective of this study. The joint probability distributions for all nodes are obtained through the network parameter learning stage. At the network inference stage, the probability distributions of a set of query variables (results) are computed given exact values for a set of evidence variables (causes).
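The query described above can be illustrated on a very small network by brute-force enumeration of the joint distribution. This is not the junction tree algorithm, only a reference computation that returns the same exact posterior on networks small enough to enumerate. The three binary nodes (`D` for driver failure, `V` for vehicle failure, `A` for accident) and all probabilities below are hypothetical and are not the values learned in this study.

```python
from itertools import product

# Hypothetical priors and conditional probability table.
P_D = {1: 0.3, 0: 0.7}
P_V = {1: 0.1, 0: 0.9}
P_A1 = {(0, 0): 0.01, (0, 1): 0.4, (1, 0): 0.5, (1, 1): 0.9}  # P(A=1 | D, V)


def joint(d, v, a):
    """Joint probability P(D=d, V=v, A=a) from the network factorization."""
    pa1 = P_A1[(d, v)]
    return P_D[d] * P_V[v] * (pa1 if a == 1 else 1 - pa1)


def query(target, evidence):
    """Posterior distribution of `target` given a dict of observed values."""
    names = ("D", "V", "A")
    totals = {0: 0.0, 1: 0.0}
    for values in product((0, 1), repeat=3):
        assignment = dict(zip(names, values))
        if all(assignment[k] == v for k, v in evidence.items()):
            totals[assignment[target]] += joint(*values)
    z = totals[0] + totals[1]
    return {value: mass / z for value, mass in totals.items()}


predictive = query("A", {"D": 1})  # cause -> effect
diagnostic = query("D", {"A": 1})  # effect -> cause
```

The same function answers both predictive queries (evidence on causes) and diagnostic queries (evidence on the accident node), which is the bidirectional reasoning that Bayesian networks allow.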
Algorithm 1: K2 algorithm [16]. Input: training data D, node sequence ρ, positive integer u (u denotes the number of parent nodes). Here pred(V_i) represents the nodes before V_i, and pa(V_i) is the collection of parent nodes of V_i.

Among common algorithms for network inference, such as junction tree and variable elimination, the former is adopted herein, owing to its easy-to-understand procedure and accurate inference. The junction tree algorithm is divided into four phases:
(1) In the initial phase, the built Bayesian network is moralized; that is, the parent nodes of each node are connected with undirected edges.
(2) The Bayesian network is converted into an undirected graph, where each arrow is replaced by an edge.
(3) The graph is triangulated; that is, an edge is added between variables in the same loop.
(4) The triangulated graph is converted into a clustering tree (junction tree), where each node represents the factors over a subset of the variables.
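Phases (1) to (3) can be sketched as follows. The code moralizes a directed graph and then triangulates it by node elimination, collecting the maximal cliques that would become the nodes of the clustering tree. The five-node structure and the elimination order are hypothetical choices made for illustration; the final phase (linking the cliques into a tree and passing messages) is omitted.

```python
from itertools import combinations


def moralize(parents):
    """Connect each node to its parents and 'marry' co-parents; drop directions."""
    edges = set()
    for child, pas in parents.items():
        for p in pas:
            edges.add(frozenset((p, child)))
        for a, b in combinations(pas, 2):
            edges.add(frozenset((a, b)))
    return edges


def elimination_cliques(edges, order):
    """Triangulate by eliminating nodes in `order`; return the maximal cliques."""
    adjacency = {node: set() for node in order}
    for edge in edges:
        a, b = tuple(edge)
        adjacency[a].add(b)
        adjacency[b].add(a)
    cliques = []
    for node in order:
        neighbours = adjacency.pop(node)
        # Fill-in edges: make the remaining neighbours pairwise connected.
        for a, b in combinations(sorted(neighbours), 2):
            adjacency[a].add(b)
            adjacency[b].add(a)
        for other in neighbours:
            adjacency[other].discard(node)
        clique = neighbours | {node}
        if not any(clique <= existing for existing in cliques):
            cliques.append(clique)
    return cliques


# Hypothetical structure: two driver risks -> driver failure -> accident <- vehicle failure.
parents = {
    "improper emergency response": [],
    "traffic violation": [],
    "vehicle failure": [],
    "driver failure": ["improper emergency response", "traffic violation"],
    "accident": ["driver failure", "vehicle failure"],
}
moral_edges = moralize(parents)
order = ["improper emergency response", "traffic violation",
         "vehicle failure", "driver failure", "accident"]
cliques = elimination_cliques(moral_edges, order)
```

For this structure the moral graph is already triangulated, so no fill-in edges are added and two cliques result, one per family of a child node and its parents.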

Bayesian Network-Based Knowledge Inference Model for Highway Transportation Safety Risks
The purpose of safety risk knowledge inference for road transportation is to achieve "prior" control of unsafe risk factors. However, these risk factors are highly uncertain, and Bayesian networks provide good solutions to complex and uncertain problems [16]. Following this idea, this paper proposes a Bayesian network-based knowledge inference model for road transportation risks, with which the inference problems of safety risk knowledge for road transportation are solved.

Network Node Identification.
Since this research focuses on the problem of risk knowledge inference, our team took specific risks as the network nodes and listed a total of 28 risk sources involving "drivers, vehicles, roads [18,19], environment, and management," based on the descriptions of safety risks in accident reports over the years. We developed a quantitative questionnaire and distributed it among ten experts in the field. When $e_{ij} \ge 0.7$, a causality is considered to exist between the risk and the category. A risk is chosen as an effective node if more than six experts confirm its causality with the category. By employing expert scoring, the network nodes for road transportation risk knowledge were constructed according to formula (1), yielding a total of 26 observable nodes and 5 virtual nodes, as detailed in Table 1.
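The selection rule above (a confidence of at least 0.7, confirmed by more than six of the ten experts) can be sketched as follows. The risk names and scores are hypothetical placeholders, not the questionnaire data collected in this study.

```python
from typing import Dict, List


def select_effective_risks(scores: Dict[str, List[float]],
                           threshold: float = 0.7,
                           min_experts: int = 6) -> List[str]:
    """Keep risks whose causality is confirmed by more than `min_experts` experts."""
    selected = []
    for risk, expert_scores in scores.items():
        confirmations = sum(1 for e in expert_scores if e >= threshold)
        if confirmations > min_experts:
            selected.append(risk)
    return selected


# Hypothetical confidence scores from ten experts for three candidate driver risks.
example_scores = {
    "fatigue driving": [0.9, 0.8, 0.7, 0.75, 0.6, 0.85, 0.9, 0.7, 0.95, 0.5],
    "drunk driving": [0.9, 0.9, 0.8, 0.7, 0.7, 0.8, 0.6, 0.9, 0.7, 0.7],
    "hypothetical weak risk": [0.2, 0.7, 0.4, 0.8, 0.3, 0.1, 0.9, 0.5, 0.6, 0.2],
}
effective = select_effective_risks(example_scores)
```

Each list plays the role of one column of the expert matrix for a given category, so the function is applied once per category.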

Network Architecture Identification.
A combination of the K2 algorithm and expert scoring has been used to identify the Bayesian network architecture. Initially, with the aid of GeNIe 2.0 software, learning of network structure was completed based on the database using the K2 algorithm. Figure 2 displays the learned network architecture. Afterward, modification of the network structure was performed based on expert experience. Figure 3 illustrates the finalized network architecture.
As is clear from Figure 2, the network architecture presents a strong progressive causal relationship that is divided vertically into three layers. The relationship between nodes is clear, and there are no connecting lines between unrelated factors. The nodes are divided into two parts: observable nodes and virtual nodes. The observable nodes are the nodes with actual data description, while the virtual nodes are the nodes added for the integrity of the network.
As is clear from Figure 3, the hierarchy of the network is not obvious, because the learning algorithm further reveals the implicit relationships between the nodes; the overall structure of the network nevertheless remains consistent with expert understanding.

Network Parameter Learning.
Learning of the various network parameters was accomplished via the EM algorithm in GeNIe 2.0. The learning process can be summarized into five steps:
(1) Assignment of initial values for the various root nodes.
(2) Establishment of correspondences between node names and IDs.

Table 2, where N implies normal and F implies failure. Table 2 clearly shows that, under identical conditions, improper emergency response and traffic violations exhibit the highest posterior probabilities among the driver factors, followed by fatigue driving, distracted driving, and drunk driving, while inexperienced driving shows the lowest posterior probability. These results are in line with expert experience and perception.

Inference of Safety Risk Knowledge for Highway Transportation
Using the junction tree algorithm, Bayesian network inference is implemented on highway transportation safety risks. According to the Bayesian network architecture,

Algorithm 2: EM algorithm [17].
Input: observation variable Y, hidden variable Z, joint distribution P(Y, Z|θ), conditional probability distribution P(Z|Y, θ).
Output: model parameter θ.
(1) Assignment of the initial values for the model parameter;
(2) E-step: θ_i is the value of the model parameter after the ith iteration, the calculated function of expectation on the (i + 1)th iteration,

Data Description and Model Evaluation.
Our team collected 600 reports from safety management websites on road transportation accidents that occurred between 2012 and 2019. These accident reports have been sorted into three categories. The risk factors of each accident are marked as either 0 or 1, where 1 indicates that the assessed risk is an accident-causing factor and 0 indicates that it is not. The model evaluation criteria are as follows: (a) Accordance with expert experience and knowledge in the analysis of the effects of relevant factors on accident occurrence. (b) Conformance to the accident state description of the accident reports in the case analysis.
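The 0/1 marking scheme can be illustrated with a short sketch that turns the set of causes identified in a report into a binary record and then computes how often each risk appears as a cause. The node names and the three example reports are hypothetical and do not come from the 600 collected reports.

```python
from typing import Dict, List, Set


def encode_report(causes: Set[str], nodes: List[str]) -> Dict[str, int]:
    """Mark each risk node 1 if the report lists it as a cause, otherwise 0."""
    return {node: int(node in causes) for node in nodes}


def cause_frequencies(records: List[Dict[str, int]], nodes: List[str]) -> Dict[str, float]:
    """Share of reports in which each risk node is marked as a cause."""
    return {node: sum(r[node] for r in records) / len(records) for node in nodes}


# Hypothetical risk nodes and annotated reports.
nodes = ["traffic violation", "improper emergency response", "severe weather"]
reports = [
    {"traffic violation", "severe weather"},
    {"improper emergency response"},
    {"traffic violation", "improper emergency response"},
]
records = [encode_report(causes, nodes) for causes in reports]
frequencies = cause_frequencies(records, nodes)
```

Records in this form are what structure learning and parameter learning consume as training data.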

Effects of Relevant Factors on the Occurrence of Accidents.
The probability distributions of accident occurrence under human, vehicle, road, environment, and management factors, based on the built Bayesian network model for highway transportation safety risks and the inference results, are depicted in Figure 4.
From Figure 4(a), it is clear that, among the causes of accidents, the accident probability is the highest for the driver factors, with an inferential probability of 0.899, followed by the management factors, with an inferential probability of 0.485. In contrast, road factors constitute the least probable factors, primarily because they are indirect causes in general. According to Figure 4(b), the probability of accidents resulting from the failure of drivers, vehicles, and management is quite high, while that caused by the failure of environmental and road factors is rather low. Figure 5 illustrates the influences of drunk driving, fatigue driving, distracted driving, traffic violations, improper emergency response, and inexperienced driving on the driver risks.

Effects of Relevant Factors on the Driver Risks.
From Figure 5, it is clear that, regarding driver factors, the failure probabilities attributed to improper emergency response and traffic violations are comparatively higher at 0.694 and 0.660, respectively. In comparison, the failure probability caused by inexperienced driving is lower. Among the causes of accidents, the occurrence frequencies are the highest for speeding, illegal lane changing, illegal overtaking, illegal parking, and illegal emergency lane parking, all of which can be classified as forms of traffic violation. The probability of an accident is generally high in the case of improper emergency response. The probability of accidents caused by inexperienced driving is rather low because of the strict management of drivers by road transportation companies.

Effects of Relevant Factors on the Vehicle Risks.
The impacts of the braking system, steering system, light signal system, tires, truck overload, and unbalanced load on the vehicle risks are illustrated in Figure 6.
As shown in Figure 6, the failure probabilities attributable to the braking system and tires are the highest among all vehicle factors, at 0.885 and 0.888, respectively. Braking system faults generally refer to braking failure or poor braking caused by improper operation, wear of parts, or inadequate vehicle clearance, while tire faults often refer to the severe wear or blowout of tires. The lowest failure probability is attributable to light signal system faults because accidents can often be avoided in this case as long as the drivers take appropriate emergency measures.

Effects of Relevant Factors on the Road Risks.
In Figure 7, the influences of roadsides, bridges and tunnels, alignment, intersections, pavements, signs and markings, sight distance, and safety protection facilities on the road risks are detailed. From Figure 7, it is clear that the failure probabilities attributable to pavement and alignment problems are the highest among all road factors, at 0.852 and 0.782, respectively. Pavement problems generally refer to slippery surfaces, potholes, and so forth, while alignment problems include sharp bends, steep slopes, and long descents. As these are rather common causes of accidents, the relevant probabilities are also high. In contrast, sight distance is the least likely cause among all road factors because such problems generally involve roadside obstructions or intersections, whose probabilities are rather low according to accident statistics. Figure 8 presents how dynamic monitoring, regulations, and private contracting affect the management risks.

Effects of Relevant Factors on Management Failures.
According to Figure 8, the probability of management problems caused by improper dynamic monitoring is the highest at 0.699. At present, information about fatigued driving, distracted driving, speeding, vehicle trajectories, and so forth can be tracked and forewarned by the dynamic monitoring system for road transport vehicles. A substantial number of accidents can be avoided when the vehicle monitoring system functions properly. Conversely, the occurrence of an accident is often attributable to the improper functioning of the dynamic monitoring system. Regulations are the least likely factors causing management failures. Overall, regulatory problems include inadequate driving management and inadequate vehicle management, which are generally indirect causes of accidents. Figure 9 illustrates how traffic accidents, severe weather, natural disasters, and night lighting affect the traffic environment. From the figure, it is clear that the failure probability attributable to severe weather is the highest among all environmental factors, at 0.666. Severe weather mainly refers to conditions such as heavy rain, fog, snow, or wind warnings, which are the most common causes of traffic environment failures. Meanwhile, given the relatively low incidence of natural disasters, the probability of environmental failures resulting from them is the lowest.

Effects of Relevant Factors on the Traffic Environment.
All the above analysis results accord with expert experience and knowledge, suggesting the effectiveness of the created model.

Case-Based Analysis.
"A certain major road traffic accident" is introduced here to carry out an empirical analysis of knowledge inference. On February 20, 2019, at 19:07 h, an ordinary passenger bus collided with the sidewall of a tunnel on the 2441 km + 100 m section of the G65 Baotou-Maoming Expressway (within a tunnel in Hezuo Village, Lingui District, Guilin), which caused 4 deaths, 3 serious injuries, 16 minor injuries, and severe vehicle damage. The causes of this major road traffic accident were analyzed as follows: (1) Driver factors: after the bus entered the tunnel, the driver exceeded the speed limit, ignoring the signs and markings. In response to the deceleration of the preceding vehicle, the driver applied the emergency brakes, which was an improper emergency response, and thereby lost control of the vehicle. (2) Vehicle factor: during steering, the bus suddenly went out of control, so the possibility of vehicle failure cannot be ruled out. (3) Environment factor: there was a heavy shower at the time of the accident, which made the road surface slippery. (4) Management factors: due to the long-term, nonlocal, out-of-scope operation of the tourist bus, its dynamic vehicle monitoring terminal had been offline.
Based on the above description, the node elements for this accident are detailed in Table 3.
Based on the Bayesian network built in Figure 3, the nodes relating to accident occurrence are mapped into a Bayesian network, followed by knowledge inference in GeNIe 2.0. As shown in Figure 10, the probabilities are updated for each node. Figure 10 clearly shows that the foremost cause of this accident was the human failure attributed to improper emergency response and traffic violations, whose probability was 82%; the second most important cause was the vehicle failure attributable to vehicle fault, with a probability of 58%. In addition, the probability of traffic environment problems resulting from severe weather was 47%, and the probability of management failure resulting from inadequate dynamic monitoring and private contracting was 68%. Ultimately, the probability of occurrence of the accident was 0.68, which is rather high. These results conform to the actual course of the accident, which demonstrates the feasibility of the proposed method.

Conclusions
Highway transportation safety risks involve multiple factors, such as driver, vehicle, road, environment, and management. According to the results of the Bayesian network inference, the primary cause of accidents is driver-related. Meanwhile, management factors are nonnegligible as well. The foremost human factors are improper emergency responses and traffic violations, of which improper emergency responses are often ignored by drivers or managers. Additionally, the foremost vehicle factor is faulty vehicles, which is attributed to the failure of the braking system or tires. Severe weather serves as the foremost factor for the traffic environment. Interference from existing accidents is another factor that is often ignored by drivers or managers.
By using the Bayesian network for inference of knowledge graphs oriented to highway transportation safety risks, the rule acquisition problem with rule-based reasoning, as well as the limitations in data size and quality can be avoided. With powerful reasoning, the above method can preferably be integrated with the transportation safety risk knowledge graphs to achieve flexible inference of accident probability in the dynamic risk coupling scenarios. In the next phase, in view of the dynamic nature of such knowledge graphs, we will probe deeper into the collaborative fusion between the Bayesian network-based knowledge inference and the dynamic knowledge graphs.

Conflicts of Interest
The authors declare that they have no conflicts of interest.